Hello readers of 2coffee.dev, it's been a while since we last met. A week or two ago, I ran into quite an interesting problem while deploying a system. At first I wasn't going to write about it, but then I realized that someone else might face the same situation, so I wrote it down. It also serves as a record to remember and share with everyone.
The system I am in charge of includes a rather old service deployed with pm2 on a GCP VM. I call it old because it has been running for a long time, since before I took over, and has not been updated since. All of its features are stable, and it is only maintained for a certain user base. That would not be a problem if the number of users had not suddenly increased recently, or perhaps, for some reason, users running complex logic now occasionally overload the system. CPU spikes, RAM climbs to a certain level... Boom! The server crashes.
This VM has only modest resources: 1 CPU and 2GB of RAM. So when CPU or RAM suddenly spikes, it freezes and I cannot even SSH into it. Realizing the issue, I immediately set out to find a solution. Upgrading the server resources was the obvious first option, but in practice it did not help; the server still "hung" at unpredictable times. I couldn't fix the underlying error right away either, because resources are limited and we still have many other tasks with higher priority. At this point, the most feasible thing I could think of was to limit the resources this service can use.
Fortunately, pm2 has a memory limit feature. When the limit is set, pm2 automatically restarts the process once its memory usage reaches the limit, which frees the memory. Running out of memory is very dangerous on a VM because it causes the server to freeze, making it very difficult to do anything, including SSH into the server to troubleshoot.
The setup is very simple. Just run a command.
pm2 start api.js --max-memory-restart 300M
Here, --max-memory-restart is the memory limit. pm2 checks memory usage every 30 seconds and restarts the service if it has exceeded the limit.
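The same limit can also live in a configuration file instead of a command-line flag. The following is a minimal sketch of a pm2 ecosystem file, assuming the file is named ecosystem.config.js and the app is called api (both names are my assumptions, not taken from the setup described here). max_memory_restart is the config-file counterpart of --max-memory-restart.

```javascript
// ecosystem.config.js (hypothetical file name and app name)
const config = {
  apps: [
    {
      name: "api", // assumed app name
      script: "api.js",
      // Same meaning as --max-memory-restart 300M on the command line
      max_memory_restart: "300M",
    },
  ],
};

module.exports = config;
```

With a file like this, the service would be started with pm2 start ecosystem.config.js, so the limit does not have to be retyped on every start.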
I thought that limiting memory would solve everything, but upon further monitoring, another problem arose: CPU also spiked.
pm2 has no feature for limiting the CPU usage of a service. If you want to impose a limit, you need another tool. With Docker, for example, resource limits are available as built-in configuration settings. Very convenient. After some searching, I found cpulimit, a standalone tool that limits the CPU usage of a process.
Each service in pm2 runs in its own process. When you run pm2 ls, you will see a column titled PID, which is the process ID of that service. Running ps -fp PID then shows detailed information about that process.
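Reading the PID off the table works for a one-off, but a script needs it in machine-readable form. As a rough sketch, pm2 jlist prints the process list as JSON, and a small helper can pull the PID out of it. The helper name findPid and the sample data are hypothetical, and the sample is trimmed to the few fields used here.

```javascript
// Parse the JSON printed by `pm2 jlist` and return the PID of a named service.
// Returns null when the service is not in the list or has no PID.
function findPid(jlistOutput, serviceName) {
  const processes = JSON.parse(jlistOutput);
  const match = processes.find((p) => p.name === serviceName);
  return match && match.pid ? match.pid : null;
}

// Sample shaped like `pm2 jlist` output, trimmed to the fields used here
const sample = JSON.stringify([
  { pm_id: 0, name: "worker", pid: 2210 },
  { pm_id: 1, name: "api", pid: 4312 },
]);

console.log(findPid(sample, "api")); // 4312
```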
Using cpulimit is relatively simple. After installation, use the command.
$ cpulimit -p PID -l 80 -b
Here, PID is the PID of the process to limit, -l is the maximum CPU usage as a percentage, and -b runs cpulimit itself in the background. cpulimit keeps CPU usage from exceeding the set limit, so during peak hours the server may respond more slowly than usual.
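If the command is going to be launched from a script, it helps to build and validate the arguments in one place. The sketch below is only an illustration: cpulimitArgs is a hypothetical helper, and the upper bound assumes cpulimit counts 100 per CPU core, so on the 1-CPU VM described here the maximum would be 100.

```javascript
const os = require("node:os");

// Build the argument list for `cpulimit -p PID -l LIMIT -b`.
// Assumption: the limit is a percentage where each core counts for 100.
function cpulimitArgs(pid, limit, cores = os.cpus().length) {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new RangeError(`invalid pid: ${pid}`);
  }
  if (!Number.isInteger(limit) || limit < 1 || limit > 100 * cores) {
    throw new RangeError(`limit must be between 1 and ${100 * cores}`);
  }
  // child_process.spawn expects every argument as a string
  return ["-p", String(pid), "-l", String(limit), "-b"];
}

console.log(cpulimitArgs(4312, 80, 1).join(" ")); // -p 4312 -l 80 -b
```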
I thought that after setting both limits, I could sleep well, but no, a new problem arose.
Each time pm2 restarts the service, the PID of the process changes. One might try to pin the PID, but that is impossible because the operating system assigns it. Besides targeting a PID, cpulimit can also select a process by other criteria such as the executable file path; however, none of my attempts with those worked. Just when I thought I was at an impasse, I remembered that pm2 has an advanced feature called the PM2 API.
The PM2 API is a programmatic interface to pm2 that lets you control this process manager from code. One of its capabilities is listening to events from the processes pm2 manages. Simply put, it works like a hook: you can listen for each emitted event and run a related task. Applied to this case, each time the service restarts, we listen for the event and rerun the cpulimit command to set the limit again.
The implementation is straightforward; readers can refer to the js file I wrote below.
const pm2 = require("pm2");
const { spawn } = require("node:child_process");
const fs = require("node:fs");

// pm2 process ids to limit, with the CPU percentage passed to `cpulimit -l`
const PM_CONFIGURATIONS = [{ pm_id: 1, cpu_limit: "80" }];

pm2.connect((err) => {
  if (err) {
    console.error("PM2 connect error:", err);
    process.exit(2);
  }

  pm2.launchBus((err, bus) => {
    if (err) {
      console.error("PM2 launchBus error:", err);
      process.exit(2);
    }
    console.log("PM2 launchBus");

    bus.on("process:event", (data) => {
      // Only consider start, restart, and online events
      if (!["start", "restart", "online"].includes(data.event)) return;

      const { pm_id, name } = data.process;
      let pid = data.process.pid;
      if (!pid) {
        // Fall back to the pid file at pm_pid_path; trim the trailing whitespace
        const pm_pid_path = data.process.pm_pid_path;
        pid = fs.readFileSync(pm_pid_path, "utf8").trim();
      }
      console.log(`Event=${data.event} name=${name} pm_id=${pm_id} pid=${pid}`);

      // Find corresponding configuration
      const config = PM_CONFIGURATIONS.find((config) => config.pm_id === pm_id);
      if (config) {
        // Apply cpulimit if configuration found
        console.log(`→ Applying cpulimit ${config.cpu_limit}% for PID=${pid}`);
        // Pass the pid as a string, since spawn takes string arguments
        const child = spawn("cpulimit", ["-p", String(pid), "-l", config.cpu_limit, "-b"]);
        // Log instead of crashing if cpulimit cannot be started (e.g. not installed)
        child.on("error", (e) => console.error("cpulimit spawn error:", e));
      } else {
        // Do nothing if configuration not found
        console.log(`→ Skipping pm_id=${pm_id}`);
      }
    });
  });
});
PM_CONFIGURATIONS lists the services to watch. Each time one of them restarts, the script looks up its newly assigned PID and runs the cpulimit command to limit its CPU usage.
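One possible refinement: the script reacts to three event names, and I have not verified whether pm2 emits more than one of them for a single restart. If it does, cpulimit would be launched several times for the same PID. The sketch below shows one way to guard against that by remembering the last PID limited per pm_id. createLimiterGuard and shouldLimit are hypothetical names, not part of the PM2 API.

```javascript
// Remembers the last PID limited for each pm_id, so that several events
// for the same process instance result in a single cpulimit launch.
function createLimiterGuard() {
  const lastPidById = new Map();
  return function shouldLimit(pm_id, pid) {
    // Compare as strings: the pid may be a number or text read from a pid file
    const key = String(pid);
    if (lastPidById.get(pm_id) === key) return false;
    lastPidById.set(pm_id, key);
    return true;
  };
}

const shouldLimit = createLimiterGuard();
console.log(shouldLimit(1, 4312)); // true: first time this PID is seen
console.log(shouldLimit(1, "4312")); // false: same PID, already limited
console.log(shouldLimit(1, 5120)); // true: the service restarted with a new PID
```

In the event handler, the spawn call would then only run when shouldLimit(pm_id, pid) returns true.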
In this article, I have shared how to control the resource usage of services on pm2 in a resource-limited environment. First, limiting memory with the --max-memory-restart parameter reduces the risk of running out of memory and crashing the server, since the service restarts automatically when it reaches the limit. When high CPU usage becomes the problem, the cpulimit tool can cap CPU usage for a specific process. However, the PID changes each time the service restarts, which presents a new challenge.
To overcome this, I used the PM2 API to listen for events such as a service start or restart, pick up the new PID, and reapply cpulimit automatically. It is a practical approach that I hope is useful to anyone facing a similar issue when managing resources on pm2.