Process Monitoring with ps-watcher
Originally posted on Linux.com on September 28, 2008.
You can monitor your computers in a wide variety of ways. Large proprietary applications make sense for large installations that can afford the expense of both the software and consultants who fine-tune the systems. Open source monitoring solutions like Nagios or OpenNMS cost nothing to acquire but still require planning and tweaking. When you need to address smaller problems with process data on a system, the process monitoring tool ps-watcher comes in handy.
ps-watcher is an example of a Unix tool that does one thing and does it well. It allows you to access all the process information on a system and make choices based on that information. ps-watcher provides a consistent interface (with some caveats) to the varied process information available on different Unix and Linux machines.
Although ps-watcher may not come with your system, installing it is straightforward. Download the source code and do the usual configure; make; make install routine. You don't have to install or run ps-watcher as root, although depending on the actions you want to perform you may need to either run it as root or use sudo. Any user can run ps-watcher for simple monitoring and alerting. Because it doesn't have any prerequisites, ps-watcher should install and work on just about any Unix-like system.
Once you have ps-watcher installed, the next step is to create a configuration file that contains the rules you want to use for monitoring or acting on what ps-watcher reports, based on the output of the ps command. Along with the supplied configuration file, you can use command-line options to fine-tune the behavior of ps-watcher and assist in debugging.
A simple example
ps-watcher can stop machines from crashing due to memory exhaustion, or make sure that a buggy program isn't leaving too many copies of itself running. One basic use for ps-watcher is ensuring that a certain program is running. Say that you want to make sure your CentOS Linux server always has one copy of the Network Time Protocol daemon ntpd running to maintain accurate time. Keeping ntpd running can be a challenge; for example, it exits on startup if the system time is too far from the reference time. One way to keep ntpd running is to use ps-watcher to ensure that at least one ntpd process is running at all times. Create a ps-watcher.cfg file with these contents:
[ntpd]
occurs = none
action = /etc/init.d/ntpd restart
The first line, in brackets, is a regular expression that is matched against the cmd field in the ps output. Note that because of the way ps-watcher is written, you can't match on the same process name more than once in a configuration file, even though you may want to when you need both to check that a particular process is running and to perform an action if the process meets your criteria; we'll see how that works in a moment. The workaround for this case is to match on slightly different regular expressions, such as "ntpd" or "ntp." The second line of the configuration file (occurs = none) tells ps-watcher to execute the action line if no process names match. The occurs line controls how often the action line gets executed: none means execute the action if no matches are found, while every means execute the action for every match. See the ps-watcher man page for other possible values.
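To make the two occurs values concrete, here is a rough shell sketch of the decision made on each pass. It is an illustration only, not ps-watcher's implementation: the function name occurs_decision is hypothetical, the command names are fed in on standard input instead of being read from ps, and grep -E stands in for the regular expression match.

```sh
#!/bin/sh
# Illustrative sketch only; not ps-watcher's real code.
# Usage: <command names on stdin> | occurs_decision PATTERN MODE
# Prints how many times the action would run on this pass.
occurs_decision() {
    pattern=$1
    mode=$2
    # Count input lines matching the extended regular expression.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    matches=$(grep -Ec -- "$pattern" || true)
    case $mode in
        none)   # fire once, and only when nothing matched
            if [ "$matches" -eq 0 ]; then echo 1; else echo 0; fi ;;
        every)  # fire once per matching process
            echo "$matches" ;;
        *)
            echo "unsupported mode: $mode" >&2
            return 2 ;;
    esac
}

# Canned process lists instead of live ps output:
printf 'sshd\ncrond\n' | occurs_decision 'ntpd' none     # prints 1
printf 'foo\nfoo\nbash\n' | occurs_decision 'foo' every  # prints 2
```

In the first call no line matches ntpd, so a none rule would fire once (the restart). In the second, two lines match foo, so an every rule would fire twice.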
To use this configuration, start ps-watcher (as root) and specify the configuration file with the command ps-watcher --config ps-watcher.cfg. ps-watcher automatically daemonizes itself and re-runs on a regular basis (every five minutes by default). To ensure that ps-watcher is always running, you can add a line to /etc/rc.local to start it on every boot.
A more complex example
The power of ps-watcher is its flexibility. You can define any condition to match and execute any action when a match is found (or not found). This makes ps-watcher an excellent all-purpose system monitoring tool. For example, I was recently struggling with a FreeBSD computer where a program's child processes would occasionally fail to exit. As a result the system load would slowly increase over time until the system stopped responding. The permanent fix would probably involve a combination of upgrades to fix the buggy software. However, since this was a production server, I needed a Band-Aid that would prevent the system from failing until I could schedule downtime for software upgrades.
My naive first guess was that I could monitor the amount of CPU time consumed by the broken child processes and kill any that exceeded an arbitrary limit of one hour. Unfortunately this didn't work because the parent and child processes had the same name, so the rule matched both. The parent process was always running, and thus quickly accumulated more than one hour of CPU time and got killed by ps-watcher.
I then realized I could match on both the process id (PID) and the parent process id (PPID). Since the master process is a background daemon it always has init for a parent (PPID of 1). The child processes would have a PPID of the master process. That gave me the following ps-watcher configuration file (assume the master and child processes are named "foo"):
[foo]
occurs = every
trigger = elapsed2secs('$time') > 1*HOURS && $ppid != 1
action = <<EOT
echo "$command accumulated too much CPU time" | /bin/mail user\@host
kill -TERM $pid
EOT

[foo?]
occurs = none
action = /usr/local/etc/foo restart
This configuration uses several advanced ps-watcher features. It combines two trigger conditions, including the ps-watcher built-in function elapsed2secs. The action line uses a shell script here-document to perform multiple actions. Finally, the example includes an additional check that at least one master process is always running and restarts it if none is found.
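The trigger logic can be tried outside ps-watcher with a small shell sketch. It assumes ps reports CPU time as [[DD-]HH:]MM:SS and truncates any fractional seconds. The helper names time_to_secs and over_limit are hypothetical stand-ins for the example, not ps-watcher's own elapsed2secs code.

```sh
#!/bin/sh
# Illustrative sketch only; assumes a [[DD-]HH:]MM:SS time string.
time_to_secs() {
    printf '%s\n' "$1" | awk '{
        days = 0
        t = $1
        # Optional leading "DD-" day count.
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, f, ":")
        secs = 0
        # Fold fields left to right; each step multiplies by 60.
        for (i = 1; i <= n; i++) secs = secs * 60 + int(f[i])
        print days * 86400 + secs
    }'
}

# Rough equivalent of the trigger: more than one hour of CPU time
# and not a direct child of init (PPID 1).
# Usage: over_limit CPU_TIME PPID
over_limit() {
    [ "$(time_to_secs "$1")" -gt 3600 ] && [ "$2" -ne 1 ]
}

time_to_secs 01:02:03       # prints 3723
time_to_secs 2-00:00:10     # prints 172810
if over_limit 01:30:00 4242; then
    echo "would kill"       # reached: 5400 seconds, parent is not init
fi
```

With a PPID of 1 the same CPU time does not trip the check, which is what protects the master process in the configuration above.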
The flexibility of ps-watcher can lead to some frustration. For instance, because the action line is interpreted in Perl, you have to properly quote special characters. In the example above I had to put a backslash in front of the '@' in the mail address; otherwise, Perl would treat @host as an array variable and interpolate it away. Read the documentation (the man page) carefully before using ps-watcher so you understand other limitations, such as the fact that you can't match on the same command more than once.
Luckily most ps-watcher issues can be resolved by turning on its debugging mechanism. You can use the following command to launch ps-watcher in debug mode:
ps-watcher --debug 1 --nodaemon --log --config ps-watcher.cfg
This will stop ps-watcher from going into the background on startup and will print debug messages directly on your console. That should be enough to isolate and fix any problems with your configuration. Things like incorrect quoting in your configuration file will often lead to warnings in the debug output.
Remember that ps-watcher directly presents the native ps output, and this varies substantially between different operating systems, such as FreeBSD and Linux. The command ps-watcher --help will display the list of valid variables for your platform. One way this confused me initially was that on FreeBSD $time is the accumulated CPU time for a process, while on Linux the accumulated CPU time is $bsdtime. At first glance on Linux it might seem that $etime is the equivalent of the FreeBSD $time, but it is actually the elapsed wall time since the process started.
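One way to see the difference between elapsed time and CPU time is to ask ps for both directly. The sketch below uses the etime and time output keywords that POSIX defines for ps -o; the function name show_times is hypothetical. It queries ps itself, not ps-watcher, so the variable names ps-watcher exposes on your platform may still differ.

```sh
#!/bin/sh
# Print PID, elapsed wall-clock time, and accumulated CPU time for one
# process. The trailing "=" on each keyword suppresses the header line.
# Usage: show_times PID
show_times() {
    ps -o pid= -o etime= -o time= -p "$1"
}
```

Running show_times $$ in a shell that has been open but idle for a while should show a large elapsed time next to a very small CPU time.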
When you just need to fix one or two small problems, a comprehensive system monitoring tool can be overkill. When you need to monitor and react to process information on a machine, ps-watcher is a good tool to reach for first.