Hangwatch is a free project written mainly in C and shell.
It triggers a system action if a user-defined load average threshold is exceeded.
[qanda]
.Hangwatch FAQ
What does `hangwatch` do?::
Hangwatch periodically polls `/proc/loadavg` and echoes a
user-defined set of characters into `/proc/sysrq-trigger` if a
user-defined load threshold is exceeded.
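
The poll-and-trigger cycle described above can be sketched in C roughly as follows. This is an illustration, not the hangwatch source: the function names are hypothetical, the sketch assumes the 1-minute figure (the first field of `/proc/loadavg`) is the one compared against the threshold, and it writes each sysrq key in its own `write()` call on the assumption that the kernel acts on one key per write. The paths are parameters so the logic can be tried against ordinary files.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Parse the first field (1-minute load average) of a loadavg line.
 * Returns 0 on success, -1 if the line does not start with a number. */
int parse_loadavg(const char *line, double *load)
{
    char *end;
    double value = strtod(line, &end);

    if (end == line)
        return -1;
    *load = value;
    return 0;
}

/* Read the 1-minute load average from `path` (normally /proc/loadavg). */
int read_loadavg(const char *path, double *load)
{
    char buf[128];
    char *line;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    line = fgets(buf, sizeof buf, fp);
    fclose(fp);
    return line != NULL ? parse_loadavg(buf, load) : -1;
}

/* Write each key in `keys` to `path` (normally /proc/sysrq-trigger),
 * one write() per key. Returns the number of keys written, or -1 if
 * the file cannot be opened. */
int fire_sysrq(const char *path, const char *keys)
{
    int written = 0;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    for (const char *k = keys; *k != '\0'; k++) {
        if (write(fd, k, 1) == 1)
            written++;
    }
    close(fd);
    return written;
}

/* One polling step: 1 if the trigger fired, 0 if the load did not
 * exceed the threshold, -1 on error. A daemon would call this in a
 * loop and sleep for the polling interval between calls. */
int poll_once(const char *loadavg_path, const char *trigger_path,
              double threshold, const char *keys)
{
    double load;

    if (read_loadavg(loadavg_path, &load) != 0)
        return -1;
    if (load <= threshold)
        return 0;
    return fire_sysrq(trigger_path, keys) < 0 ? -1 : 1;
}
```

A caller might invoke `poll_once("/proc/loadavg", "/proc/sysrq-trigger", 20.0, "mpt")` once per interval. Nothing in the loop needs `fork()`, which is the property that matters under heavy load.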
Why not just run a script from `cron`?::
When the system is under heavy load, fork() calls may fail
or hang indefinitely, making cron jobs and shell scripts useless.
Wouldn't `loadwatch` be a more accurate name?::
Yes, it would. Many hangs are actually the result of very
rapid load spikes which never show up in logs, because the
applications that would be logging the load spike cannot run.
Calling it `hangwatch` convinces people to run it even if they
haven't seen load spikes with their hangs.
Why not use NMI watchdog?:: NMI watchdog and hangwatch complement each other. NMI watchdog will fire when the system is completely hung, but not when it is simply too bogged down to get any real work done. Hangwatch will fire when the system is too bogged down to get work done, but probably won't help for a hard lockup. System hangs that do not trigger either NMI watchdog or hangwatch are extremely rare in my experience.
How do I make hangwatch start at boot time?:: The RPM installs a SysV-style init script and enables boot-time startup. The RPM also starts hangwatch immediately unless the machine is kickstarting.
To stop the service:
What is an appropriate threshold for `loadavg`?::
In general, the threshold should be at least 5x
the number of CPU cores shown in `/proc/cpuinfo`.
Practically, the threshold should be high enough
that hangwatch does not trigger unless the system
is unresponsive or almost unresponsive due to high load.
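
As a worked example of the rule of thumb above, the lower bound can be computed in C as sketched below. The function names are hypothetical, and the sketch assumes that the online CPU count reported by `sysconf()` matches the number of processors listed in `/proc/cpuinfo`.

```c
#include <unistd.h>

/* Lower bound from the rule of thumb: five times the core count.
 * Returns -1 if the core count is not positive. */
long min_threshold_for_cores(long cores)
{
    return cores > 0 ? 5 * cores : -1;
}

/* The same rule applied to the CPUs currently online on this host.
 * Assumption: sysconf() agrees with a count of /proc/cpuinfo entries. */
long min_threshold_for_this_host(void)
{
    return min_threshold_for_cores(sysconf(_SC_NPROCESSORS_ONLN));
}
```

On a 4-core machine this gives a floor of 20. Treat the result as a starting point only and raise it until hangwatch stays quiet during normal peak load.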
How do I make hangwatch run with higher priority?::
Set the `NICELEVEL` variable in `/etc/sysconfig/hangwatch`.
See `man nice` for details.
How do I make hangwatch run with realtime priority?::
Set the `RTPRIO` variable in `/etc/sysconfig/hangwatch`.
See `man chrt` for details. This variable overrides
the nice level.
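
For readers curious what a realtime priority means at the system-call level, here is a minimal C sketch of what a tool such as `chrt` asks the kernel to do. It is not taken from the hangwatch init script: the function names are hypothetical, and the choice of `SCHED_FIFO` over `SCHED_RR` is an assumption.

```c
#include <sched.h>

/* Return 1 if `prio` is a legal priority for `policy`, else 0. */
int rt_priority_valid(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo < 0 || hi < 0)
        return 0;
    return prio >= lo && prio <= hi;
}

/* Move the calling process to SCHED_FIFO at `prio` (assumed policy).
 * Returns 0 on success, -1 if the priority is out of range or the
 * kernel refuses, typically because the caller lacks root or
 * CAP_SYS_NICE. */
int make_realtime(int prio)
{
    struct sched_param param = { .sched_priority = prio };

    if (!rt_priority_valid(SCHED_FIFO, prio))
        return -1;
    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

A realtime task is scheduled ahead of all normal tasks regardless of their nice values, which is why a realtime setting takes precedence over the nice level.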
How do I make hangwatch run on a particular processor?::
Set the `CPUS` variable in `/etc/sysconfig/hangwatch`.
See `man taskset` for details.
The simplest way to force troublesome applications to other
processors is to boot with the kernel parameter `isolcpus=0`. This
excludes CPU0 from the process scheduler so that applications
are scheduled on other CPUs only. This affects latency
and CPU service time, so consider the performance ramifications
before booting with the `isolcpus` parameter. Other ways to bind
processes to particular CPUs include `taskset(1)`, `numactl(8)`,
and cpusets (see `scheduler_domains.txt` in the kernel-doc).
Where does the output of hangwatch go?::
This depends on your syslog configuration, but generally
it will at least go to the boot console (serial terminals
and netconsole help for recording this) and will also go to
`/var/log/messages` if the system is responsive enough to write to
that file.
How do I create different triggers for different load thresholds?:: The default configuration shipped with the package activates a single instance of hangwatch:
OPTIONS="-s mpt -t 5 -i 1"
If you uncomment the second and third options, then restart hangwatch, you can check status such as:
How can I collect a vmcore with hangwatch?::
Put `c` at the end of your string of sysrq characters.
The sysrq-c trigger causes a controlled kernel panic, which should
allow netdump, diskdump, kdump, etc. to work properly.
Where can I get the source for hangwatch?::
You can use `git` to clone the hangwatch repo on GitHub:
You can also download one version of the source from http://people.redhat.com/astokes/hangwatch/
Do I have to compile it myself?:: An RPM provides a compiled version of hangwatch and installs configuration files for you. You can use the SRPM to compile hangwatch for other architectures if desired. One version of the RPM can be downloaded from http://people.redhat.com/astokes/hangwatch/.
`tito`, which is in Fedora. The latest version of tito is at
http://github.com/dgoodwin/tito
Will you please add feature X to hangwatch?:: Maybe. Because of the circumstances under which it operates, hangwatch must always be a very lightweight program. After startup, it can't do anything that might have problems under heavy load. That said, there are lots of other interesting pseudofiles it could potentially work with. Patches are welcome.