Software self healing pattern, part 3

Software healing component #3: The self healing program

In the first part of this tutorial I wrote about a worker program, and the functionality it needs to provide to support a self-healing architecture. In the second part I wrote about an open source program named Nagios, which can be used to monitor your worker programs. In this third part of the self-healing recipe, I will cover what is necessary for the "healing" component.

The self healing component

In every self-healing architecture I've worked on over the last nine months, in addition to deploying my worker programs on servers, I have also deployed a second, much smaller application: in its simplest form, this is a program that knows how to restart the worker.

For example, here's a Ruby script that I run under xinetd on a Linux server. It provides an interface to stop, start, or restart one of my file-moving programs:

#!/usr/bin/ruby

#-------------------------------------------------------
# PROGRAM: command-listener.rb
# PURPOSE: Listen for remote commands, and execute them. 
#-------------------------------------------------------

# the commands we listen for
FOO_STOP_COMMAND  = 'foo-stop'
FOO_START_COMMAND = 'foo-start'
FOO_RESTART_COMMAND = 'foo-restart'

# the actions we allow

foo_stop_action  = '/opt/Foo/bin/stop.sh'
foo_start_action = '/opt/Foo/bin/start.sh'
foo_restart_action = '/opt/Foo/bin/restart.sh'

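# main: xinetd connects the client socket to stdin, so read one line from it.
# gets returns nil if the client disconnects without sending anything;
# to_s turns that into an empty string, so args is simply empty.
args = gets.to_s.split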

if args[0] == FOO_STOP_COMMAND
    system(foo_stop_action)
elsif args[0] == FOO_START_COMMAND
    system(foo_start_action)
elsif args[0] == FOO_RESTART_COMMAND
    system(foo_restart_action)
else
    # got something else; do nothing
    puts "Er, thanks for calling!"
end

exit

That script is pretty easy, isn't it? This program serves as a "command listener", listening for remote calls that tell it to stop, start, or restart the worker program. In this case the actions behind those commands are shell scripts named stop.sh, start.sh, and restart.sh, respectively. I won't show those scripts here, because they are specific to my applications, but I will say that they are fewer than five lines of code each, excluding comments.
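To make the "remote call" side concrete, here is a minimal sketch of a client that sends one command to the listener above. The helper name send_command, the host name, and the port number are all assumptions for illustration; use whatever host and port you configure for the listener.

```ruby
require 'socket'

# Hypothetical helper (not part of the original script): send one command
# line to the command listener and return whatever it prints back.
def send_command(host, port, command)
  TCPSocket.open(host, port) do |socket|
    socket.puts(command)   # the listener reads this line with gets
    socket.close_write     # tell the listener we have nothing more to send
    socket.read            # output of the listener or the script it ran
  end
end

# Example call (host and port are made up):
#   puts send_command('worker1.example.com', 9099, 'foo-restart')
```

Sending anything other than foo-stop, foo-start, or foo-restart should just get the listener's "Er, thanks for calling!" reply.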

On a Linux server, all you have to do to set up this script is:

  1. Install command-listener.rb wherever you want it to live.
  2. Configure inetd or xinetd to listen on a port, and run this script when a connection comes in on that port.
  3. Add an entry to the /etc/services file that maps your service name to that port.

I don't have access to my Linux servers at this moment, but hopefully these steps will make sense to you, or to your Linux system administrator. They really are quite simple.

One major caveat

One major caveat here:

This is not a secure approach, so don't use it on the internet. The internet is a dangerous place to run public services, so you really, really, really want to take a much more secure approach.

The solution presented here is a demo, and may be suitable for some private servers not connected to the internet (such as those secured behind a firewall, on a secure network), but even then you can and should take other steps to tighten it down, such as limiting remote commands to only come from certain IP addresses, or using certificates on the clients and servers. In my case, my co-workers didn't think this was necessary, so we went with something close to what I showed above.
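As one illustration of the IP address restriction mentioned above, here is a small sketch of an allowlist check using Ruby's standard ipaddr library. The network values and the names ALLOWED_NETWORKS and allowed_client? are made up for this example, and how the listener obtains the client's address is deliberately left out. (xinetd also has an only_from attribute that can enforce the same kind of restriction at the service level.)

```ruby
require 'ipaddr'

# Hypothetical allowlist: only these hosts/networks may send commands.
ALLOWED_NETWORKS = ['10.0.0.0/8', '192.168.1.50'].map { |n| IPAddr.new(n) }

# Returns true when client_ip falls inside one of the allowed networks.
def allowed_client?(client_ip, networks = ALLOWED_NETWORKS)
  address = IPAddr.new(client_ip)
  networks.any? { |network| network.include?(address) }
rescue IPAddr::Error
  false   # an address we cannot parse is refused
end
```

The listener would call allowed_client? before looking at the command, and do nothing when it returns false.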

Web services and web applications

As a quick note, you can do the same thing with web applications and web services, and I have been doing this for years. Just create a "status" servlet (or JSP) that is capable of testing the entire system, and have it return a status code.

As a very simple example, your Nagios agent calls the servlet, and the servlet performs a simple query against your database(s). If everything runs fine, the servlet returns a status of zero. If something fails, it can return a different status and an error message. If the call fails outright (say, something is completely wrong with your web container), well, that can also be handled by your Nagios agent, can't it?
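Here is a rough sketch of what the Nagios side of that check might look like in Ruby. It assumes, purely for illustration, that the status servlet answers with a plain-text body whose first token is the status and whose remainder is a message (for example "0 OK" or "1 database unreachable"); your servlet's format may well differ. The exit codes are the standard Nagios plugin return codes; the function names are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Standard Nagios plugin exit codes.
NAGIOS_OK       = 0
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN  = 3

# Map the servlet's response body to [exit_code, message].
# Assumed body format: "<status> <optional message>".
def interpret_status(body)
  status, message = body.to_s.strip.split(' ', 2)
  case status
  when '0' then [NAGIOS_OK, "OK - #{message || 'all checks passed'}"]
  when nil then [NAGIOS_UNKNOWN, 'UNKNOWN - empty response']
  else          [NAGIOS_CRITICAL, "CRITICAL - status #{status}: #{message}"]
  end
end

# Call the status URL; a failed call (container down, HTTP error) is CRITICAL.
def check_status_url(url)
  response = Net::HTTP.get_response(URI(url))
  unless response.is_a?(Net::HTTPSuccess)
    return [NAGIOS_CRITICAL, "CRITICAL - HTTP #{response.code}"]
  end
  interpret_status(response.body)
rescue StandardError => e
  [NAGIOS_CRITICAL, "CRITICAL - #{e.class}: #{e.message}"]
end
```

A plugin script built on this would print the message and exit with the code, which is how Nagios learns the result.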

Using a "restart" as a self-healing mechanism

To date I have always been able to use a software restart as a self-healing mechanism. I keep trying to come up with more complicated schemes, but in the end, a simple restart usually fixes the problems that are fixable.

I certainly recommend that you investigate your own problems to see if there is something you can do other than performing a restart, but there's no way I can tell you what to do. The list of problems in the world is infinite, and so is the list of possible solutions. :)

Software self healing - summary

In this series I've presented my recipe for a simple self healing software architecture. In short, all you have to do is:

  1. Develop your worker programs so they can provide status information to remote monitors.
  2. Deploy a second application with your worker program that is capable of "healing" the worker, often with something as simple as a software restart.
  3. Use something like Nagios to monitor your workers, and send a "healing" command when a worker has gone out of commission. The healing command I've shown has been a simple restart command, but because you have complete control, you can make your command as complicated and powerful as desired.
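As a sketch of step 3, here is one way the "when do I send the healing command?" decision could look if you trigger it from a Nagios event handler. It assumes the handler is passed the $SERVICESTATE$ and $SERVICESTATETYPE$ macros as arguments; the policy shown (restart only on a confirmed, HARD CRITICAL state) and the function name are my assumptions for this example, not the only reasonable choice.

```ruby
# Hypothetical policy: decide which command, if any, to send to the
# command listener for a given Nagios service state.
#   state      - "OK", "WARNING", "UNKNOWN", or "CRITICAL"
#   state_type - "SOFT" (still being rechecked) or "HARD" (confirmed)
def heal_command_for(state, state_type)
  return 'foo-restart' if state == 'CRITICAL' && state_type == 'HARD'
  nil   # anything else: leave the worker alone
end
```

Waiting for a HARD state avoids restarting a worker over a single failed check; the returned command is what you would then send to the command listener.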

Resources

  1. Software self healing, part 1 (Introduction, and Worker Program)
  2. Software self healing, part 2 (Software Monitoring)
  3. Nagios
  4. Nagios plugin return codes