A software self healing pattern (part 1)

Over the last 18 months I've been working with a 24x7 manufacturing group, and no matter what I say, they always have the same two requests/demands:

  1. The software system must not fail, and
  2. If it does fail for some reason, it needs to be able to recover properly from the failure.

Simply put, (a) the machines must keep moving, and (b) nobody wants the phone call in the middle of the night when the machines stop moving.

Software self healing pattern, part 3

Software healing component #3: The self healing program

In the first two parts of this tutorial I wrote about a worker program, and the functionality it needs to provide to support a self-healing architecture. In the second part I wrote about an open source program named Nagios, which can be used to monitor your worker programs. In this third part of the self-healing recipe, I will now write about what is necessary for the "healing" component.

Software self healing pattern, part 2

Component #2: The software monitoring program

The second piece of the self-healing architecture is to have a program that monitors your worker programs. You can write your own software monitoring program, as I did many years ago, but that's pretty old-school, and frankly, not a good idea any more.