At my company, we use a form of Continuous Monitoring: every time our system logs a warning or an error we immediately receive an email identifying the source and nature of the problem. This allows us to respond rapidly to problems as they arise and gives us good visibility into the health of our system. Following the mantra of “do in test as is done in prod”, we have the same monitoring system set up in both environments to help us find issues in test before they find their way into production.

The downside to this level of monitoring is that it can amount to a lot of messages. Our challenge is to manage the signal-to-noise ratio so that:

  • we are only notified about things that require immediate action;
  • we don’t suffer from information overload; and
  • emails that matter aren’t buried under a bunch of emails that don’t.

As part of our 5 Whys activity for production issues, we have found that most production issues actually occurred first in test but went unnoticed. This is a compelling reason to keep the signal-to-noise ratio high in all environments. Any time we find ourselves automatically archiving or filtering an alert, we treat it as an opportunity for improvement.

We have found that refining and tuning these alert messages is an ongoing maintenance activity. As part of our weekly meeting, we try to select one message to clarify or dispatch each week. We have a script that trawls the support emails received in the past week and builds a Pareto distribution of the number of messages by logger. This helps us decide where to focus our efforts and lets us quantify the impact of our actions on the volume of messages we receive.
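As a rough illustration of that kind of report, here is a minimal sketch in Java. It is not our actual script: the class and method names are hypothetical, and it assumes the logger name has already been extracted from each email (how to do that depends on the email format). It counts messages per logger, sorts from noisiest to quietest, and prints the cumulative share of the total volume.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical.
final class LoggerPareto {

    /** Counts messages per logger, ordered from the noisiest logger to the quietest. */
    static LinkedHashMap<String, Long> countByLogger(List<String> loggerNames) {
        Map<String, Long> counts = new HashMap<>();
        for (String logger : loggerNames) {
            counts.merge(logger, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken by logger name so the order is deterministic.
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        LinkedHashMap<String, Long> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : sorted) {
            ordered.put(entry.getKey(), entry.getValue());
        }
        return ordered;
    }

    /** Formats one line per logger: name, count, and cumulative share of all messages. */
    static List<String> report(LinkedHashMap<String, Long> ordered) {
        long total = 0;
        for (long count : ordered.values()) {
            total += count;
        }
        List<String> lines = new ArrayList<>();
        long running = 0;
        for (Map.Entry<String, Long> entry : ordered.entrySet()) {
            running += entry.getValue();
            double cumulative = 100.0 * running / total;
            lines.add(String.format(Locale.ROOT, "%s %d %.1f%%",
                    entry.getKey(), entry.getValue(), cumulative));
        }
        return lines;
    }
}
```

The cumulative column is what makes the report useful for picking a weekly target: the first few rows usually account for most of the volume.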

Determining what kinds of things we need to be alerted about is difficult to assess in advance. Often things that we are concerned about when building a feature turn out to be less important in production, and conversely, we miss things in development that turn out to be very important once real customers start using them. Fortunately, deploying every week gives plenty of opportunity for improvement. Also, if a message is logged more frequently than intended, we only have to put up with it for a week before it can be rectified.

I should mention that we have a circuit breaker in place in the log monitor. We do not allow duplicate messages to be sent any more frequently than once per hour. (Relatively early on, we managed to get temporarily blacklisted by a mail provider when an errant message was generated much too frequently.)
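The once-per-hour rule for duplicates could look something like the following sketch. This is not our log monitor's code: the class name is hypothetical, and what counts as a "duplicate" (the `messageKey`) is an assumption left to the caller, for example the logger name plus the message text.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the class name and the notion of a message key are assumptions.
final class DuplicateAlertThrottle {
    private final Duration minInterval;
    // Time at which each distinct message was last allowed through.
    private final Map<String, Instant> lastSent = new HashMap<>();

    DuplicateAlertThrottle(Duration minInterval) {
        this.minInterval = minInterval;
    }

    /**
     * Returns true if an alert with this key may be sent at the given time,
     * and records the send. Returns false if the same key was sent less than
     * minInterval ago.
     */
    synchronized boolean shouldSend(String messageKey, Instant now) {
        Instant previous = lastSent.get(messageKey);
        if (previous != null && Duration.between(previous, now).compareTo(minInterval) < 0) {
            return false; // same message sent too recently: suppress it
        }
        lastSent.put(messageKey, now);
        return true;
    }
}
```

A monitor would create one instance with `Duration.ofHours(1)` and call `shouldSend(key, Instant.now())` before emailing. Passing the time in explicitly keeps the rule easy to test. Note that this sketch never evicts old keys, so a long-running process would also want to prune the map periodically.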

In terms of managing the signal-to-noise ratio, I’ve found that there are two broad questions to ask about each message:

  • Message source: did the message originate in our code or in one of the libraries that we depend on? Clearly, warnings coming from our code are easier to deal with than those from outside. I’ve been frustrated by the laissez-faire attitude that various open source Java frameworks take to logging errors and warnings. We use Apache CXF, and it generates over 10 severe messages with lengthy stacktraces every time the application starts up to inform us that JMS integration through JNDI is not enabled. WTF?!? Sometimes these messages can be controlled by setting custom log levels for specific loggers, but not always. And it typically feels a bit disconcerting to shut down logging, since something important might be missed.
  • System conditions: was the message generated during normal operations, during a shutdown, or during a crash? I’ve found that systems tend to be very noisy during shutdown, but (perversely) pretty quiet during a crash. In the world of Java app servers where memory leaks across deployments are common, trying to quietly quiesce a server is a real challenge.
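To make the "custom log levels for specific loggers" idea from the first bullet concrete, here is a minimal sketch using java.util.logging. The class name and the logger name used in the usage note are hypothetical; the real name of a noisy third-party logger has to be read from the offending log output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only: the class name is hypothetical.
final class NoisyLoggerConfig {
    // java.util.logging holds loggers weakly, so keep strong references here;
    // otherwise a configured level can be lost when the logger is garbage collected.
    private static final List<Logger> PINNED = new ArrayList<>();

    /**
     * Sets the threshold for one named logger. Child loggers without a level
     * of their own inherit it; loggers outside that namespace are unaffected.
     */
    static synchronized Logger quiet(String loggerName, Level threshold) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(threshold);
        PINNED.add(logger);
        return logger;
    }
}
```

For example, `NoisyLoggerConfig.quiet("com.example.noisy", Level.OFF)` silences that one namespace while leaving every other logger at its default level. The same effect can be had declaratively with a line such as `com.example.noisy.level = OFF` in a `logging.properties` file. Scoping the change to the narrowest logger that produces the noise limits the risk of missing something important.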

In the (enterprise) environments where I’ve worked in the past, there was very little interaction between development and operations. Logs were used only for analyzing severe production problems, generally after a crash or after a user had reported a problem. The log files were poorly tuned for diagnosing problems and they tended to be full of junk – problems that no one had noticed or reported that may have been going on for months (or longer).

In contrast, the approach that we follow at my current company means we are able to use logs to proactively find and remedy problems. It requires effort to maintain a high signal-to-noise ratio, but it is very worthwhile.