Peripatetic thinking
At my company, we use a form of Continuous Monitoring: every time our system logs a warning or an error we immediately receive an email identifying the source and nature of the problem. This allows us to respond rapidly to problems as they arise and gives us good visibility into the health of our system. Following the mantra of “do in test as is done in prod”, we have the same monitoring system set up in both environments to help us find issues in test before they find their way into production.
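As a rough illustration of the idea (not our actual setup), here is a minimal sketch of a logger that emails every warning or error, using Python's standard `logging.handlers.SMTPHandler`. The mail host, the addresses, and the `build_alert_handler` helper are all hypothetical placeholders.

```python
import logging
import logging.handlers


def build_alert_handler(mailhost, from_addr, to_addrs, environment):
    """Return a handler that emails each record at WARNING level or above.

    All arguments are illustrative; substitute your own mail settings.
    """
    handler = logging.handlers.SMTPHandler(
        mailhost=mailhost,
        fromaddr=from_addr,
        toaddrs=to_addrs,
        subject="[%s] application alert" % environment,
    )
    # Only warnings and errors trigger an email.
    handler.setLevel(logging.WARNING)
    # Include the logger name so the source of the problem is identifiable.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    return handler


# The same wiring would be used in test and in prod; only the label differs.
alert_handler = build_alert_handler(
    "localhost", "alerts@example.com", ["support@example.com"], "test"
)
logging.getLogger("myapp").addHandler(alert_handler)
```

Nothing is sent until a record at WARNING or above is actually logged through the `myapp` logger.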
The downside to this level of monitoring is that it can amount to a lot of messages. Our challenge is to manage the signal-to-noise ratio so that:
As part of our 5 Whys activity for production issues, we have found that most production issues actually occurred first in test, but simply went unnoticed. This provides a compelling reason to keep the signal-to-noise ratio high in all environments. Any time we find ourselves automatically archiving or filtering an alert, that is an opportunity for improvement.
We have found that refining and tuning these alert messages is an ongoing maintenance activity. As part of our weekly meeting, we try to select one message to clarify or dispatch each week. We have a script that trawls the support emails received in the past week and builds a Pareto distribution of the number of messages by logger. This helps us decide where to focus our efforts and to quantify the impact of our actions on the volume of messages we receive.
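The counting step of such a script might look something like the sketch below. It is not our actual script: it assumes the logger name has already been extracted from each email (how that is done depends on your alert format), and the function name `pareto_by_logger` is made up for illustration.

```python
from collections import Counter


def pareto_by_logger(logger_names):
    """Return (logger, count, cumulative_percent) rows, most frequent first.

    `logger_names` holds one entry per alert email received in the period.
    """
    counts = Counter(logger_names)
    total = sum(counts.values())
    rows = []
    running = 0
    for logger, count in counts.most_common():
        running += count
        # Cumulative share of all messages covered so far.
        rows.append((logger, count, round(100.0 * running / total, 1)))
    return rows


# Hypothetical week of alerts: one logger dominates the volume.
week = ["billing.Invoice"] * 3 + ["web.Session"]
for logger, count, cumulative in pareto_by_logger(week):
    print("%-20s %5d %6.1f%%" % (logger, count, cumulative))
```

The cumulative column makes it easy to see how few loggers account for most of the volume, which is usually where tuning effort pays off first.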
What we need to be alerted about is difficult to determine in advance. Often things that we are concerned about when building a feature turn out to be less important in production, and conversely, we miss things in development that turn out to be very important once real customers start using them. Fortunately, deploying every week gives plenty of opportunity for improvement. Also, if a message is logged more frequently than intended, we only have to put up with it for a week before it can be rectified.
I should mention that we have a circuit breaker in place in the log monitor: we do not allow duplicate messages to be sent any more frequently than once per hour. (Relatively early on we managed to get temporarily blacklisted by a mail provider when an errant message was generated much too frequently.)
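For anyone wanting to build something similar, here is a minimal sketch of that kind of duplicate suppression. It is an illustration rather than our implementation: the `DuplicateThrottle` class, the choice of key, and the injectable clock are all assumptions made for the example.

```python
import time


class DuplicateThrottle:
    """Allow a given message key through at most once per interval."""

    def __init__(self, min_interval_seconds=3600, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_sent = {}  # message key -> time it was last sent

    def should_send(self, key):
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            # Same message seen within the interval: suppress it.
            return False
        self.last_sent[key] = now
        return True


throttle = DuplicateThrottle()
key = ("billing.Invoice", "Payment gateway timeout")  # hypothetical alert
if throttle.should_send(key):
    pass  # hand the message to the mailer here
```

What counts as a "duplicate" is a design choice. Keying on logger name plus message text is one option, but messages that embed ids or timestamps would need to be normalized first.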
In terms of managing the signal-to-noise ratio, I’ve found that there are a few broad categories of messages to deal with:
In the (enterprise) environments that I’ve worked in previously, there was very little interaction between development and operations. Logs were used only for analyzing serious production problems, generally after a crash or after a user had reported a problem. The log files were poorly tuned for diagnosing problems and tended to be full of junk: problems that no one had noticed or reported and that may have been going on for months (or longer).
In contrast, the approach that we follow at my current company means we are able to use logs to proactively find and remedy problems. It requires effort to maintain a high signal-to-noise ratio, but it is very worthwhile.
80% technical, 20% social change. This blog is dedicated to finding ways to sustainably release software more frequently.
Heather Regehr
June 2nd, 2009 at 12:00 pm
Hi Owen:
At my current employer we have a number of components that send email alerts when something is amiss.
This seems good but we suffer quite seriously with two problems resulting from this approach — both of which are longer term maintenance issues:
1. The email was sent to an individual as well as a distribution list; the individual left the company and the distribution list is not proactively managed. The end result: the alerts were lost or ignored, and were not effective.
2. Over the course of time, the exact code that generates the error has been forgotten. No one still working here can tell me what the error message means or where in our infrastructure it is being generated. We ended up on an archeological expedition to find the stupid things.
I don’t think either of these issues is unexpected and, in fact, managing them is closely related to the main point of your post.
My comment simply reinforces the need to maintain the alerts; otherwise, you may end up like us, where the email alerts have become part of our overwhelming technical debt.
PS: we are making terrific use of our logs to extract performance metrics about our site. HR
Heather
exortech
June 8th, 2009 at 10:04 pm
Hi Heather,
Thanks for sharing your experiences. I agree completely with the need to proactively manage the monitoring process:
Regarding log file analysis, it’s hugely useful feedback for both development and the business about how the site is performing and how users are using it.