Ganglia
Last week, we got Ganglia running in test and production. If you aren’t familiar with Ganglia, or don’t have something like it monitoring your site, check it out. It is, frankly, amazing. Although most commonly deployed in the HPC space, Ganglia is also used by many large, high-traffic sites to monitor their server farms.

Ganglia runs a daemon process on each server that grabs vital system metrics and transmits them (via multicast) to an RRD-backed aggregation service. It also provides a simple PHP web application for viewing charts. For a great example of Ganglia in action, check out Wikipedia’s Ganglia instance.

Contrary to what you might expect from an HPC tool, Ganglia is very easy to get up and running. By default, it automatically starts capturing all of the key metrics (CPU, memory, I/O) for your server. Ganglia also supports aggregating data into clusters; in production we have set up one cluster for our app servers and one for the database, as they tend to have different load profiles. We have a similar configuration in our test environment so that we can spot load problems before they get deployed into prod. Ganglia is also the cornerstone of our capacity planning (more on this in a subsequent post).

As Ganglia is backed by RRD, it can be used to capture any type of system or application metric (more precisely, any time-series data). We are in the process of configuring it to capture JVM and message queue statistics, along with other application-level metrics.
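
As a rough illustration of feeding a custom metric into Ganglia, here is a minimal Python sketch that shells out to gmetric, the command-line tool that ships with Ganglia. It assumes gmetric is installed and on the PATH of a host running the Ganglia daemon. The helper names and the queue_depth metric are hypothetical and not part of our setup.

```python
import subprocess


def gmetric_command(name, value, metric_type="uint32", units=""):
    """Build the argument list for a single gmetric invocation.

    The metric name and units are whatever you choose to publish;
    nothing here is specific to our configuration.
    """
    cmd = [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
    ]
    if units:
        cmd.append(f"--units={units}")
    return cmd


def publish(name, value, metric_type="uint32", units=""):
    """Run gmetric; assumes the Ganglia tools are installed on this host."""
    subprocess.run(gmetric_command(name, value, metric_type, units), check=True)
```

Calling publish("queue_depth", 42, units="messages") from a cron job or a small polling loop would be one way to get an application-level number onto the Ganglia charts.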

Log Analysis Service
For application-level monitoring, we have implemented a custom log analysis service. Due to the proliferation of logging frameworks in the Java space, we use SLF4J to aggregate logs from log4j, commons logging and java.util.logging into a single log stream sent over a socket to a simple log analysis service. This service uses logback to filter logging messages and transmit severe or warning messages to our support mailing list. This gives us a pretty good sense of when application problems arise and how to track them down. We run the same service in our test environment to proactively find issues before they get into production. It has been absolutely invaluable in helping us detect and analyze any problems with our system.
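
The filtering idea can be sketched in a few lines. The example below is not our service, which is built on SLF4J and logback; it uses Python's standard logging module instead to show the same pattern: accept everything on a single stream, but keep only warning-and-above records for notification. The AlertBuffer and build_logger names are hypothetical, and the step that mails the collected messages is left out.

```python
import logging


class AlertBuffer(logging.Handler):
    """Collects WARNING-and-above records so they can be sent on later."""

    def __init__(self):
        # The handler-level threshold does the filtering: anything below
        # WARNING never reaches emit().
        super().__init__(level=logging.WARNING)
        self.alerts = []

    def emit(self, record):
        self.alerts.append(self.format(record))


def build_logger(name="app"):
    """Return a logger plus the buffer that receives its alert-worthy records."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)   # let every record through the logger itself
    logger.propagate = False         # keep the sketch isolated from the root logger
    buffer = AlertBuffer()
    buffer.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(buffer)
    return logger, buffer
```

After logging a mix of info, warning and error messages through the returned logger, buffer.alerts holds only the warning and error lines, ready to be batched into an email.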

Nagios
We have also set up Nagios for real-time production monitoring and alerting. We started with Nagios before getting going with Ganglia, but after having it running for a while in our test environment, we realized that the system-level alerts we were receiving from Nagios were mainly noise without the context and trends in system performance. As Ganglia nicely fills this need, we’ve put Nagios on the back burner.

Pingdom
For simple web-level monitoring and uptime statistics gathering, we’re using Pingdom. Pingdom provides a reliable and relatively inexpensive service that routinely pings pages within our site at a configurable interval from its servers around the world.
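
To make the idea concrete, here is a minimal Python sketch of what a single uptime check and the resulting uptime statistic might look like. It is not Pingdom's API or implementation, and it checks from only one location; the function names, the 10-second timeout, and the choice to count any 2xx or 3xx response as "up" are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_once(url, timeout=10.0):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections and timeouts.
        return False


def uptime_percent(results):
    """Percentage of successful checks; results is a list of booleans."""
    if not results:
        raise ValueError("no checks recorded")
    return 100.0 * sum(results) / len(results)
```

Running check_once against a page at a fixed interval and feeding the collected results to uptime_percent gives a crude, single-location version of the statistic a hosted service reports.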

We’re continuing to add new monitoring to our site, but I feel that what we have set up now provides a pretty solid foundation for keeping our system running reliably.