Last week I encountered one of the more bizarre bugs of my career. Following the release, gaps started appearing on charts. The strange thing was that the data was all in the database; it just wasn’t coming through the user interface. Unfortunately, for this release, rolling back wasn’t really an option, so we needed to quickly identify and correct the source of the problem.

After spending the better part of a day banging our heads against the problem, we had isolated it to a few statements. The code had changed in this area in the last release (one of the advantages of weekly releases is that it is easier to pinpoint the source of a problem as each release contains only one week’s worth of changes) but not in a way that should have caused the problem we were seeing. And everything ran correctly in the test environment. It seemed like the problem could be environment-related.

We didn’t have the necessary hooks and logging to properly exercise the problem area in isolation, but after a quick patch release, we did (another benefit of zero-downtime deployment is that it allows greater flexibility with the time and frequency of deployment). After the patch deploy, we could see that the production system returned one fewer result than the corresponding functionality run against the test environment. If the system was supposed to return only one result, the production system was returning none, which explained the gaps on the charts. The problem clearly lay at the SQL driver or database level.
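The check we ran after the patch deploy amounted to issuing the same query against both environments and diffing the results. A minimal sketch of that comparison, with made-up rows standing in for real query output:

```python
# Hypothetical sketch: diff the rows a query returns in two environments.
# In practice both lists would come from real database calls; these sample
# rows are invented for illustration.

def diff_results(test_rows, prod_rows):
    """Return rows that test returned but production did not."""
    return [row for row in test_rows if row not in prod_rows]

test_rows = [("2009-11-01", 42), ("2009-11-02", 37)]
prod_rows = [("2009-11-01", 42)]

missing = diff_results(test_rows, prod_rows)
# missing == [("2009-11-02", 37)]: production returned one fewer result
```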

Just prior to the release, the MySQL driver had been upgraded to the latest version (5.1.10). We had been running this version in the test environment for several weeks without issue, so it seemed odd that it could be the source of the problem. The version of the database, however, was inconsistent between the two environments (MySQL 5.1.39 in production vs 5.1.36 in test). The newer database version had been running fine in production for several days and hadn’t been touched by the release, so it seemed equally odd that it could be the source of the problem. It was enough to go on, however, and we reverted the version of the MySQL driver on the web servers, which ended up fixing the problem.

Later, while trawling the MySQL release logs, I came across the nasty bug that bit us. Evidently, the 5.1.10 driver had changed the format of dates in a way that triggered this bug in MySQL server 5.1.39. So it was the combination of these two versions of the software that caused the problem – each in isolation worked fine. The issue has been fixed in MySQL 5.1.41, but it’s a pretty serious bug in core SQL functionality to have come out of a sanctioned release.
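To see how a change in date format alone can silently drop a row, consider a range query whose upper bound is a bare date. This sqlite3 sketch is illustrative only – the actual bug was in the MySQL 5.1.39 server, triggered by the 5.1.10 driver’s serialization – but the failure mode is the same: a boundary row quietly disappears.

```python
import sqlite3

# Illustrative only: shows how the *format* of a date literal can change
# which rows a range query returns. Sample table and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts TEXT, value INTEGER)")
conn.execute("INSERT INTO metrics VALUES ('2009-11-02 10:30:00', 42)")

# A bare date upper bound compares lexically as '2009-11-02', which sorts
# *before* '2009-11-02 10:30:00' -- so the row silently disappears.
short = conn.execute(
    "SELECT COUNT(*) FROM metrics WHERE ts <= '2009-11-02'").fetchone()[0]
full = conn.execute(
    "SELECT COUNT(*) FROM metrics WHERE ts <= '2009-11-02 23:59:59'").fetchone()[0]
# short == 0, full == 1: same query intent, one fewer result
```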

Our 5 Whys analysis of the experience also pointed to several ways we need to tighten up our process.

  1. We need to separate environment changes from software releases. Even seemingly innocuous changes can have repercussions when combined with other systems. Keeping environment changes separate will make it easier and faster to pinpoint the source of problems (i.e. is it an environment problem or a software problem?).
  2. We need to narrow the discrepancies between the test and production environments. Environment discrepancies are a common source of unexpected risk. While it’s not feasible to keep the environments perfectly in sync, we can narrow the gap considerably. Ironically, it was an attempt to make the environments more consistent that caused this problem.
  3. We need a more comprehensive set of automated validation tests that we can run against the system after each deployment to verify the integrity of the release. This is one of the tasks for this week.
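A minimal sketch of what such a post-deployment validation check might look like; the check names, expected minimums, and the `fetch_count` hook are all hypothetical, not our real system:

```python
# Hypothetical post-deployment smoke check. EXPECTED maps a named query to
# the minimum number of rows it should return in a healthy release.
EXPECTED = {"daily_signups": 7, "weekly_revenue": 1}

def validate_release(fetch_count, expected=EXPECTED):
    """Run each check and return the names of any that came up short."""
    return [name for name, minimum in expected.items()
            if fetch_count(name) < minimum]

# Example: a release where one chart's query suddenly returns no rows.
counts = {"daily_signups": 7, "weekly_revenue": 0}
failures = validate_release(lambda name: counts[name])
# failures == ["weekly_revenue"]
```

Run against live endpoints right after a deploy, a check like this would have flagged the missing chart data in minutes rather than after users noticed the gaps.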

While these actions may seem fairly obvious, it often (unfortunately) takes getting bitten by these types of bugs to illuminate areas where we need to improve. Have you faced something similar? What additional actions do you take to prevent these types of environment problems?