Last week I encountered one of the more bizarre bugs of my career. Following the release, gaps started appearing on charts. The strange thing was that the data was all in the database; it just wasn’t coming through the user interface. Unfortunately, for this release, rolling back wasn’t really an option, so we needed to quickly identify and correct the source of the problem.
After spending the better part of a day banging our heads against the problem, we had isolated it to a few statements. The code had changed in this area in the last release (one of the advantages of weekly releases is that it is easier to pinpoint the source of a problem as each release contains only one week’s worth of changes) but not in a way that should have caused the problem we were seeing. And everything ran correctly in the test environment. It seemed like the problem could be environment-related.
We didn’t have the necessary hooks and logging to properly exercise the problem area in isolation, but after a quick patch release, we did (another benefit of zero-downtime deployment is that it allows greater flexibility with the time and frequency of deployment). After the patch deploy, we were able to see that the production system returned one fewer result than the corresponding functionality run against the test environment. If the system was supposed to return only one result, the production system was returning none – which explained the gaps on the chart. The problem was clearly at a SQL driver or database level.
Just prior to the release, the MySQL driver had been upgraded to the latest version (5.1.10). We had been running this version in the test environment for several weeks without issue, so it seemed an unlikely culprit. The database version, however, differed between the two environments (MySQL 5.1.39 in production vs. 5.1.36 in test). The newer database version had been running fine in production for several days and hadn’t been touched by the release, so it too seemed an unlikely source of the problem. Still, this was enough to go on: we reverted the MySQL driver on the web servers, which fixed the problem.
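A version mismatch like this one is easy to miss by eye but cheap to detect mechanically. The following is a minimal sketch, not something we had in place at the time: it assumes each environment's component versions have already been collected into a dictionary, and the function name `find_version_drift` and the component keys are hypothetical names chosen for illustration. The version numbers are the ones from this incident.

```python
def find_version_drift(environments):
    """Return {component: {env_name: version}} for every component
    whose version is not identical across all environments.

    `environments` maps an environment name to a dict of
    component name -> version string (a hypothetical inventory format).
    A component missing from an environment is reported as None.
    """
    # Collect every component name seen in any environment.
    components = set()
    for versions in environments.values():
        components.update(versions)

    drift = {}
    for component in sorted(components):
        seen = {env: versions.get(component)
                for env, versions in environments.items()}
        # More than one distinct value means the environments disagree.
        if len(set(seen.values())) > 1:
            drift[component] = seen
    return drift


# Versions taken from the incident; the key names are illustrative.
environments = {
    "production": {"mysql_server": "5.1.39", "mysql_driver": "5.1.10"},
    "test": {"mysql_server": "5.1.36", "mysql_driver": "5.1.10"},
}

drift = find_version_drift(environments)
for component, versions in drift.items():
    print(component, versions)
```

Run against the versions above, this flags the database server and leaves the driver alone, since the driver was identical in both environments. A check along these lines could run before each deploy, though how the versions get collected is left open here.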
Later, while trawling through the MySQL release logs, I came across the nasty bug that bit us. Evidently, the 5.1.10 driver had changed the format of dates in a way that triggered this bug in MySQL 5.1.39. So it was the combination of these two versions of the software that caused the problem; each in isolation worked fine. The issue has been fixed in MySQL 5.1.41, but it’s a pretty serious bug in core SQL functionality to come out of a sanctioned release.
Our 5 Whys analysis of the experience also pointed to several ways that we need to tighten up our process.
While these actions may seem fairly obvious, it often (unfortunately) takes getting bitten by these types of bugs to illuminate areas where we need to improve. Have you faced something similar? What additional actions do you take to prevent these types of environment problems?