Last week I encountered one of the more bizarre bugs of my career. Following the release, gaps started appearing on charts. The strange thing was that the data was all in the database; it just wasn’t coming through the user interface. Unfortunately, for this release, rolling back wasn’t really an option, so we needed to quickly identify and correct the source of the problem.

After spending the better part of a day banging our heads against the problem, we had isolated it to a few statements. The code had changed in this area in the last release (one of the advantages of weekly releases is that it is easier to pinpoint the source of a problem as each release contains only one week’s worth of changes) but not in a way that should have caused the problem we were seeing. And everything ran correctly in the test environment. It seemed like the problem could be environment-related.

We didn’t have the necessary hooks and logging to properly exercise the problem area in isolation, but after a quick patch release, we did (another benefit of zero-downtime deployment is that it allows greater flexibility with the time and frequency of deployment). After the patch deploy, we were able to see that the production system returned one fewer result than the corresponding functionality run against the test environment. If the system was supposed to return only one result, the production system was returning none – which explained the gaps on the chart. The problem was clearly at a SQL driver or database level.

Just prior to the release, the MySQL driver had been upgraded to the latest version (5.1.10). We had been running this version in the test environment for several weeks without issue, so it seemed odd that it could be the source of the problem. The version of the database, however, was inconsistent between the two environments (MySQL 5.1.39 in production vs 5.1.36 in test). The newer database version had been running fine in the production environment for several days and hadn’t been touched with the release, so it seemed equally odd that this could be the cause. It was enough to go on, however, and reverting the version of the MySQL driver on the web servers ended up fixing the problem.

Later, while trawling the MySQL release notes, I came across the nasty bug that bit us. Evidently, the 5.1.10 driver had changed the format of dates in a way that triggered this bug in the 5.1.39 server. So it was the combination of these two versions of the software that caused the problem – each in isolation worked fine. The issue has been fixed in MySQL 5.1.41, but it’s a pretty serious bug in core SQL functionality to come out of a sanctioned release.

Our 5 Whys analysis of the experience also pointed to several ways that we need to tighten up our process.

  1. We need to separate environment changes from software releases. Even seemingly innocuous changes can have repercussions when combined with other systems. Keeping environment changes separate will make it easier and faster to pinpoint the source of problems (i.e. is it an environment problem or a software problem?).
  2. We need to narrow the discrepancies between the test and production environments. Environment discrepancies are a common source of unexpected risk. While it’s not feasible to keep the environments perfectly in sync, this could be better. Ironically, it was an attempt to make the environments more consistent that caused this problem.
  3. We need a more comprehensive set of automated validation tests that we can run against the system subsequent to deployment that verifies the integrity of the release. This is one of the tasks for this week.

While these actions may seem fairly obvious, it often (unfortunately) takes getting bitten by these types of bugs to illuminate areas where we need to improve. Have you faced something similar? What additional actions do you take to prevent these types of environment problems?

Last week, I spoke at the Agile Vancouver monthly meeting on the subject of *surprise* Deploying to Production Every Week. Eugene was kind enough to video the talk, so if you’re interested in catching a replay, it’s available on Vimeo in five 20-minute segments. Here’s the first video as a teaser:

Deploying to Production Every Week (Owen Rogers), part 1 of 5 from Agile Vancouver on Vimeo.

Yesterday we deployed 19 times to our test environment (about once for every 4 commits). We were busy making some final refinements to a new feature that we launched last night. We had most of the team working on this feature (Single Feature Release) and the rapid feedback of the regular deploys helped ensure that everything was coming together for the release.

There is no way that we could have maintained this pace if we were taking down the test system for 10 minutes with each deploy. We needed to be able to continuously test the system throughout. This to me is one of the unsung values of zero-downtime deployment. Everyone normally focuses on the benefit of your users not seeing a fail whale page if they hit your site during a deploy. But let’s face it: unless you’re practicing continuous deployment, you’re deploying to test much more often than you are deploying to production. And the frequency with which we can cycle through and get feedback on changes in the test environment is one of the primary limiting factors determining how frequently we can release the software.

While many people that I’ve talked with like the idea of zero-downtime deployment, they have little access, control or influence over the production environment. So my advice is to start with the environments that you own, like test. Get it set up and running solidly there first. This greatly lowers the barrier to then rolling it out to production. And, as I describe above, the benefits of having zero-downtime in test are substantial.

In my next post, I’ll try to provide some more technical details about our zero-downtime setup.

I arrived home last night after a quick whirlwind trip to India for the CodeChef conference. With three tech talks in three cities in three days, it didn’t leave much time for sightseeing. But I did have a few days at the end in Bangalore to catch up with friends and former colleagues.

Here are the slides from the presentation. They did evolve a bit over the course of the tech talks (and hopefully improve).

The presentation also stimulated some good side chatter on Twitter. In general, zero-downtime database deployment, continuous monitoring and WAGMI seemed to be popular topics. Thanks to everyone who made it out and contributed. Your feedback has been very helpful in refining the presentation.

Also, a big thanks to Naresh and Amit for organizing the event and ensuring that we were well taken care of.

Speaking in India

7 Sep 2009 In: speaking

This week, I’m off to India to speak at the CodeChef conference. The conference consists of three talks over three days in three different Indian cities: Mumbai, Hyderabad and Bangalore. I will be speaking alongside Lisa Crispin and Bhavin Turakhia. My talk (unsurprisingly) will be on weekly releases and will be quite similar to the presentation that I gave at Devteach.

I am really looking forward to returning to India for this trip and catching up with friends and former colleagues there. I will be in Mumbai on Wednesday and Thursday, Hyderabad on Friday, and Bangalore on Saturday through Monday. So if you are around and want to head out for Maharashtrian food or chaat, MTR or Kamat, or whatever local delicacies they have in Hyderabad, please let me know. Also, a big thanks to my friend Naresh Jain for organizing the whole thing.

I’m a big fan of CSS – it keeps things looking consistent, it separates structure and design, and it keeps markup clean, simple and maintainable. The greatest strength and weakness of stylesheets is their scope. A large number of pages are typically styled by a single stylesheet. This is great for consistency and reuse, but it means that it can be difficult to assess the impact of changing a style without verifying every page that uses it – in every supported browser! Changing a global style could break the layout somewhere in the site in ways that could easily go unnoticed.

Hence, when releasing software to production every week, the cost of making style changes can be prohibitive. It is difficult to regression test all impacted pages in all browser combinations within a reasonable amount of time. Another challenge is that in CSS there are many ways to achieve the same thing, though each approach could render differently in different browsers. And in a site that is changing rapidly, existing designs need to evolve to incorporate new features, usability improvements, user feedback and better ways of doing things.

Stylesheets can’t be allowed to become too risky to change. Otherwise they will be subverted through local inline styles and other workarounds (the same goes for common JavaScript libraries, or pretty much any shared component for that matter). While local styles may seem expedient for that particular page or feature, they only introduce inconsistencies and make the site more difficult to maintain in the long term. So what’s a web dev to do?

We’ve developed an approach to help deal with this problem. I accept that there are probably smarter ways to achieve this – if you have a better idea, please let me know. Here goes:

  1. All styles should be relative to some top-level class. For example:
    form.standard { margin: 0px 150px; }
    The class describes the type of component being styled – the kind of table, form or widget we are designing. In the example, we are defining a “standard” form, which is a particular type of form. Using a top-level class allows other types of forms to be styled as required.
  2. All sub-elements are defined relative to the top-level class:
    form.standard label { margin-left: -150px; }
    All labels for a “standard” form have a negative left margin. It is acceptable (even preferable) to style bare tags as long as the style is relative to a top-level class. This keeps the markup simple and consistent. We only add CSS classes to child elements as required.
  3. The corollary to the above is that it is not acceptable to style bare tags outside the scope of a top-level class (unless we are doing a style reset). The problem with styling bare tags globally is that it locks us into one specific style, which makes it difficult to evolve the design of the site. For example:
    label { font-weight: bold }
    means that all form labels across the site will be bold. If we later introduce a form that shouldn’t have bold labels, we will have to explicitly override this style (which tends to be brittle and limiting).
  4. Now, if we need to evolve an existing style, we have a few options. Say we want to replace the “standard” form with a new design. Rather than change the styles for the “standard” form class directly, which would immediately impact every page where this style is used, we can incrementally and selectively roll out the new design on a page-by-page basis. If the new style is significantly different from the initial style, we can fork the original style by defining a new top-level class:
    form.danger { background-color: red; }
    Simply by changing the class attribute on each form element from “standard” to “danger”, we can roll out the new style to select pages, testing the design in all supported browsers as we go. Think of it as continuous integration for site design.
  5. If the style changes are relatively minimal, we can override the style for specific elements. One way to achieve this is to use multiple top-level CSS classes. For example, we could incrementally apply both the “standard” and the “danger” classes to the form elements on each page, testing as we go. The “danger” class could override styles as required – though dealing with precedence can be tricky. Alternatively, the new class could be defined relative to a top-level identifier. This solves the precedence problem, as styles defined relative to an ID take precedence over styles defined relative to a class. Another option is to define specific classes for the child elements to be restyled – but this means changing a lot more markup during the rollout.
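To make options 4 and 5 concrete, here is a sketch of what the fork and the overrides might look like. The “danger” class and the “#checkout” identifier are illustrative choices for this example, not styles from our actual stylesheet:

```css
/* Option 4 – fork: a new top-level class alongside the original.
   Pages migrate by changing class="standard" to class="danger". */
form.standard label { font-weight: bold; margin-left: -150px; }
form.danger   label { font-weight: normal; margin-left: -150px; }

/* Option 5 – override via multiple classes: applies only to
   <form class="standard danger">; the extra class wins on specificity. */
form.standard.danger label { color: red; }

/* Option 5 – override scoped to an identifier: an ID-based selector
   outranks any purely class-based one. */
#checkout form.standard label { color: red; }
```

The rollout then amounts to editing the class (or id) attribute on one page at a time, verifying in each supported browser as you go, and deleting the old rules once no page references them.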

That’s it. It seems pretty simple – intuitive even. But I haven’t found many references to how others tackle this problem. Again, if you have a better idea, please let me know.

Frequent Releases Reduce Risk: Talk at VanDev

22 Jul 2009 In: agile

Last week I delivered a presentation at the Vancouver Software Developer Network meetup on the relationship between risk and frequent releases. In the presentation I proposed that building the capability to release software frequently (daily or weekly) actually reduces risk and that concerns about frequent releases are founded on a localized understanding of risk. You can find the slidecast for the presentation below.

My intention with the presentation was to lay out the basis for an argument that would be subsequently debated. This proved to be more challenging than my last presentation on frequent releases, which was more of an experience report. It’s difficult to clearly convey the layers of a logical argument through a presentation. I’m impressed by how lawyers manage to do it.

One of the things that I asked attendees to do was to list the top three things that are preventing them from releasing software every week. I’ve compiled their responses in the chart below:

The results are hardly scientific, but I was heartened to see substantial overlap between the concerns that I addressed in the presentation and those identified by the audience. It turned out that quite a large contingent of the audience were rich client developers, which brings its own share of deployment headaches. Many are also working in regulated industries that require their software to be submitted to third party certification agencies for review. To those that attended the session, thanks for participating and for sharing your concerns with me.

We have just sent out the call for session proposals for the Agile Vancouver 2009 conference. I’m really excited about the great list of speakers that we already have confirmed to come to this year’s conference, including:

  • Arlo Belshee
  • Eric Evans
  • Michael Feathers
  • Martin Fowler
  • Michael Hugos
  • Michael Nygard
  • Jeff Patton
  • Mary Poppendieck
  • Linda Rising
  • Johanna Rothman

We are looking for more speakers, and we welcome you to send in your session proposals or experience reports. If you are interested in presenting at this year’s conference, please send us your session proposal. Proposals should include:

  • session title and short description (300 words or less)
  • speaker bio
  • speaker photo
  • learning objectives: what do you intend for participants to learn from attending your session?
  • logistics considerations: what special equipment (whiteboards, print-outs, easels, etc) do you require?
  • email address that you would like us to use for all conference-related correspondence

The deadline for submitted proposals is September 2nd and we will announce selected proposals by September 17th.

Last week, our site sustained a prolonged outage during core business hours. While testing their backup power systems, our data centre provider tripped a breaker, leading to a cascade of failures that, among other things, produced a power surge that fried our hardware firewall’s power supply. The hardware firewall is one of those standard pieces of system hardware that is so simple it is assumed to be failure resistant – one of the pieces of an infrastructure least likely to fail. The reality is that it is antiquated, commodity hardware that the host provider paid off long ago and that has sustained the load of numerous sites before ours. The question is not if it’s going to fail, but when. And the implications of its failure are quite severe.

By design, the firewall serves as the single access point into and out of the site. Even though we had taken redundancy and failover measures in the web server, application server and database clusters behind the firewall, none of that matters much if traffic can’t get through. Essentially, the hardware firewall is a big old single point of failure (SPOF).

Normally a hardware power supply is one of those things that a data centre can very quickly replace. But when the data centre itself is in turmoil because of a significant outage, replacing a power supply for some small customer is the last thing on their mind. When it comes down to it, the only ones who care about your site are you and your customers. We, of course, knew about the failure immediately because of the monitoring we have in place, but there wasn’t much we could do. When managed hardware fails, there’s not much you can do except log a ticket (assuming that the ticketing system is up – which in this case it wasn’t), sit back and wait. Of course there are SLAs in place (there’s a one-hour replacement window on these types of things), but they don’t keep your site up, and going through the negotiations to sort out the ramifications of a failure is a big waste of everyone’s time. The bottom line is that we need to eliminate this SPOF from our infrastructure by obtaining a secondary firewall that we can fail over to.

Speaking at DevTeach Vancouver

10 Jun 2009 In: agile, event

I’ll be speaking this Thursday at DevTeach Vancouver about our experiences doing weekly production deployments. Some topics that I will cover:

As I haven’t done any .NET development in over a year, I feel like a bit of an impostor at the conference. However, I think that the ideas and experiences of short release cycles transcend technology. I also think that there’s a lot that the Java and .NET communities can learn from each other.

Here are the slides: