Last week, we needed to run a long-running migration as part of our weekly deployment. Our company deals with large volumes of time series data – the majority of which is managed through a single large table (we haven’t implemented data sharding yet). We needed to do some restructuring and reindexing of this table, which we knew would take some time to run. We also wanted to be able to perform this maintenance in a way that would minimize user impact and the possibility of downtime.
Since we started practicing zero-downtime database deployment, we have had the freedom to decouple database expansion migrations from the actual release, because we ensure that all expansion migrations remain backwards compatible with the existing release. This helps mitigate some of the risk of releasing and allows us to break a release into a few manageable chunks. This week we chose to upgrade the database on Thursday night and make the release on Friday.
Prior to running the migrations, we halted all writes to the large table by suspending data processing from the queues that feed data to it. This is a nice side-effect of using queues – they provide a safe lever to control or throttle the behaviour of the system during deployment and maintenance. While it wasn’t technically necessary to pause the queues during the migration, it did allow the migration to complete faster with less contention.
I would like to say that the migration went off without a hitch, but we did incur about 10 minutes of downtime during the table restructuring. In retrospect, we should have created a temporary table, performed the restructuring and reindexing on it, and then swapped it with the actual table once the work was complete. Suspending writes to the table would have been important here, as it would have ensured that the actual and temporary tables stayed in sync.
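For the curious, the swap approach would look roughly like the sketch below. This is an illustration only: the connection details, table and index names are hypothetical, and the SQL is PostgreSQL-flavoured (the exact syntax depends on your database).

    import groovy.sql.Sql

    // Illustrative connection details and table names only.
    def sql = Sql.newInstance('jdbc:postgresql://dbhost/prod', 'deploy', 'secret', 'org.postgresql.Driver')

    // Build the restructured copy alongside the live table while writes are suspended
    // (plus whatever structural changes the migration actually calls for).
    sql.execute 'CREATE TABLE readings_new (LIKE readings INCLUDING DEFAULTS)'
    sql.execute 'INSERT INTO readings_new SELECT * FROM readings'
    sql.execute 'CREATE INDEX readings_new_recorded_at_idx ON readings_new (recorded_at)'

    // The cut-over is then a pair of quick renames rather than a long outage.
    sql.execute 'ALTER TABLE readings RENAME TO readings_old'
    sql.execute 'ALTER TABLE readings_new RENAME TO readings'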
Another hiccup that we encountered during the migration was that the remote connection to the production database server was severed while the migration was in progress (due to a timeout on the SSH connection). Oops. Fortunately, the connection was broken while executing the last (and longest running) step in the migration and it continued to run to completion on the database server. This did make us realize that – especially for long-running operations – running remotely is risky.
We currently run migrations against production using Bering, the same migration framework that we use to migrate development and test databases. This makes me appreciate why migration frameworks like dbDeploy focus just on generating the migration SQL script rather than executing it: seasoned DBAs are likely savvy to this risk and want something that they can easily run directly on the database servers. I'm planning on making some extensions to Bering to support generating SQL scripts that can be packaged as part of our deployment.
Last week I made some modifications to our database migration framework (Bering) to support zero-downtime database deployment. My introduction to zero-downtime database deployment comes from Michael Nygard’s excellent book, Release It!.
Database migrations are one of the primary sources of planned outages during a system deployment. As we’re deploying to production every week, this is a big concern for us. In a scaled out web-based system, it is pretty easy to deploy a new version of the application without incurring downtime by adjusting the load balancer to pull servers out of the pool while they are getting upgraded. Applying schema changes to a central database is another matter.
Database migrations tend to introduce a chicken-or-the-egg-type problem: the database changes can’t be applied without breaking the existing version of the application and the new version of the application won’t work without the changes to the database. Either way, you’re faced with the prospect of all or part of the system being unavailable for the duration of the deployment. Not good.
Zero-downtime database deployment presents a way out of this conundrum. The idea is to separate database migrations into two sets of changes: expansion scripts, which contain the additive changes (new tables, new nullable columns, new indexes) that remain backwards compatible with the currently deployed version of the application, and contraction scripts, which contain the destructive clean-up changes (dropping the structures that only the old version needed) that can safely be applied once the new version is in place.
The expansion scripts are run at some point prior to upgrading the application, and the contraction scripts are run once the system has been upgraded and is considered stable. This has the nice benefit of decoupling database migrations from application deployments. The expansion scripts can be run a day or more in advance of the application deployment, at a time that is convenient for database changes. The contraction scripts can be run potentially days after the deployment, once everything has been validated with the new release.
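To make the split concrete, here is a hedged sketch of how a column rename might be staged. The table, columns and connection details are hypothetical, and in practice each phase would live in its own migration script rather than one file.

    import groovy.sql.Sql

    // Illustrative connection; in practice these statements would sit in migration scripts.
    def sql = Sql.newInstance('jdbc:postgresql://dbhost/prod', 'deploy', 'secret', 'org.postgresql.Driver')

    // Expansion (run before the release): purely additive, so the currently deployed
    // application keeps working against the old column.
    sql.execute 'ALTER TABLE accounts ADD COLUMN display_name VARCHAR(255)'
    sql.execute 'UPDATE accounts SET display_name = name'

    // ... deploy the new application version, which reads and writes display_name ...

    // Contraction (run days later, once the release is considered stable): remove what
    // only the old version needed.
    sql.execute 'ALTER TABLE accounts DROP COLUMN name'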
This approach also greatly simplifies deployment rollbacks. As much as we try to ensure that all database changes are reversible by having 'down' scripts, rolling back database changes is rarely easy and can lead to lost or inconsistent data, depending on the time elapsed between the migration and the rollback. With zero-downtime database deployment, however, if a problem is discovered with the new version of the application either during or after the release, it is safe to roll back to the existing version of the application without needing to roll back the database changes, since the expansion migrations are compatible with both versions of the system.
Supporting zero-downtime database deployment is as simple as having two schema version tables in your database: one for tracking the latest version of the expansion scripts applied, and one for tracking the contraction script versions. Then it is just a matter of keeping a separate database migration folder for each type of script. I needed to make some extensions to Bering to support a configurable version table name and script folder location, but aside from that, it was pretty easy to get up and going.
Applying zero-downtime database deployment is a bit of an experiment for us. I plan to report more as we get more experience with it.
We arrived back to Vancouver in December after six months working in China. There wasn’t much going on for ThoughtWorks in Vancouver when I got back, and so, after weighing my options, I decided to respectfully part ways with the company. My 5.5 years with ThoughtWorks were full of great opportunities, great people and great experiences; however, Vancouver is my home, and now with a family I wanted to travel less than ever. My last day at TW coincided with the annual ThoughtBoarder ski trip, and the company graciously offered to cover my trip out to Calgary to wrap up my contract. 3 fantastic days of skiing in Fernie was a great conclusion to a chapter in my life.
Before leaving TW, I used my employee health care spending account to get laser eye surgery. It was something that I had been considering for years, but finally had the opportunity and excuse to do. The surgery was quick and painless, and after two days I was able to see pretty much as well as I could before with my glasses. Having the ability to see naturally again without glasses has been so incredibly liberating and worthwhile.
After ThoughtWorks, I considered doing some local agile consulting and working on my nascent product idea; however, I came across an opportunity with a local startup company that was exactly what I wanted to do. For years, I had been looking for an opportunity to apply my skills and experience building software systems to a cause that I am passionate about: namely, to address the threat of global climate change. The company, Small Energy Group, builds software to help organizations save money and reduce energy consumption by operating their facilities more efficiently. A third of all energy consumed in North America goes to heating, lighting, cooling and ventilating buildings, so helping buildings use less energy has a large impact on our overall ecological footprint.
In February, I launched Agile Vancouver tech talks. For a while I had been feeling that the content of the regular Agile Vancouver monthly meetings had become targeted at project management and was neglecting the technical practices in the agile canon, despite the fact that the majority of our members are developers. The tech talks are an attempt to re-engage this part of the community, with more of an emphasis on discussion (fishbowls or Q&A) and hands-on tutorials than on talking heads.
My first major launch with Small Energy Group was a site for Earth Hour. We partnered with various local organizations to help them track their energy consumption for the event to show and quantify the impact of their energy savings actions. The event proved to be great PR for the company and demonstrated our ability to respond quickly and out-maneuver larger players.
When I started at Small Energy Group, I was the third developer. During April and May, we doubled that number. Okay, one of the hires is a graphic designer, not a developer — but as I had been doing (poorly) some of the graphic design, his hire was a big relief.
I also zipped down to Seattle for alt.net. It was a good opportunity to catch up with former colleagues and to reacquaint myself with what was happening in the .NET space.
I also had been doing some local consulting on the side, mainly retrospectives, training and the like. After working for a larger company and being part of a broader community, working at a startup felt a bit insular. I found that the occasional consulting gig was a great way to stay in touch with what was happening in the local software community.
During the middle of May, I headed out to Toronto to speak at DevTeach. It was a bit of a push to get my three presentations ready in time, as I didn’t have much chance to work on them beforehand, but I managed. Unfortunately, I didn’t get to as many presentations as I would have liked. Being there, I did feel a bit like an impostor, however, not having done any active .NET development in several months. At least I was a member of the cool kid minority doing .NET on MacBooks.
In June, we had our first company retreat on beautiful Hornby Island. The company founder’s family has some beautiful property on the island and we spent our days in planning and brainstorming sessions, and our spare time exploring the island. The retreat formed the basis for planning the first productized release of the SEG software. The plan was to spend one week building up different core areas of the product with the full team devoted to a specific feature.
My family also spent a few weeks over in Gibsons. I was commuting back and forth by bike (and ferry); it was hot and strenuous, but a beautiful way to get to work.
July was quite disruptive, as we were required to move out of our apartment because it had been sold. After a bit of a disagreement with the new owner, we found a new place a couple of blocks down. Moving post-kids is definitely much more work than moving pre-kids, when all of our stuff fit into two suitcases and two backpacks.
When we were able to take a break from packing, cleaning and moving, we had a number of wonderful weekend bike trips to the scenic Gulf Islands.
In August, I delivered a 3-hour tutorial on Continuous Monitoring at Agile 2008 in Toronto. I had originally proposed a 30-minute product demo, but through various alterations somehow ended up with a three-hour tutorial – ostensibly on Continuous Integration. The conference itself was a bit of a disappointment – much of it was rehash. It was good to reconnect with friends from the Agile community, but I get the feeling that the interesting stuff is happening elsewhere.
September marked our version 1 release of the product. In reality, the product had been around for quite a bit longer; however, this release delivered a targeted feature set together with a marketing and sales vision.
In October, I took a trip with my family to San Francisco. Despite being a software guy, I had never been to Silicon Valley. The trip was a great chance to explore the area and to catch up with friends living there.
At the start of November, we hosted the 3rd annual Agile Vancouver conference. We had spent much of the year organizing the conference, securing speakers, building the programme and dealing with logistics, so it was great to have it come to fruition. This year’s conference was the largest yet, with 3 parallel tracks, one day of tutorials and over 20 invited speakers, including Sanjiv Augustine, David Hussmann, David Anderson, Ken Schwaber, Lisa Crispin and others.
We also had our second corporate retreat at Whistler. With the growth of the company, it was a great chance to review our goals, plans and targets.
With the company continuing to expand, we had a family Christmas party at the top of Grouse Mountain (one of our clients). Things continue to grow apace with the product and the addition of new clients and colleagues alike.
One key difference between building software products in Java versus .NET is the preponderance of open source libraries in the Java space. In .NET, most companies are content to go with a fully Microsoft stack and, after Microsoft's anti-open source campaign of the mid-2000s, are wary of letting open source creep into their code base. However, given the antiquity and incompleteness of the core Java libraries, only a crazy or paranoid company would shy away from taking advantage of the wealth of open source libraries available.
This profusion of open source creates two challenges: dependency management and license verification. For dependency management, we use Maven 2. I’m not a big fan of Maven’s build lifecycle, but it does a reasonable job of dealing with dependencies. This past week, Cailie set up the Maven report plugin to generate a report detailing the project’s dependencies and their associated open source licenses. Very useful stuff, and a great application of the metadata associated with each dependency in the Maven repository. We have the reports regenerated as part of our automated build process, so that we can always see the project’s current dependencies and their associated licenses.
Unfortunately, only about half of the open source libraries that we use have their licenses specified in the central Maven repo. For those that do, they all tend to name their licenses differently ("Apache Software License V2.0" vs. "APL 2.0" vs. "Apache 2") as the license field is free text, which impedes the readability of the report. And there are always projects that use licenses like "Bob's Open Source License", but you would really only know that it is a BSD derivative (with or without the attribution clause?) by carefully reading the license.
Last week, we got Ganglia running in test and production. If you aren’t familiar with Ganglia or don’t have something like it monitoring your site then check it out. It is, frankly, amazing. Commonly deployed in the HPC space, Ganglia is used by many large, high-traffic sites to monitor their server farms.
Ganglia uses a daemon process running on each server to gather vital system metrics and transmit them (via multicast) to an RRD-backed aggregation service. It provides a simple PHP web application for viewing charts. For a great example of Ganglia in action, check out Wikipedia's Ganglia instance.
Contrary to what you might expect from an HPC tool, Ganglia is very easy to get up and going. By default, it automatically starts capturing all of the key metrics (CPU, memory, I/O) for your server. Ganglia also supports aggregating data into clusters; in production we have set up one cluster for our app servers and one for the database, as they tend to have different load profiles. We have a similar configuration in our test environment so that we can spot load problems before they get deployed into prod. Ganglia is also the cornerstone of our capacity planning (more on this in a subsequent post).
As Ganglia is backed by RRD, it can be used to capture any type of system or application metric (more precisely, any time series data). We are in the process of configuring it to capture JVM statistics, message queue statistics and other application-level metrics.
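One simple way to feed an application-level metric into Ganglia is to shell out to the gmetric command-line tool that ships with it. The sketch below is illustrative only: the metric name is hypothetical, and the exact gmetric options available depend on your Ganglia version.

    // Push a custom application metric into Ganglia via the gmetric command-line tool.
    def publishMetric(String name, def value, String units) {
        def cmd = ['gmetric', '-n', name, '-v', value.toString(), '-t', 'float', '-u', units]
        cmd.execute().waitFor()
    }

    // For example, report JVM heap usage from a scheduled job:
    publishMetric('jvm_heap_used_mb', Runtime.runtime.totalMemory() / (1024 * 1024), 'MB')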
Log Analysis Service
For application-level monitoring, we have implemented a custom log analysis service. Due to the proliferation of logging frameworks in the Java space, we use SLF4J to aggregate logs from log4j, commons logging and java.util.logging into a single log stream, which is sent over a socket to a simple log analysis service. This service uses logback to filter logging messages and forward severe or warning messages to our support mailing list. This gives us a pretty good sense of when application problems arise and how to track them down. We run the same service in our test environment to proactively find issues before they get into production. It has been absolutely invaluable in helping us detect and analyze problems with our system.
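On the application side this is plain SLF4J; with the SLF4J bridge jars in place, output from log4j, commons logging and java.util.logging all ends up in the same stream. Here is a minimal sketch of what our logging calls look like – the class, feed and message names are hypothetical.

    import org.slf4j.Logger
    import org.slf4j.LoggerFactory

    class FeedProcessor {
        // SLF4J is the only logging API we code against; the logging backend ships
        // the resulting stream over a socket to the analysis service.
        private static final Logger log = LoggerFactory.getLogger(FeedProcessor)

        void process(feed) {
            log.info('processing feed ' + feed.id)
            try {
                // ... parse and store the feed ...
            } catch (Exception e) {
                // WARN and ERROR events are what the analysis service escalates
                // to the support mailing list.
                log.error('failed to process feed ' + feed.id, e)
            }
        }
    }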
We have also set up Nagios for real-time production monitoring and alerting. We started with Nagios prior to getting going with Ganglia, but after having it running for a while in our test environment, we realized that the system-level alerts we were receiving from Nagios were mainly noise without the context of trends in system performance. As Ganglia nicely fills this need, we've put Nagios on the back burner.
For simple web-level monitoring and uptime statistics gathering we’re using Pingdom. Pingdom provides a reliable and relatively inexpensive service for routinely pinging pages within our site on a configurable interval from its different servers around the world.
We’re continuing to add new monitoring to our site, but I feel that what we have set up now provides a pretty solid foundation for keeping our system running reliably.
This week’s release blog is brought to you by my colleague Cailie, who kindly agreed to be syndicated on my blog:
Thanks for inviting me to your blog, Owen. I love your choice of wallpaper…. it’s groovy man.
Before I begin – let me send a big congratulations to your Gramma on her 90th Birthday!!! :)
The big news on Monday was that we welcomed two new dev team members: an excellent Computer Science/Physics student on her first professional work experience, and a seasoned senior software architect hailing from the world of heavy traffic, high-profile web portals.
The last time new developers joined the team, we were in the midst of a crunch. We had to let them fend for themselves, and support each other, because we were all too busy getting the feature-set out on time. They ended up coping very well, and even managed to contribute significantly to the release. After the crunch, the team had a bit of a retrospective chat, and we all acknowledged that the new joinees were given relatively little support during their first few weeks. The two new developers shared their views on whether there had been a productivity gain or cost in handling their introduction this way.
One commented that in his efforts not to impede important development, he had gotten unnecessarily “stuck” on gotchas. He felt that a little more investment of knowledge transfer would have increased his productivity. However, he felt that it would have been inappropriate for a fully productive team member to spend time training when there were impending high-priority development targets. The other said that he had liked “getting thrown into the fire” because it had presented an opportunity to quickly get a sense of the big picture, and to learn how the team worked in a state of rapid development. We could see that there was no clear-cut answer to the question “how do we effectively introduce new developers”? So we conceded that new developers should pair with an experienced team member, to whatever extent possible, without affecting the path of critical development. There was a bit of shoulder shrugging as well, as if to say “it’s a tradeoff/balance.”
This week, however, the Product Manager suggested that our development theme be “JIRA Cleanup”. That is, bug fixes and small-magnitude features/tasks. In a startup, all development is important… but it was not a “crunch” week. So I paired with the co-op, and Owen paired with the senior guy.
Jeremy and I had talked before about the fact that the best task for a new team member is a vertical slice. That is, a task that spans across all software layers — from user interface, to business/domain layer, to data access layer, to database — but that involves as narrow a topical focus as possible. On Monday, I browsed through the “Unassigned” JIRA tasks scheduled for the week, and found a few issues that fit those criteria. I also noticed a critical-priority, and quick-to-fix, bug that would provide a rewarding introduction to the development process. I suggested these issues to her as appropriate starting points. In our team, we usually choose what tasks we want to do; the closest thing to assignment we have is nomination. So I explained to the co-op that I was just helping her select tasks for the first little bit; ultimately she would be choosing her own.
Weekly Release Blog #5 mentioned the policy at IMVU that strives to have new joinees promote code that they wrote to production on their first day. This is not part of our policy, but I think it's worth mentioning that the co-op's first bug fix (the critical-priority quick fix) was promoted to production on her second day! This was a JSP change that opened up some functionality that had become necessary for one of our customers. How gratifying, for one's first check-in at a new job to truly matter – to provide immediate and tangible benefit to the customer.
As for the senior guy, he and Owen flexed their fault-tolerant computing muscles this week. They troubleshot and corrected problems in the handling of data feed failures, problems that had been exposed by outages over the weekend and intermittently throughout the week. The experience underscored the importance of complete exception handling and thorough failure-state test cases.
During the ramp-up period this week, we seized the opportunity to audit and update our Wiki documentation. We are now firmly rooted in our Wiki habit; as with any healthy routine, it's annoying at first, but now we are reaping the benefits of our Wiki's geometric growth. Taking time to keep those development environment setup pages fresh and accurate pays for itself again and again. The new team members played along and updated any discrepancies they found as they worked through the steps.
Two recent process improvements helped reduce development environment configuration overhead: a script to create a development database (it downloads a suitable subset of the production data), and a script to automatically configure the application server (Glassfish). The Glassfish configuration script not only saved time spent on configuration, but also time that would have been spent troubleshooting magical errors resulting from an accidental configuration misstep. The new guy added another improvement to the development environment setup: he thoughtfully committed the development environment SSH settings (tunnels, server names, etc.) to the SVN repository. It's a sensible thing to do, but for some reason we had been passing our settings from person to person. Now that they're in Subversion, changes to the network can be easily applied to all the development machines. Developers merely need to do an SVN update before starting their SSH session. Well, except for those of us connecting via putty.exe (ahem).
Last week, I set up something that I have tried and failed to convince many operations managers to implement: getting all production configuration files and scripts under source control. I'm not sure whether it is out of security concerns, a fear of contaminating the production environment with version control software, or because revision control is considered to be strictly the domain of developers, but I've found major resistance to the whole notion of using SCM among operations folk (unless it is introduced under the banner of Configuration Management, and then it's some antiquated SCM like PVCS). And, no, backups are no substitute for a good version control system.
Having configuration files under version control provides a comprehensive history of changes to the production environment. If configuration is altered, it is easy to see what changed and to revert any undesired modifications. It also makes it easy to analyze differences between one environment and another, which greatly simplifies the task of keeping test in sync with production.
We use Subversion in development for managing source code, so it's a natural fit for managing production configuration as well. We simply tunnel svn over ssh to access the repository from each environment. We use the same repository as development, but for the security conscious, it's easy enough to set up a separate repository.
One challenge with bringing configuration files under version control is that the files tend to be spread out all over the file system. One option is to configure each application to load its configuration files from a common location, but this tends to be a hassle and may not be supported by some applications. Moreover, it generally means taking the system offline to change its config file location. A better alternative, at least on *nix systems, is to use symbolic links to reference a common folder under version control.
In terms of organizing the files, we simply have a folder structure that looks like:
/<server or server class>
    /<configuration files and scripts>
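Wiring an application up to the checked-out tree is then just a matter of creating symlinks. On the servers this is simply an ln -s; expressed in Groovy (via AntBuilder) for consistency with the other sketches, and with purely illustrative paths, it looks like:

    // Point an application's expected config location at the checked-out
    // configuration tree (paths are hypothetical).
    def configRoot = '/opt/config/app1'   // svn working copy for this server
    new AntBuilder().symlink(link: '/etc/myapp/app.properties',
                             resource: "${configRoot}/app.properties",
                             overwrite: true)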
I’ve always found the term Configuration Management to be pretty confusing. It can mean anything from “our code’s in version control” to having a separate team with their own SCM system serving as guardians tracking and vetting everything going from development into production. So in this context, I think that the term Configuration File Management is maybe a more accurate description of what I’m going for here. Regardless, it’s a practice that I would like to see more teams follow.
Three releases last week – at this rate I’ll need to rename this blog series. For the most part, the releases were a non-event. Quite quick and painless. One thing about releasing software in these small increments – frequent releases get easier, not harder.
One thing that is constant about each release is change: change (obviously) to the application logic, change to the database, or change to the environment that we are deploying into. All of these changes should be scripted. The last thing that you want to have to remember during the pressure of deployment is a set of extra manual steps required for just this release.
To deploy the application, we have a simple shell script that downloads the latest pinned build from our automated integration server and deploys it to the app server. We use JetBrains' TeamCity as our integration server to consistently build and package the latest code into a WAR. As for the app server, we're running Glassfish, which is nicely scriptable – at least as far as Java application servers go.
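Our actual script is a shell script, but the idea fits in a few lines; here is an equivalent sketch in Groovy. The TeamCity URL, artifact name and asadmin options are illustrative rather than our real configuration.

    // Pull the WAR for the latest pinned build out of TeamCity...
    def warUrl = 'http://teamcity/repository/download/ProductBuild/.lastPinned/product.war'
    def war = new File('/tmp/product.war')
    war.withOutputStream { out -> out << new URL(warUrl).openStream() }

    // ...and redeploy it to Glassfish (asadmin ships with Glassfish; --force
    // replaces the running version).
    ['asadmin', 'deploy', '--force=true', war.absolutePath].execute().waitFor()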
To deploy database changes, we're using Bering, a lightweight, Rails-inspired migration framework that uses Groovy as its scripting language. Deploying with this level of frequency, I consider a database migration framework to be absolutely essential: the last thing you want to be doing right before a release is hand-crafting database diffs. A migration framework makes database changes simple, automated and continuous. Database migrations are automatically applied to our test and production environments with each deploy. We also try to ensure that all database migrations are reversible (they have a corresponding "down" script) so that changes can be rolled back if necessary.
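For a sense of what a reversible migration looks like, here is a sketch in the up/down style that Bering's Groovy scripts follow. The class name, column and helper method are illustrative; this is not a reproduction of Bering's actual API.

    // An illustrative, Rails-style reversible migration written in Groovy.
    // It is handed a groovy.sql.Sql connection by the hypothetical framework.
    class AddChannelToReadings {

        void up(sql) {
            sql.execute 'ALTER TABLE readings ADD COLUMN channel_id INTEGER'
        }

        void down(sql) {
            sql.execute 'ALTER TABLE readings DROP COLUMN channel_id'
        }
    }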
Deploying environment and configuration changes tends to be a bit more of a manual effort at this point. Glassfish configuration changes can be scripted, and this is something that we are starting to take advantage of. These types of changes seem to happen infrequently enough that the process still tends to be ad hoc, but anything like this can easily be forgotten during a push. It would be ideal to have something similar to a migrations framework for applying configuration changes; we should take a closer look at something like Capistrano for this. We have all of our production configuration files in Subversion (the subject of a future post), so these can be pushed out using an 'svn up'. Most applications also require a restart or a kill signal to reload their configuration. As far as system installs, patches or upgrades are concerned, these also tend to be manual; there is support for OS-level patching from our host provider.
While I feel that our existing deployment process is quite simple and quick (as it needs to be to support weekly — or more frequent — releases), there is still ample room for improvement and further automation. I’ll continue to blog about enhancements to this process as we make them.
I wasn’t involved with last week’s release. Actually, I left work early to celebrate Swedish Christmas with my family. And the release went ahead without a hitch. Bliss.
I believe that making a release is a core responsibility that should be shared by all members of the team. Anyone should be able to do it, and everyone is accountable for regularly participating in a release to ensure that their knowledge of the release process and the production environment stays current. The release is where the rubber meets the road; it is where our software toils become truly relevant and valuable. It is a shared formative experience for everyone on the team.
As I had been involved with setting up and scaling the system out to the new servers, it was useful for me to sit out this release and let others understand the changes to the environment. Not that I mind.
For this to work, making a release has to be:
Last week’s deployment was more complex than usual – we redeployed the entire system and migrated onto new hardware. Our goal, as usual, was to minimize or prevent downtime. There were a number of ways to go about the roll out, so at the start of the day we sat down and built a plan for the release.
The plan consisted of a whiteboard diagram of the current server topology, the new server topology and the steps that we were planning to take to move from one to the other. We had been doing this more informally for previous releases, but having the plan explicitly specified ensured that everyone was on the same page throughout the day. And having the plan highly visible in the project area made it easy to gauge our progress and see what was left to do. I think that this will be a valuable addition to our weekly release process.