Last week, we needed to run a long database migration as part of our weekly deployment. Our company deals with large volumes of time-series data, the majority of which lives in a single large table (we haven't implemented data sharding yet). We needed to restructure and reindex this table, which we knew would take some time to run, and we wanted to perform the maintenance in a way that minimized user impact and the possibility of downtime.
Since we started practicing zero-downtime database deployment, we have had the freedom to decouple database expansion migrations from the actual release, because we ensure that all expansion migrations remain backwards compatible with the existing release. This mitigates some of the risk of releasing and lets us break a release into a few manageable chunks. This week we took the option of upgrading the database on Thursday night and then making the release on Friday.
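As a concrete illustration of what "expansion" means in practice, a backwards-compatible migration is purely additive. The table and column names below are invented for the example, not our actual schema:

```sql
-- Expansion migration: additive only, so the currently deployed
-- release keeps working unchanged against the upgraded schema.
ALTER TABLE readings ADD COLUMN device_type VARCHAR(32) NULL;
CREATE INDEX idx_readings_device_type ON readings (device_type);

-- Destructive changes (e.g. dropping a column the old release still
-- reads) are deferred to a later contraction migration, after the
-- new release is live.
```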
Prior to running the migrations, we halted all writes to the large table by suspending data processing from the queues that feed it. This is a nice side-effect of using queues: they provide a safe lever for controlling or throttling the behaviour of the system during deployment and maintenance. Pausing the queues wasn't technically necessary for the migration, but it did allow the migration to complete faster with less contention.
I would like to say that the migration went off without a hitch, but we incurred about 10 minutes of downtime during the table restructuring. In retrospect, we should have performed the restructuring on a temporary copy of the table and then swapped it with the actual table once the restructuring and reindexing were complete. Suspending writes to the table would have been essential here, as it would have kept the actual and temporary tables in sync.
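A minimal sketch of that shadow-table approach, in PostgreSQL-flavored SQL with a hypothetical `readings` table (the table name, index, and column are illustrative):

```sql
-- Build a restructured copy while writes to the table are suspended.
CREATE TABLE readings_new (LIKE readings INCLUDING DEFAULTS);
-- ... apply the structural changes to readings_new here ...

-- Copy the data across and reindex the copy before it goes live.
INSERT INTO readings_new SELECT * FROM readings;
CREATE INDEX idx_readings_new_recorded_at ON readings_new (recorded_at);

-- Swap the tables in one transaction; renames are near-instant
-- metadata operations, so this is where the downtime shrinks to.
BEGIN;
ALTER TABLE readings RENAME TO readings_old;
ALTER TABLE readings_new RENAME TO readings;
COMMIT;

-- Drop readings_old only after verifying the new table.
```

Because the renames themselves are effectively instantaneous, the user-visible window reduces to the moment of the swap, provided writes stay suspended while the copy is built.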
Another hiccup we encountered during the migration was that the remote connection to the production database server was severed mid-migration (the SSH session timed out). Oops. Fortunately, the connection broke during the last (and longest-running) step of the migration, and that step continued to completion on the database server. It did make us realize that running migrations remotely, especially for long-running operations, is risky.
We currently run migrations against production using Bering, the same migrations framework we use to migrate our development and test databases. This makes me appreciate why migrations frameworks like dbDeploy focus on generating the migration SQL script rather than executing it: seasoned DBAs are likely savvy to this risk and want something they can run directly on the database servers. I'm planning to extend Bering to generate SQL scripts that we can package as part of our deployment.
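To make the idea concrete, here is a hypothetical example of the kind of script such an extension might emit. The `changelog` table, its columns, and the change number are all illustrative, loosely modeled on how dbDeploy-style tools track applied changes:

```sql
-- Hypothetical output of a "generate, don't execute" migration tool.
-- A DBA can review this script and run it directly on the server.

-- Change #42: a backwards-compatible expansion.
ALTER TABLE readings ADD COLUMN source_id INTEGER NULL;

-- Record that the change was applied so it is never re-run.
INSERT INTO changelog (change_number, applied_at, description)
VALUES (42, CURRENT_TIMESTAMP, 'add source_id to readings');
```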