Blake Eggleston

Migrating Databases With Zero Downtime

Originally posted on tech.shift.com

We recently completed a massive database migration here at SHIFT.com, moving our application from MongoDB and Titan as its primary datastores to Cassandra, with no downtime or performance degradation.

Jon Haddad and I recently did a webinar with DataStax about the migration, and there was a lot of interest not only in our experience moving to Cassandra, but also in the specifics of how we migrated our entire application with no downtime. This post covers the mechanics of how we went about migrating the site; it doesn't get into data modeling or performance numbers. The webinar answers some of those questions, and if you missed it, you can catch it here: http://www.youtube.com/watch?v=9XHigNKJJhI.

Motivations

Over the past year, we had been battling performance problems with the Titan database, as well as the DevOps hassles of running a MongoDB cluster. Additionally, an unrelated backend data collection system running MongoDB had suddenly hit a performance ceiling, after which scaling became extremely difficult... something we wanted to avoid with SHIFT. One of the main causes of pain with both databases was that they provided too much abstraction over how our data was laid out on disk and across machines, which made performance unpredictable. What attracted us to Cassandra was the great deal of control it gives you over how your data is laid out on disk, exposed through an elegant query language. It is also much easier to work with from a DevOps perspective.

After a great deal of thought, discussion, and testing, we decided that moving our application to Cassandra would be the best move, long term, for SHIFT.

Goals

Clearly, migrating an entire application from one database to another is a big undertaking. Our situation was complicated by the fact that we weren't making a move like MySQL to PostgreSQL: the databases we were moving between had completely different data models, so none of our existing schemas would work with Cassandra. Additionally, SHIFT is an active application that is central to many people's day-to-day work, so we needed to perform this migration without any downtime and without any degradation in performance. And since there were so many components to migrate, pulling a few all-nighters, shutting the site down, migrating the data, and bringing the site back up wasn't a sustainable or practical approach. We needed to perform the migrations during the day, while people were using the site.

Our Solution

The solution we came up with was to split the migration into two parts: writing, then reading.

For each component we were migrating, we would come up with a data schema that made sense for that part of the system. We would then make a branch off of master: the 'writes' branch. The writes branch was responsible for two things. First, it mirrored every write to Mongo/Titan into the equivalent Cassandra table, so every time a Mongo document was saved, a corresponding row was saved to Cassandra. Second, it included a migration script that copied all of our historical data for that component into Cassandra. Once the writes branch was deployed and the migration script had run, all of our data was in both Mongo/Titan and Cassandra, and anything that was created or updated was written to both places.
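
To make this concrete, here's a minimal sketch of what a writes branch change and its migration script can look like, using Python with the pymongo and cassandra-driver clients. The schema and names here (a users table with id, name, and email columns, save_user, backfill_users) are hypothetical stand-ins, not our actual data model:

    from pymongo import MongoClient
    from cassandra.cluster import Cluster

    mongo = MongoClient().shift                          # hypothetical Mongo database
    cassandra = Cluster(['127.0.0.1']).connect('shift')  # hypothetical keyspace

    def save_user(user):
        # The primary write still goes to Mongo; reads haven't moved yet.
        mongo.users.replace_one({'_id': user['_id']}, user, upsert=True)
        # Mirror the write into the equivalent Cassandra table.
        cassandra.execute(
            "INSERT INTO users (id, name, email) VALUES (%s, %s, %s)",
            (str(user['_id']), user['name'], user['email']))

    def backfill_users():
        # One-off migration script: copy all historical data into Cassandra.
        # INSERT is an upsert in Cassandra, so rows already mirrored by the
        # writes branch are harmlessly rewritten with the same values.
        for user in mongo.users.find():
            cassandra.execute(
                "INSERT INTO users (id, name, email) VALUES (%s, %s, %s)",
                (str(user['_id']), user['name'], user['email']))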

Next, we would make a branch off of our writes branch: the 'reads' branch. The reads branch switched all reads for the migrated component from Mongo/Titan to the new Cassandra table(s), removed all references to Mongo/Titan for that component, and stopped the now-redundant writes to the old database. In practice, this was the most complex branch to write, because of minor variations in the way results come back from the different databases.
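
Continuing the hypothetical example above, the corresponding reads branch change might look like the sketch below. The reshaping at the end is the kind of detail that makes this branch fiddly: Mongo hands back dicts keyed by '_id', while the Cassandra driver returns rows as named tuples.

    def get_user(user_id):
        # Reads now hit Cassandra only; the Mongo code path is deleted.
        row = cassandra.execute(
            "SELECT id, name, email FROM users WHERE id = %s",
            (str(user_id),)).one()
        if row is None:
            return None
        # Reshape the row into the dict shape the rest of the app expects.
        return {'_id': row.id, 'name': row.name, 'email': row.email}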

Using this strategy, we were able to transparently run major migrations during the day, while people were using the site. People would occasionally notice that certain parts of the site got faster, but that was about it.

If you're thinking about doing migrations using this strategy, here are some things that will save you some time / sanity.

1. Migrate only one thing at a time. It can be tempting to try to kill several birds with one stone, but you're only adding complexity. Migrating a single component at a time makes it much easier to isolate any bugs that you've introduced in your migration.

2. Write both the read and write branches before deploying anything. You'll often find things that you missed in the writes branch while writing the reads branch. Additionally, knowing that your reads branch is bulletproof before deploying the writes branch takes a lot of the pressure off. Finally, having a reads branch ready to go means that you can deploy it as soon as your migration script is done. This reduces the window in which inconsistencies between the two databases can be introduced.

3. Spot check your data after deploying the writes branch. Hopefully, your unit tests cover most of your use cases; however, it's still important to check that your data is actually being written as expected in production (see the sketch after this list).
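
For point 3, a spot check can be as simple as sampling the most recently written documents from the old database and comparing them against their mirrored Cassandra rows. As before, the schema and names are hypothetical:

    def spot_check_users(sample_size=100):
        # Compare a sample of recent Mongo documents against their mirrored
        # Cassandra rows, returning the ids of any that don't match.
        mismatches = []
        for doc in mongo.users.find().sort('_id', -1).limit(sample_size):
            row = cassandra.execute(
                "SELECT id, name, email FROM users WHERE id = %s",
                (str(doc['_id']),)).one()
            if row is None or (row.name, row.email) != (doc['name'], doc['email']):
                mismatches.append(doc['_id'])
        return mismatches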

Results

With the migration complete, we're very happy with our approach. We didn't have any major problems during the migration, and there was no downtime or performance degradation. We're also very happy with our switch to Cassandra: the site has become much faster as a result, and Cassandra is really easy to work with.