…continued from A story about two servers.
So, now the servers have been put into place at our new co-location. The old servers have been replaced to full extent and have been moved from our old co-location to the new one aswell.
I thought I would tell you a bit about what problems I encountered during this server migration and how I came to terms with them.
The first and most obvious problem was, since we could not put in the new servers at our old co-location, how would we sync our 180GB databases so they contained the same data on the new servers as on the old servers, virtually in real time. Setting up db replication slaves would put too much of a strain on the system and would build up massive binlogs. Since the old systems were being replaced because of lack of disk space, this was not an option. I had to analyse the applications running on the servers and device a solution which only sent the core data and let the rest of the data be built parallel and identically on both systems.
So after many log debug sessions and consulting with the other developers who knew this system a lot better than I did, I realized that we had three forms of raw input into the database and the rest of the data was a result of that input.
One of the input forms came in as packages, once per day and customer at a total of about 300MB per day. All of these packages were automatically uploaded to our web server and saved in the appropriate place by the script receiving the file.
We solved this by modifying the script to curl each file which was uploaded, to the new servers. Thus when a customer uploaded their daily package to us, our servers automatically replicated that request to go to the new servers as well.
Another input form also came as packages but they came via email, once per day and customer. This was a no brainer, we set up the email address receiving the mail to simply forward all the mails to the new server as well.
The third input form was a bit trickier. Live data not sent by our customers but by their individual browser clients. Now the critical difference here is that, while I can take a snapshot of the databases from the old servers and transfer it during the day to the new servers and put it into their databases, that process took 18 hours. This would mean that any database I would put into the system would be up to 18 hours old, thus it would be missing 18 hours of this live data. This would not be acceptable. However. This is what I started by doing. We also put in place just like with the first input format, a curl which would transfer all the data as it came in to the old servers, from the old servers to the new servers.
So now I was left with an accurate database except there was an 18 hour gap in the data. Luckily, all the data from this input form was inserted into a single table of the database, and even if some customers had 6GB of data in this table, very little were used each day and much of it could be discarded. So since I previously set up backup scripts for the old servers as explained in this blog entry, I could start truncating this table so that it only contained as little as two days data. This meant that the replication of this table could go a lot faster.
By making the affected tables very small by truncating them and by opening the firewall on the old servers to let in database connections from the new servers I could sync each table for each customer individually and close the 18 hour gap completely. This, without putting too much stress on the old servers.
Having two sets of servers which did the same things parallel gave me a chance to clean up the old system as well. There were a lot of old artefacts of test code here and there, databases which were not used and files which simply could not be moved while the servers were in production. This opportunity gave me the chance to clean up after so much that had gone wrong previously. The new servers were not just a replica of the old servers. They were a clean and augmented version of the old servers.
After the servers were totally in sync and I was sure that they processed all data correctly and presented it correctly, I changed the NS records for our domain from the old server ip to the new server ip and let the transition slowly take place.
How did we know everything was identical on both servers? I put together a script which did the bulk of it. The script made two database connections, one to the new servers and one to the old. For each database, each table and each row, it made an md5 checksum and compared these checksums from both databases. I even did this without sorting the mysql result sets. When databases are identical, they really are identical it seems. The script gave out warnings for some of the tables, but all of those warnings turned out to be simply sorting differences. They were few enough that we could check them by hand. The script ran for approx 8h. I feel I could have made it more efficient, but it sufficed for our purpose.
Now with this server migration complete, I hope to be able to write a little bit more backend/frontend blogs here.
Tags: data sync, Database, database sync, live database data sync, mysql, replication, server migration