On Friday, September 5th, we had about an hour and a half of unexpected downtime. At 2:33pm Pacific Time, our server monitoring service notified me by text message that Planning Center was down. I called Rackspace, and it took about an hour to figure out exactly what had happened, then another half hour to bring the system back up and verify that all the services were running.
We are meeting with Rackspace today to discuss how they can monitor our servers more thoroughly so that this cannot happen again.
We are sorry for the inconvenience this caused. If you were adversely affected by the downtime, please e-mail us or give us a call so that we can hear from you and look into making it up to you.
For those interested, here are the specifics of the failure. Several months ago we purchased a secondary database server to hold a real-time copy of all the data on Planning Center Online, so that if our primary database server ever crashes, we can use the secondary as our primary server. To do this we used MySQL Replication. We paid Rackspace's MySQL Database Administrators to set this up for us and assumed that they had done it correctly, which it looks like they did, except for one setting.
MySQL Replication works by keeping a log of every command executed on the master database server, so that the slave database servers can execute the same commands and end up with an exact replica of the master database. For the Planning Center Online database, these log files amount to several gigabytes per day. There is a setting within MySQL that automatically purges old log files after a set number of days. This is the setting that the Rackspace DBA forgot to set, so our hard drive filled up with over 100GB of unnecessary log files, which in turn caused our database engine to crash.
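For readers who want to watch for this kind of problem themselves, here is a minimal sketch of a disk check for replication logs. It is not what Rackspace runs for us; the function names, the `mysql-bin.` file prefix (a common default for MySQL binary logs) and the 14-day warning threshold are all assumptions made for illustration. In MySQL versions of this era the purge setting described above is, to my knowledge, `expire_logs_days`; the sketch only measures the logs and does not change that setting.

```python
import os
import shutil


def binlog_bytes(log_dir, prefix="mysql-bin."):
    """Total size in bytes of files in log_dir whose names start with prefix.

    The prefix is an assumed default; the small binary log index file
    shares it and is counted too, which is negligible in practice.
    """
    total = 0
    for entry in os.scandir(log_dir):
        if entry.is_file() and entry.name.startswith(prefix):
            total += entry.stat().st_size
    return total


def days_until_full(free_bytes, growth_bytes_per_day):
    """Rough number of days before the disk fills at a constant growth rate."""
    if growth_bytes_per_day <= 0:
        return float("inf")
    return free_bytes / growth_bytes_per_day


def check(log_dir, growth_bytes_per_day, warn_days=14):
    """Summarize log usage and flag the volume if it will fill up soon."""
    free = shutil.disk_usage(log_dir).free
    days = days_until_full(free, growth_bytes_per_day)
    return {
        "binlog_bytes": binlog_bytes(log_dir),
        "free_bytes": free,
        "days_left": days,
        "warn": days < warn_days,
    }
```

As a rough illustration, if the logs grew at an assumed 3GB per day, `days_until_full(100e9, 3e9)` shows that 100GB of free space would last only about a month, which is why an unset purge setting can go unnoticed for a while and then take the server down.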
Once I was able to get in contact with a Rackspace DBA, he applied this setting and restarted the MySQL server, which purged all of those files and was running great once again. He also did a checksum comparison between the master and slave databases to make sure that there was no data corruption on the slave (there was none).
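I don't know exactly which tool the DBA used for the comparison, but the idea can be sketched as follows, assuming you have already collected one checksum per table from each server (for example with MySQL's `CHECKSUM TABLE` statement). The function name and the sample table names are hypothetical.

```python
_MISSING = object()  # marks a table that is absent on one side


def diff_checksums(master, replica):
    """Return the sorted names of tables whose checksums differ.

    master and replica map table name -> checksum. A table present on
    only one server is reported as a mismatch.
    """
    tables = set(master) | set(replica)
    return sorted(
        t for t in tables
        if master.get(t, _MISSING) != replica.get(t, _MISSING)
    )


# Hypothetical checksums, for illustration only.
master = {"people": 1111, "plans": 2222, "songs": 3333}
replica = {"people": 1111, "plans": 9999}
mismatched = diff_checksums(master, replica)  # "plans" differs, "songs" is missing
```

An empty result means every table matched, which is the outcome we got.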
If you have any more questions or comments, please e-mail or call us and we will be glad to help.
Jeff Berg, Owner/Developer, Ministry Centered Technologies