If you use Planning Center Services or Check-Ins, you might have noticed that our sites had some serious reliability and performance issues at just about the worst possible time on Sunday morning, February 21st. Not only did this affect you, our customer, but it affected many of us personally since we rely on Planning Center in our own churches. We take that responsibility very seriously. Your trust means everything to us, and here are some changes we're implementing so that we can keep that trust:

  • We're making serious investments in our infrastructure to increase redundancy. On Sunday there were several problems which on their own could have been handled without affecting customers, but combined they created a major outage. We're adding more servers and more layers of reporting and monitoring, which will help protect against these kinds of failures in the future.
  • We've launched a new status page so we can communicate with you better during future incidents. There you'll be able to see the real-time status of our systems, and it will act as our first point of communication while an incident is ongoing. Our new status page is available at http://status.planningcenteronline.com.
  • We've created a Twitter account specifically for sharing system status updates. You can follow it at @PCOstatus.
  • We've made some internal changes to give our customer support team more information, so they can communicate with you and help you more effectively.

We are blown away by the increasing number of churches who rely on us, and humbled that we get to support so many ministries. We will continue to do everything in our power to not let you down again.

If you're interested in the technical nitty gritty of what happened, read on.

Lots of Servers

All of our applications are hosted from two different data centers, one on the east coast in New Jersey and the other on the west coast in Los Angeles. We do this primarily for two reasons. First, data from a web server gets to you faster when it doesn't have to travel as far. (You can blame those pesky, unbendable laws of physics.) If you're on the east coast, our site responds much faster when you're connecting to servers in New Jersey than when your information has to travel all the way to Los Angeles and back. The second major benefit is that the two data centers act as backups for each other. When we need to perform planned maintenance, or hit an unexpected problem, we can shift all traffic to a single data center. This usually happens without customers even noticing (except perhaps a slightly increased response time due to the added distance).

Generally, this works really well and has been a great advantage for us. The catch, however, is that each data center needs the capacity to handle our entire peak traffic load on its own, which means purchasing twice as much hardware as we'd otherwise need. On Sunday morning we found out that our growth has outpaced what our infrastructure can handle in a failover situation, and it's time to buy more.
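
If you like seeing the numbers behind that tradeoff, here's a minimal sketch of the capacity math in Python. The figures are hypothetical rather than our actual server counts; the point is simply that each data center has to be sized for the combined peak, not just its own half.

    # A toy capacity check: with two data centers backing each other up,
    # each one must be able to absorb the *combined* peak load on its own.
    # Every number below is made up for illustration.

    PEAK_RPM = 70_000         # combined peak traffic, in requests per minute
    RPM_PER_SERVER = 9_000    # hypothetical capacity of a single web server
    HEADROOM = 1.25           # keep 25% spare capacity for traffic spikes

    def servers_needed_per_data_center() -> int:
        """How many servers each data center needs to carry the full peak alone."""
        required_rpm = PEAK_RPM * HEADROOM
        full_servers, remainder = divmod(required_rpm, RPM_PER_SERVER)
        return int(full_servers) + (1 if remainder else 0)

    per_dc = servers_needed_per_data_center()
    print(f"Each data center needs {per_dc} servers, or {per_dc * 2} across both sites.")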

The Incident

At 6:53am Pacific time, the phones of everyone on our operations team started buzzing with automated alerts from our server monitoring systems. (If you work in tech, you may know the sense of dread that comes when your phone goes off at odd hours.) It turned out that a single server on the west coast was reporting slightly higher than usual load and increased response times for a small portion of our traffic. It was not a serious or widespread incident yet, but we like to know what's going on early so that we can prevent small issues from becoming large ones. While our team investigated the root cause of that single server's performance (we have five web servers in each data center), our automated systems detected another problem in our east coast data center, one they considered serious enough to automatically divert all customers to the single, already strained, west coast data center, during the busiest hour of the busiest day of the week. That's when you began to notice problems.

This was also when a few members of our customer support team ran into the problem themselves, as Planning Center stopped working for them in the middle of their worship team rehearsals. They grabbed their laptops and started helping however they could.

Normally, when our systems automatically fail over and divert traffic off a failing data center, they continue to test that data center's connection and eventually restore its share of the load once it is operational again. Unfortunately, a setting was inadvertently changed on February 11th, which prevented this from happening and kept the full firehose of our traffic aimed at the west coast. We measure our traffic in requests per minute, or RPM. At the peak of the incident we were handling nearly 70,000 RPM, and the single data center couldn't keep up. (To put that number in perspective, roughly 5,000 people would have navigated to a new page in the time it took you to read this sentence.) This is when you started seeing increased load times and widespread errors.
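
To make that failure mode concrete, here's a rough sketch in Python of the kind of failover logic involved. The flag, the function names, and the health states are all hypothetical (the real behavior lives in our load balancer configuration, not application code), but it shows how a single mis-set option can keep traffic pinned to one data center after the other has recovered.

    # Simplified, hypothetical failover logic; illustrative only.
    RESTORE_WHEN_HEALTHY = False   # stands in for the setting that was inadvertently changed

    def route_traffic(east_healthy_by_minute):
        """Return where traffic is routed each minute, given east coast health checks."""
        routing = []
        failed_over = False
        for healthy in east_healthy_by_minute:
            if not failed_over and not healthy:
                failed_over = True                 # divert everyone to the west coast
            elif failed_over and healthy and RESTORE_WHEN_HEALTHY:
                failed_over = False                # rebalance once the east coast recovers
            routing.append("west only" if failed_over else "both coasts")
        return routing

    # The east coast looks unhealthy for two minutes, then recovers.
    print(route_traffic([True, False, False, True, True, True]))
    # With RESTORE_WHEN_HEALTHY set to False, every minute after the failover
    # stays "west only"; set it to True and traffic rebalances as soon as the
    # east coast is healthy again.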

Eventually, we were able to get the east coast data center back online and carrying its share of the traffic. Unfortunately, in the confusion of diagnosing multiple unrelated issues simultaneously, it took just over an hour to get there. By 8:10am, everything was back to normal.

Technical Takeaways

The incident highlighted a few weak spots in our infrastructure, monitoring, and processes. Here are some technical changes that we've either already implemented or are in the process of making:

  • The initial increased load on our west coast data center was caused by several small issues, all compounding into one medium-sized issue. We'll be addressing each of these individually in the coming weeks by further decoupling our systems from each other and from third parties.
  • During our post-mortem analysis, we discovered that not only did our automatic failover fail to rebalance traffic once the east data center was back up, but that it never should have diverted traffic in the first place. It was a false alarm. We've reconfigured our load balancers to get more information before acting.
  • Our monitoring systems failed to alert us that traffic had been diverted, which made it difficult to understand why it had been diverted. We're adding more layers of monitoring; a sketch of the kind of check we have in mind follows this list.
  • We need more hardware in each data center so that we can handle peak load in either place without breaking a sweat. Machines have been ordered and will be in service in the coming weeks.
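
As one example of that extra monitoring, here's a rough sketch of the kind of check we want paging us: an alert that fires whenever a single data center is carrying nearly all of the traffic. The names and the threshold are illustrative, not our production configuration.

    IMBALANCE_THRESHOLD = 0.90   # alert if one site is serving 90% or more of requests

    def check_traffic_balance(rpm_by_data_center):
        """Return alert messages for any data center serving almost all of the traffic."""
        total = sum(rpm_by_data_center.values())
        alerts = []
        for site, rpm in rpm_by_data_center.items():
            if total and rpm / total >= IMBALANCE_THRESHOLD:
                alerts.append(f"{site} is serving {rpm / total:.0%} of {total} RPM; "
                              "traffic may have been diverted")
        return alerts

    # During the February 21st incident, a check like this would have fired
    # almost immediately:
    print(check_traffic_balance({"west": 69_000, "east": 1_000}))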

Conclusion

Even if your eyes glaze over at techno-babble, I hope we've been able to communicate just how important it is to us to keep our products running smoothly and reliably, and that we don't take that responsibility lightly. Thank you again for letting us partner with you.