Outage Post-Mortem: September 2nd, 2014

On September 2nd, 2014 there was an outage of all Stack Exchange sites (Q&A sites as well as Careers) from 17:20 to 17:24 UTC (approximately 4 minutes). This was a cascading failure.
A Python script used for monitoring our servers was updated and pushed to the load balancers. The updated script wasn’t tested, and it included a syntax error. When the monitoring script crashed due to the syntax error, it triggered the abrt system (CentOS’s crash reporting system), which in turn ran the sosreport plugin.
About an hour into creating its archive, the crash report started copying the current state of much of /proc. While this specific step of the crash report was running, the system became unstable and stopped accepting new TCP connections; we believe this was due to reading some specific data out of /proc/net on a system with such a large number of open connections.
No automatic failover was triggered, because keepalived was still able to communicate between the two redundant systems and is not currently configured to monitor HAProxy service health.
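A service-level check of the kind that was missing can be quite small. The following is only a sketch in Python, not the configuration in use: the function names are made up for illustration, and the default address 127.0.0.1:80 is a placeholder for wherever HAProxy actually listens. It reports healthy only if a new TCP connection is accepted, which is exactly what stopped working during this outage.

```python
import socket


def haproxy_accepts_connections(host, port, timeout=2.0):
    """Return True if a new TCP connection to host:port succeeds in time."""
    try:
        # create_connection raises OSError on refusal, timeout, or
        # any other failure to establish the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main(argv):
    """Return a process exit status: 0 if healthy, 1 otherwise."""
    # Hypothetical defaults: a local HAProxy frontend on port 80.
    host = argv[1] if len(argv) > 1 else "127.0.0.1"
    port = int(argv[2]) if len(argv) > 2 else 80
    return 0 if haproxy_accepts_connections(host, port) else 1
```

keepalived's vrrp_script mechanism treats an exit status of 0 as healthy and anything else as a failure, so a small wrapper that calls `sys.exit(main(sys.argv))` could be tracked from it. A check like this can still misfire under heavy load, so the timeout and the number of consecutive failures required before failover would need tuning.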
We have planned the following corrective actions:
- Test scripts before pushing … duh.
- Disable or tune abrt / sosreport on all our machines - sosreport taking 90 minutes to run (and making the system’s critical services unresponsive while it does) means it’s probably not well suited to our load balancers.
- We are going to experiment with changing the health check mechanism that triggers a failover from the machine level to the software level. So instead of just pinging at the machine level, keepalived will be triggered by a failed check of the state of HAProxy. We have been afraid to do this historically due to false positives, but since we have had multiple failures where automated failover has not worked, we want to try a different angle.
- Move websockets to a dedicated set of load balancers (Our sites don’t go down when websockets is down, and websockets is responsible for the majority of concurrent sockets).
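The first corrective action, testing scripts before pushing, can start with something as small as a syntax check. The sketch below is an illustration in Python, not the process actually adopted; the helper names `syntax_error_in` and `check_files` are hypothetical. It relies on the built-in `compile()`, which parses source code without executing it, so it catches syntax errors only and says nothing about runtime bugs. It also checks against whichever Python version runs the check, which is assumed to match the one on the load balancers.

```python
def syntax_error_in(source, filename="<script>"):
    """Return None if source compiles, else a short error description."""
    try:
        # compile() parses the code but does not run it.
        compile(source, filename, "exec")
    except SyntaxError as err:
        return "%s:%s: %s" % (err.filename, err.lineno, err.msg)
    return None


def check_files(paths):
    """Return a list of syntax error descriptions for the given files."""
    errors = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            problem = syntax_error_in(handle.read(), path)
        if problem is not None:
            errors.append(problem)
    return errors
```

A deployment step could call `check_files` on the scripts about to be pushed and refuse to continue if the returned list is not empty.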
Outage Schedule of Events (UTC Time)
- 16:27:10 New version of monitoring script deployed via puppet
- 16:27:17 First abort of the monitoring script on ny-lb05
- 17:19:20 Conntrack starts to decline - 655,320
- 17:19:30 CPU starts to climb on ny-lb05
- 17:20:11 Stack Overflow down (pingdom)
- 17:20:12 Sockets climbing on ny-lb05
- 17:21:25 Conntrack starts to decline - 615,312
- 17:23:50 Conntrack starts to increase - 743,000
- 17:24:05 keepalived killed (ny-lb05)
- 17:24:11 Stack Overflow up (pingdom)
September 3, 2014 (3:03 pm)