Outage Postmortem: September 11th, 2014

On Thursday, September 11th, a number of users were unable to successfully load any Stack Exchange Q&A site. One specific resource, a CSS file, was failing to load from our CDN provider, CloudFlare.
The cause of this was twofold:
- Between 18:55 and 19:00 UTC, CloudFlare’s monitoring systems began detecting 50% packet loss from 6 of their data centers to our servers. This loss was later traced to a common routing path that these data centers were using to reach us.
...which had no impact, until...
- Between 19:05 and 19:08, a new build was rolled out to our web servers, which updated the ‘all.css’ file. The HTML being sent to users references a specific version tag (href="//cdn.sstatic.net/stackoverflow/all.css?v=3e873254b4db", for example), so users’ browsers began requesting the new version from the CDN servers.
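To illustrate why a new build sends a wave of requests back to the origin, here is a minimal sketch of a version-tagged URL combined with a pull-through cache keyed by the full URL. This is not our actual build or CloudFlare's caching code: the names `version_tag`, `versioned_url`, and `PullThroughCache` are hypothetical, and deriving the tag from a SHA-1 hash of the file contents is an assumption made only for the example.

```python
import hashlib
from typing import Callable, Dict


def version_tag(content: bytes, length: int = 12) -> str:
    """Return a short hex tag derived from the file contents.

    Assumption: a truncated SHA-1 digest. The real tag may be
    produced some other way.
    """
    return hashlib.sha1(content).hexdigest()[:length]


def versioned_url(base_url: str, content: bytes) -> str:
    """Append the version tag as a query string, e.g. all.css?v=<tag>."""
    return f"{base_url}?v={version_tag(content)}"


class PullThroughCache:
    """Toy CDN edge cache (hypothetical): keyed by the full URL."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_requests = 0

    def get(self, url: str) -> bytes:
        if url not in self._store:
            # Cache miss: go back to the origin. If the fetch raises,
            # nothing is stored, so the next request retries the origin.
            self.origin_requests += 1
            self._store[url] = self._fetch(url)
        return self._store[url]
```

In this model, a changed file produces a changed tag and therefore a URL the cache has never seen. If every fetch for that URL fails, every user request turns into another origin request.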
Each time one of our users requested the new version, the CDN servers tried to fetch it from our servers, and each attempt was doomed to fail. stackoverflow.com’s 34 KB all.css file is split into 23 packets of 1,500 bytes each to send over an HTTP connection; with consistent 50% packet loss, TCP just couldn’t get all of the data across before the connections timed out. Our servers sent this file 158,000 times during the outage; few, if any, of those transfers were successfully received by the CDN servers.
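The arithmetic behind those figures can be sketched in a few lines. This is a rough back-of-the-envelope model, not a simulation of TCP: it treats 34 KB as 34,000 bytes, ignores header overhead, and assumes each packet is lost independently with probability 0.5. The function names are made up for this example.

```python
import math


def packets_needed(file_size_bytes: int, packet_size_bytes: int) -> int:
    """Number of packets required to carry the file, rounding up."""
    return math.ceil(file_size_bytes / packet_size_bytes)


def first_pass_success_probability(packet_count: int, loss_rate: float) -> float:
    """Chance that every packet arrives on its first transmission,
    assuming independent losses (a simplification)."""
    return (1.0 - loss_rate) ** packet_count


def expected_transmissions_per_packet(loss_rate: float) -> float:
    """Mean number of sends until one copy of a packet gets through,
    again assuming independent losses."""
    return 1.0 / (1.0 - loss_rate)


if __name__ == "__main__":
    count = packets_needed(34_000, 1_500)  # 23 packets
    p = first_pass_success_probability(count, 0.5)
    print(f"packets: {count}")
    print(f"all arrive first try: about 1 in {round(1 / p):,}")
    print(f"expected sends per packet: {expected_transmissions_per_packet(0.5)}")
```

Under these assumptions, the chance of all 23 packets arriving without a single retransmission is 1 in 2^23, or roughly 1 in 8.4 million. Real TCP does retransmit, but each loss also triggers backoff and shrinks the congestion window, so a transfer that needs about twice as many sends on average can easily run past the connection timeout.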
At approximately 19:50, the routing issue was resolved. Unfortunately, we hadn’t yet pinned down the common route at that point, so we were unable to determine exactly which hop in the route caused the problem. Our best guess is that the transit provider responsible detected the failure through its own monitoring and resolved it.
We’re looking into ways to be more resilient against this kind of failure, and to notice it sooner. We’re sorry for the inconvenience to those of you who were affected!
Our thanks to the team at CloudFlare for helping us to track down exactly what went wrong, and for working with us to help find ways to prevent such a severe outage from an internet routing problem in the future.
- September 12, 2014 (9:08 pm)