At 20:29 UTC today, connectivity was lost to our NY datacenter via our primary ISP. Normally, requests would still flow through our alternate ISP - we announce our IP addresses over both links via BGP, so when one drops, traffic should shift to the working path to our systems.
However, the ISP outage was not complete - our BGP session with them stayed up the whole time, and they kept advertising our routes to the rest of the internet, even though every packet bound for our systems was being dropped. We have a ticket open with them to determine how this happened.
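For the curious: this is the nasty gap in relying on BGP alone for failover - a live session only proves the control plane is up, not that packets actually get through. A minimal sketch of the kind of external data-plane probe that catches this failure mode (the addresses here are documentation placeholders, and this is an illustration, not our actual monitoring):

```python
#!/usr/bin/env python3
"""Data-plane health check: BGP can stay up while packets are black-holed,
so probe actual reachability of each ISP path from the outside."""
import subprocess

# Hypothetical probe targets: one address reached via each ISP link.
PROBES = {
    "primary-isp": "198.51.100.10",   # placeholder documentation address
    "alternate-isp": "203.0.113.10",  # placeholder documentation address
}

def path_is_alive(addr: str, count: int = 5) -> bool:
    """Return True if at least one ICMP echo out of `count` comes back
    (Linux ping exits 0 when any reply is received)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", addr],
        capture_output=True,
    )
    return result.returncode == 0

for name, addr in PROBES.items():
    status = "OK" if path_is_alive(addr) else "BLACK-HOLED? alert!"
    print(f"{name} ({addr}): {status}")
```

The key design point is that the probe runs from outside our own network, so it sees what users see rather than what the BGP session claims.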
With no way to make our main IP addresses reachable for most users, our options were either to fail over to our DR datacenter in read-only mode, or to enable CloudFlare - we’ve been testing them for DDoS mitigation, and have separate ISP links in the NY datacenter dedicated to traffic from them.
We decided to turn on CloudFlare, which surfaced a different problem - one caused by our past selves. Some history for the interested:
In October, we had a situation where a flood of crafted requests was causing high resource utilization on our Tag Engine servers (Tag Engine is our internal application for associating questions and tags in a high-performance way). The attacking hosts were configured to send their requests through CloudFlare for stackoverflow.com, and the CloudFlare servers accepted and proxied them since they’re hosting our DNS (leaving us one button push away from enabling full proxying).
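That “one button push” is CloudFlare’s proxied (orange-cloud) toggle on a DNS record. For illustration only, here’s roughly what flipping it looks like against today’s v4 API - the zone and record IDs and the token are placeholders, and the mechanics were different back then, so treat this as a sketch of the idea rather than what we actually ran:

```python
#!/usr/bin/env python3
"""Illustrative sketch: flip a DNS record from DNS-only to proxied
via CloudFlare's v4 API. All IDs and the token are placeholders."""
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "your-zone-id"      # placeholder
RECORD_ID = "your-record-id"  # placeholder
TOKEN = "your-api-token"      # placeholder

req = urllib.request.Request(
    f"{API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
    data=json.dumps({"proxied": True}).encode(),  # orange-cloud the record
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="PATCH",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["success"])
```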
Because we didn’t want these problematic requests reaching us, and they were all arriving via CloudFlare, we configured a rule that redirected any request for stackoverflow.com coming through CloudFlare to http://127.0.0.1/, instead of letting the attack traffic hit our servers.
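The logic of such a rule boils down to something like the sketch below (where the rule actually lived isn’t spelled out above, so this is purely illustrative; detecting CloudFlare via the CF-Connecting-IP header is a simplification - a real rule would match CloudFlare’s published IP ranges, since headers can be spoofed):

```python
#!/usr/bin/env python3
"""Sketch of the mitigation rule's logic: anything arriving via
CloudFlare gets a 302 to localhost; everything else is served normally."""
from http.server import BaseHTTPRequestHandler, HTTPServer

class MitigationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Simplified "came through CloudFlare" test; see caveat above.
        if "CF-Connecting-IP" in self.headers:
            self.send_response(302)
            self.send_header("Location", "http://127.0.0.1/")
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"served normally\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), MitigationHandler).serve_forever()
```

The rule made sense in October, when the only traffic coming via CloudFlare was the attack; it stopped making sense the moment all traffic came via CloudFlare.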
That redirect rule for mitigating the attack months ago was still in place. Anyone visiting stackoverflow.com was being 302 redirected to http://127.0.0.1/; not our finest hour. Our primary ISP link came back online at about the same time we noticed this misconfiguration, around 20:36, so we immediately disabled CloudFlare - but some people still saw the redirect for up to 5 minutes afterward due to DNS caching. By 20:41, everything was back to normal.
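From a client’s point of view, the whole failure was one bare redirect. A snippet like this (purely illustrative) would have shown it during the window - and kept showing it from any resolver that still had the CloudFlare-pointing records cached, until the TTL expired:

```python
#!/usr/bin/env python3
"""Observe the redirect without following it."""
import http.client

conn = http.client.HTTPConnection("stackoverflow.com", 80, timeout=10)
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))
# During the incident this printed: 302 http://127.0.0.1/
```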
We’re working with the ISP to determine why our failover didn’t work as planned, and we’re feeling an appropriate amount of shame for our attack mitigation rule being applied to everyone. Sorry about that!