We are expecting Stack Overflow and possibly the Stack Exchange network to have some significant availability issues today. Our best guess is that there will be multiple short-lived outages, but a prolonged outage during our peak traffic times is possible. We believe that Careers will be available since it runs on a different SQL cluster.
What Happened:
Microsoft’s current analysis is that we have run into a bug in Microsoft SQL Server after our SQL 2014 upgrade. Since our weekly indexing job ran at 3AM UTC yesterday, SQL has not been maintaining a healthy cache of execution plans. This increases the load on the CPU because SQL has to keep recompiling execution plans. We believe that increased load will be beyond our capacity.
Current Plan:
Work with Microsoft to get this fixed and tune SQL to mitigate the impact
We engaged Microsoft on this issue yesterday. We will maintain an open conference with them until the issue is resolved. Their SQL developers are actively investigating the issue; we have provided them with all the information they have asked for and will continue to provide whatever they need. We are also going to try to tune things to minimize the impact of the bug. Our SQL Server consultant Brent Ozar is going to assist us with this for the day.
Contingency: Build out a 2012 R2 instance
Since we can’t rely on Microsoft to provide a fix quickly, we are going to work on standing up the previous version of SQL Server on another machine. We can’t directly restore the database to the older version since there is no backward compatibility. Since this bug hit only a couple of weeks after the upgrade, our transaction logs don’t go back that far. That means we need to use a tool to bulk insert our database after standing this up. We hope to have this up by tomorrow, but there are a lot of unknowns.
Improve our App Side Resilience
Several of the developers are going to work on improving our error pages for when SQL becomes unavailable. We are also going to try to reduce the time it takes to switch to secondary SQL servers and to automatically fall back to our replicas in read-only mode.
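The read-only fallback idea can be illustrated with a minimal sketch. This is not our production code: it is written in Python for brevity, and the names `run_with_fallback`, `DatabaseUnavailable`, and the server labels are hypothetical. It assumes a failed connection surfaces as a `ConnectionError` and that only read-only work may be sent to a replica.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class DatabaseUnavailable(Exception):
    """Raised when none of the candidate servers could answer."""


def run_with_fallback(
    query: Callable[[str], T],
    primary: str,
    replicas: Sequence[str],
    read_only: bool,
) -> T:
    """Try the primary first; read-only work may fall back to replicas.

    `query` is a hypothetical callable that runs the work against the
    named server and raises ConnectionError if that server is down.
    """
    # Writes must never be routed to a replica, so they only get the primary.
    candidates = [primary] + (list(replicas) if read_only else [])
    last_error = None
    for server in candidates:
        try:
            return query(server)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next server
    raise DatabaseUnavailable(f"all servers failed: {candidates}") from last_error
```

With this shape, a page that only reads data can still render from a replica while the primary is down, and anything that needs to write fails fast so the application can show an error page instead.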