This morning at approximately 05:41:59 UTC we experienced an outage of our NY-SQLCL04 Cluster which hosts most things that aren’t Stack Overflow. This outage lasted approximately 2 minutes.
Beginning 20 minutes earlier, we observed VPN connectivity issues with our secondary data center in Oregon. This resulted in dynamic quorum fluctuation as the SQL cluster’s 3rd node (OR-SQL02) was in and out of communication with the 2 in New York. At 05:41:59.705 NY-SQL03 and NY-SQL04 (the NY members) lost communication with each other inside the New York data center. With Oregon already offline, this further loss of communication between the 2 NY nodes resulted in a quorum loss from the point of view of both members.
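To make the quorum arithmetic concrete, here is a minimal sketch of a majority-vote check. It is a simplified model and not how Windows clustering is implemented: it assumes one static vote per node and ignores the dynamic quorum vote adjustments mentioned above. The `has_quorum` function and the `NODES` set are hypothetical names used only for illustration.

```python
# Simplified majority-vote model (an assumption for illustration):
# one static vote per node, no dynamic quorum vote adjustment.
NODES = {"NY-SQL03", "NY-SQL04", "OR-SQL02"}


def has_quorum(visible, total_votes=len(NODES)):
    """Return True if a member can see a strict majority of votes.

    `visible` is the set of nodes this member can reach, including itself.
    """
    return len(visible) > total_votes // 2


# Oregon unreachable, New York link healthy: each NY member sees 2 of 3 votes.
ny_link_up = has_quorum({"NY-SQL03", "NY-SQL04"})

# Oregon unreachable and the NY members cannot see each other:
# each member sees only its own vote.
ny_link_down = has_quorum({"NY-SQL03"})

print(ny_link_up, ny_link_down)
```

Under this model, losing Oregon alone leaves a majority in New York, but once the NY members also lose sight of each other, neither side holds a majority, so both consider quorum lost.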
To prevent a split-brain situation, the nodes enter an effective offline state when a loss of quorum occurs. When Windows clustering observes a quorum loss, it initiates a state change of orphaned SQL resources (the availability groups that the affected databases belong to). In the case of NY-SQL03 (the primary before the event), the databases were both not primary and not available, since the AlwaysOn Availability Group was offline to prevent split-brain (yes…we get the irony).
Current Changes Planned: The offline behavior of the primary in this failure scenario will be improved by a SQL 2014 upgrade in the coming months. After that upgrade, NY-SQL03 would have had its availability groups enter a read-only replica state, resulting in a brief read-only period on our sites rather than a full outage. While this is not a solution, it will drastically reduce the impact of this kind of failure on our users.
Next Steps: We are still determining the exact reason the 2x 10Gb redundant connections for each of the NY members failed for any duration. We run a packet capture of the entire New York network at all times for just this situation. If we can determine a more specific cause of the network failure, this postmortem will be updated with that information.
Note: Many of our SysAdmins and Developers are traveling to New York today for our biannual meetup. This will delay our postmortem analysis more than usual.