We are currently online and running operations from the New York data center again after the power outage earlier this morning. We are in a holding pattern since the UPS system serving both our primary and backup power feeds (yes, we’re very excited to learn this too) is still being worked on.
At 11:59 UTC the decision was made to move operations back to New York and risk a few seconds of data loss rather than run partially up and read-only in Oregon for an indeterminate amount of time.
Here is a timeline of the events following our last update:
- 10:19 UTC sstatic.net was pointed back to New York since Oregon didn’t have a very recent content change
- 10:50 UTC Chat operations moved back to New York (chat.stackexchange.com source record moved)
- 11:20 UTC Internal DNS servers are operational
- 11:21 UTC Update from Internap:
At this time we continue to investigate our UPS system. The current status is commercial power is restored, however our engineers are still working with our electrical vendors to stabilize UPS power.
We will provide another update in 1 hour.
- 11:26 UTC NY-SEARCH01 (Elasticsearch) was found to have a RAID10 inconsistency; repair begins
- 11:36 UTC Elasticsearch cluster is now available, yellow, and recovering
- 11:59 UTC Decision made to move all services back to the New York data center
- 12:02 UTC DNS migration script build updates both local and CloudFlare DNS servers - TTL for all affected records was and is 5 minutes
- 12:13 UTC APIv2 is brought offline and online again briefly to resolve redis connection issues
- 12:25 UTC Update from Internap (yes, it’s identical):
At this time we continue to investigate our UPS system. The current status is commercial power is restored, however our engineers are still working with our electrical vendors to stabilize UPS power.
We will provide another update in 1 hour.
- 12:28 UTC Internal load balancers are brought online
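As a rough illustration of what the 5-minute TTL in the 12:02 UTC entry means in practice: a resolver that cached the old record just before the change could keep serving it until the TTL runs out. The minimal sketch below does that arithmetic. The function name `cache_expiry_utc` is made up for this example, and it assumes the DNS update took effect at the listed time and that resolvers honor the TTL, which not all of them do.

```python
def cache_expiry_utc(change_hhmm: str, ttl_minutes: int) -> str:
    """Return the latest UTC time (HH:MM) at which a TTL-honoring resolver
    could still serve a record it cached just before the change."""
    hours, minutes = map(int, change_hhmm.split(":"))
    # Work in minutes since midnight and wrap around at the end of the day.
    total = (hours * 60 + minutes + ttl_minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


# DNS change at 12:02 UTC with a 5-minute TTL (both values from the timeline).
print(cache_expiry_utc("12:02", 5))  # 12:07
```

Under those assumptions, well-behaved clients would have been resolving to New York by about 12:07 UTC.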
We are still assessing secondary systems, but it appears we’ve had no drive data loss, which was our biggest fear in a hard power-down situation.
We will provide further updates as we get them. There are no planned immediate infrastructure changes or additional outages on our side at this time. However, there are several things we can improve about the unexpected failover process, and we’ve already begun working on them.