Our hosting provider will be performing maintenance on the UPS system to remove 3 bad batteries on April 25th, 2013, sometime between 6:00 PM and 11:59 PM. We are told this will cause no interruption to our power or services, and we have no reason to believe otherwise.
This is an informational post only, we hope. Otherwise we’re offline and it obviously had an impact.
This Wednesday we’ll be upgrading the storage for the database servers in the New York data center. During this window we hope to have less than a minute of downtime during the failover sometime mid-day and the same again when failing back.
Currently our Dell R710 servers, NY-SQL03 and NY-SQL04, each have 6x Intel 710 200GB drives for data (plus 2 other drives for the OS). They will be upgraded to 4x Intel S3700 800GB drives apiece, in a RAID10. This leaves room for 2 more later if space is needed before the next SSD generation.
Those 12 left-over Intel 710s (6 per server) will join the 13 drives already present in each of NY-SQL01/02 (Dell R720s), each of which has a 12x 200GB RAID10 + hot spare. Currently this leaves room for 11 more drives in each server. After the upgrade they’ll each have an 18x 200GB RAID10 + hot spare for primary data.
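To put the before and after numbers side by side, here is a rough sketch of the capacity arithmetic. It assumes nominal drive sizes and the usual RAID10 rule that usable space is half the raw space, ignores formatting overhead, and leaves the hot spares and OS drives out. The function name is made up for illustration.

```python
def raid10_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Nominal usable capacity of a RAID10 array.

    Every drive is mirrored, so only half of the raw capacity is usable.
    Hot spares are not part of the array and must not be counted here.
    """
    if drive_count < 4 or drive_count % 2 != 0:
        raise ValueError("RAID10 needs an even number of drives, at least 4")
    return (drive_count // 2) * drive_size_gb


# Drive counts and sizes are taken from the post; the labels are ours.
layouts = {
    "NY-SQL03/04 before": (6, 200),
    "NY-SQL03/04 after": (4, 800),
    "NY-SQL01/02 before": (12, 200),
    "NY-SQL01/02 after": (18, 200),
}

for name, (count, size_gb) in layouts.items():
    usable = raid10_usable_gb(count, size_gb)
    print(f"{name}: {count}x {size_gb}GB -> {usable}GB usable")
```

By this estimate NY-SQL03/04 go from roughly 600GB to 1600GB of usable data space each, and NY-SQL01/02 from roughly 1200GB to 1800GB.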
We’ll upgrade 04, move drives and upgrade 02, then fail over to those replicas. After that, we’ll repeat the procedure on 03 and 01, likely failing back to them later in the day. While we’re powering them down (quite rare now), we’ll be installing 10Gb NICs in each server, but won’t be activating them quite yet.
Weather permitting, we will be moving most Stack Exchange operations to our new New York data center facility on Saturday Feb 9th, 2013.
A high-level overview (a.k.a. “THE PLAN”) of what will happen on Friday night and Saturday:
Here’s what you should expect to see while browsing the sites:
We hope all goes to plan, but fully expect to be extinguishing fires that arise as a result of our move in the hours that follow. If you see any major issues that haven’t been reported once we’re back up, please alert us with a post on meta.stackoverflow.com. We’ll be watching it closely.
Our primary SQL Server experienced a “non-yielding scheduler” and wrote a 234 GB memory dump to disk before restarting. During this time all sites were offline. After the crash dump completed, SQL Server was able to start without issue. We will continue to monitor the server, and we plan to switch all of our sites back over to the NY servers, which are running fully patched SQL 2012 instances, this weekend. Stay tuned…
We will be performing 2 read-only tests against our New York data center this afternoon starting shortly after 8:30 PM UTC.
First meta.stackoverflow.com and then stackoverflow.com will appear briefly in read-only mode while we load-test our New York servers. After the test the sites will return to their normal functionality.
We were unavailable for about 15 minutes due to a large-scale DDoS attack that hit Hurricane Electric, one of our transit providers (our provider’s provider). They null-routed the attack as soon as they were made aware of it.
At this point all connectivity has been restored. We will continue to monitor the situation.
We were down for around 30 minutes due to DNS issues.
Things should be coming up again shortly. We recently relocated our NY datacenter, and therefore all the IPs in NY changed.
We didn’t update the master DNS server’s IP to the new address (in both named.conf and the zone files), so our slaves could no longer refresh from it, and all of them hit our expiry time (one week). This is fixed now, so things should start coming back up.
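For anyone wondering why a stale master address takes a week to bite, here is a small sketch of the timing. It assumes the standard SOA expire behavior: a slave keeps answering from its last good copy of the zone until the expire interval has passed since its last successful refresh, then stops serving the zone. The function names and the dates are hypothetical; only the one-week expiry comes from this post.

```python
from datetime import datetime, timedelta, timezone

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # SOA expire value of one week (604800)


def zone_expires_at(last_successful_refresh: datetime, soa_expire_seconds: int) -> datetime:
    """Moment at which a slave gives up on a zone it can no longer refresh."""
    return last_successful_refresh + timedelta(seconds=soa_expire_seconds)


def slave_still_serving(now: datetime, last_successful_refresh: datetime,
                        soa_expire_seconds: int) -> bool:
    """True while the slave's copy of the zone has not yet expired."""
    return now < zone_expires_at(last_successful_refresh, soa_expire_seconds)


# Hypothetical timeline: the last refresh succeeds just before the master's
# IP changes, and every later refresh attempt fails.
last_refresh = datetime(2013, 1, 1, 12, 0, tzinfo=timezone.utc)

for days_later in (1, 6, 7, 8):
    now = last_refresh + timedelta(days=days_later)
    state = "serving" if slave_still_serving(now, last_refresh, ONE_WEEK_SECONDS) else "expired"
    print(f"day {days_later}: {state}")
```

The takeaway is that nothing looks wrong for the first week, which is how a missed IP update can go unnoticed until every slave expires at about the same time.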
We apologize for the oversight on our part.
We were down for a few minutes this morning due to a failure on our Linux load balancer:
Dec 12 14:22:13 or-rtlb01 kernel: igb 0000:07:00.1: Detected Tx Unit Hang
This happened right after a rebuild of our web socket servers, which forces approximately 30,000 to 100,000 connections to reconnect (depending on the time of day). The load balancer was very unhappy with this, though it’s never been a problem in previous builds.
We have failed over and back from our secondary load balancer, which resumed normal site operations. Even though the primary load balancer appears to be fine after the kick, we’re not trusting it just yet and have shifted it back in rotation while we fully investigate.
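If you want to watch for the same symptom, here is a minimal sketch of picking the kernel message quoted above out of a syslog stream. The regular expression is written against that one sample line only, so treat the field layout as an assumption. The function name is hypothetical, and this is not how our own monitoring does it.

```python
import re
from typing import Optional

# Matches lines shaped like the sample: timestamp, host, "kernel:", driver
# name, PCI address, then the hang message.
TX_HANG = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: "
    r"(?P<driver>\S+) "
    r"(?P<pci_address>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Detected Tx Unit Hang"
)


def parse_tx_hang(line: str) -> Optional[dict]:
    """Return the parsed fields if the line reports a Tx unit hang, else None."""
    match = TX_HANG.match(line)
    return match.groupdict() if match else None


sample = "Dec 12 14:22:13 or-rtlb01 kernel: igb 0000:07:00.1: Detected Tx Unit Hang"
print(parse_tx_hang(sample))
```

Lines that don't report a hang return None, so the function can be applied to every line of a log file and the non-None results alerted on.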
We worked with the current datacenter to find the cause of the outages after the last issue a half hour ago. A default route was incorrect, and we believe this to be the root cause of the intermittent connectivity throughout the day, a lingering issue from this morning’s changes on the upstream side.
PEAK Internet has now corrected the route, and Stack Exchange should be online and stable. We apologize for the inconvenience caused by the interruptions.
Here’s the info from our upstream provider on the outage we experienced earlier this morning:
I suspect that, when the router under maintenance came back online, there was a couple of minutes of routing inconsistency on our border router connecting to LS Networks (our upstream that feeds the Corvallis office directly).
This was unexpected, and should not have happened. The fact that the connectivity issue only arose when the router-under-maintenance came back online (and not when we shut it off), and that the inbound traceroute died on one of our Corvallis border routers, suggests to me that the interaction between OSPF and BGP on the Corvallis routers may be the culprit, and I will perform a review of the local routing design.