Stack Exchange Network Status

Here we'll post updates on outages and maintenance windows for the Stack Exchange Network. You can also get status updates by following @StackStatus.
  • Facility Maintenance Apr. 25, 2013

    Our hosting provider will be performing maintenance on the UPS system to remove 3 bad batteries on April 25th, 2013, sometime between 6:00 PM and 11:59 PM.  We are told this will cause no interruption to our power or services and have no reason to believe otherwise.

    This is an informational post only, we hope…otherwise we’re offline and it obviously had an impact.

    • 1 month ago
  • Database Maintenance Mar. 27, 2013

    This Wednesday we’ll be upgrading the storage for the database servers in the New York data center.  During this window we hope to have less than a minute of downtime during the failover sometime mid-day and the same again when failing back.

    Currently our Dell R710 servers, NY-SQL03 and 04, each have 6x 200GB Intel 710 drives for data (plus 2 others for the OS).  They will be upgraded to 4x Intel S3700 800GB drives apiece, in a RAID10.  This leaves room for 2 more later if space is needed before the next SSD generation.

    Those 12 leftover Intel 710s will be split between NY-SQL01 and 02 (Dell R720s), 6 apiece, joining the 13 drives already present in each: a 12x 200GB RAID10 + hot spare.  Currently this leaves room for 11 more drives in each server.  After the upgrade they’ll have an 18x 200GB RAID10 + hot spare for primary data.
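
    As a rough check on the drive counts above, here is a minimal sketch of the nominal usable capacity before and after the upgrade. It assumes every data array is a RAID10 (the post only says so for the new S3700 arrays and for NY-SQL01/02), uses the nominal drive sizes, and ignores formatting overhead, OS drives, and hot spares. The function name is made up for illustration.

```python
def raid10_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Nominal usable capacity of a RAID10 array: half the raw capacity,
    because every drive is mirrored."""
    if drive_count < 4 or drive_count % 2 != 0:
        raise ValueError("RAID10 needs an even number of drives, at least 4")
    return drive_count * drive_size_gb // 2


# (drive count, drive size in GB) for the data arrays only.
# Assumption: the existing 6-drive arrays are RAID10 as well.
arrays = {
    "NY-SQL03/04 before": (6, 200),
    "NY-SQL03/04 after": (4, 800),
    "NY-SQL01/02 before": (12, 200),
    "NY-SQL01/02 after": (18, 200),
}

usable = {name: raid10_usable_gb(*spec) for name, spec in arrays.items()}
for name, gb in usable.items():
    print(f"{name}: {gb} GB usable")

# Drive bookkeeping for NY-SQL01/02: 12 leftover drives, 6 per server.
leftover_per_server = (6 * 2) // 2
data_drives_after = 12 + leftover_per_server   # 18 drives in the RAID10
free_bays_after = 11 - leftover_per_server     # 5 bays still empty
```

    Under those assumptions, NY-SQL03/04 go from 600 GB to 1,600 GB of usable data space each, and NY-SQL01/02 from 1,200 GB to 1,800 GB each.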

    We’ll upgrade 04, move drives and upgrade 02, then fail over to those replicas.  After that, we’ll repeat the procedure on 03 and 01, failing back to them likely later in the day.  While we’re powering them down (quite rare now), we’ll be installing 10Gb NICs in each server, but won’t be activating them quite yet.

    • 2 months ago
  • Succeeding back to New York - Feb. 9, 2013

    Weather permitting, we will be moving most Stack Exchange operations to our new New York data center facility on Saturday Feb 9th, 2013.

    A high level overview (a.k.a. “THE PLAN”) of what will happen on Friday night and Saturday:

    • 02:00 UTC Saturday
      • Restoring the latest full backups from Oregon (OR)
    • 15:00 UTC Saturday
      • We’ll do a go/no-go check with Internap (our colo) in New York due to the blizzard there.
      • Restore the latest transaction log backups from OR
      • Slave our Redis caching servers in NY to pull from the OR live servers
    • 16:45 UTC Saturday
      • Restore another batch of transaction log backups from OR
    • 17:00 UTC Saturday
      • A final go/no-go decision will be made with the latest info from Internap
      • Careers ads will disappear from Stack Overflow
      • Read-only mode engaged for Q&A Sites + Careers
      • All live cache layer data will be persisted
      • Final database saves will be completed, then locked in read-only mode
      • CDN will be re-pointed to New York
      • Final transaction logs moved to New York and restored
      • New York databases recovered and added to AlwaysOn Availability Groups, and brought out of read-only mode
      • DNS changeover to our New York IPs
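
    The database part of the plan follows the usual SQL Server pattern: restore a full backup without recovering, keep applying log backups, and recover only after the final log. The sketch below generates that statement sequence. It is an illustration, not the scripts we actually ran; the database name, file paths, and the `restore_script` helper are hypothetical.

```python
def restore_script(database: str, full_backup: str, log_backups: list[str]) -> list[str]:
    """Build T-SQL for a full restore followed by log restores.

    Everything is restored WITH NORECOVERY so that later log backups
    can still be applied; the last statement brings the database online.
    """
    statements = [
        f"RESTORE DATABASE [{database}] FROM DISK = N'{full_backup}' WITH NORECOVERY;"
    ]
    for log_file in log_backups:
        statements.append(
            f"RESTORE LOG [{database}] FROM DISK = N'{log_file}' WITH NORECOVERY;"
        )
    # After recovery no further log backups can be applied.
    statements.append(f"RESTORE DATABASE [{database}] WITH RECOVERY;")
    return statements


# Hypothetical names, for illustration only.
script = restore_script(
    "ExampleDb",
    r"D:\restore\ExampleDb_full.bak",
    [r"D:\restore\ExampleDb_log_1500.trn", r"D:\restore\ExampleDb_log_1645.trn"],
)
for statement in script:
    print(statement)
```

    Keeping the databases unrecovered until the final transaction logs arrive is what lets the read-only window cover only the last, small batch of changes.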

    Here’s what you should expect to see while browsing the sites:

    • 15:00 UTC Saturday
      • A banner pointing to this blog post will appear on the main Q&A sites
    • 17:00 UTC Saturday
      • All sites (excluding chat) will enter a read-only mode
    • 17:30 UTC - Soon™ after that Saturday
      • You’ll be pointed to our New York data center, and the sites should be back to normal

    We hope all goes to plan, but fully expect to be extinguishing fires that arise as the result of our move in the hours that follow.  If you see any major issues that haven’t been reported once we’re back up, please alert us with a post on meta.stackoverflow.com; we’ll be watching it closely.

    • 3 months ago
  • Database Issue Feb. 5th 2013

    Our primary SQL Server experienced a “non-yielding scheduler” and wrote a 234 GB memory dump to disk before restarting.  During this time all sites were offline.  After the crash dump completed, SQL Server was able to start without issue.  We will continue to monitor the server, and we plan to switch all of our sites back over to NY this weekend, where the servers are running fully patched SQL 2012 instances.  Stay tuned…

    • 3 months ago
  • Testing our New York datacenter Jan. 29th, 2013

    We will be performing 2 read-only tests against our New York data center this afternoon starting shortly after 8:30 PM UTC.  

    First meta.stackoverflow.com and then stackoverflow.com will appear briefly in read-only mode while we load-test our New York servers.  After the test the sites will return to their normal functionality.

    • 3 months ago
  • Routing issue Jan. 25th, 2013

    We were unavailable for about 15 minutes due to a large-scale DDoS attack that hit Hurricane Electric, one of our transit providers (our provider’s provider).  They null-routed the attack as soon as they were made aware of it.

    At this point all connectivity has been restored.  We will continue to monitor the situation.

    • 4 months ago
  • Outage Jan. 21st, 2013

    We were down for around 30 minutes due to DNS issues.

    Things should be coming up again shortly. We recently relocated our NY datacenter, and therefore all the IPs in NY changed.

    We didn’t update the master DNS server’s IP to the new address (in both named.conf and the zone files), so our slaves could no longer refresh from it and all of them hit the zone expiry time (one week). This is fixed now, so things should start coming back up.
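
    To make the one-week delay concrete: a slave keeps answering from its last copy of the zone until the SOA expire interval has passed since its last successful refresh, then stops serving the zone. Here is a minimal sketch of that arithmetic. The refresh date is hypothetical and chosen only for illustration; it is not taken from our logs.

```python
from datetime import datetime, timedelta, timezone

SOA_EXPIRE_SECONDS = 7 * 24 * 3600  # one week, i.e. 604800 seconds


def zone_expiry(last_successful_refresh: datetime, expire_seconds: int) -> datetime:
    """Moment a slave stops answering for the zone if every refresh
    since `last_successful_refresh` has failed."""
    return last_successful_refresh + timedelta(seconds=expire_seconds)


# Hypothetical timestamp of the last zone refresh that reached the master.
last_ok = datetime(2013, 1, 14, 12, 0, tzinfo=timezone.utc)
expires_at = zone_expiry(last_ok, SOA_EXPIRE_SECONDS)
print(expires_at.isoformat())
```

    The gap between the change and the outage is why a mistake like this is easy to miss: nothing visibly breaks until the expiry interval runs out on every slave.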

    We apologize for the oversight on our part.

    • 4 months ago
  • Outage Dec. 12th, 2012

    We were down a few minutes this morning due to a failure on our Linux load balancer:

    Dec 12 14:22:13 or-rtlb01 kernel: igb 0000:07:00.1: Detected Tx Unit Hang

    This happened right after a rebuild of our web socket servers, which reconnects approximately 30,000 to 100,000 connections (depending on the time of day).  The load balancer was very unhappy with this, though it’s never been a problem in previous builds.

    We have failed over and back from our secondary load balancer, which resumed normal site operations.  Even though the primary load balancer appears to be fine after the kick, we’re not trusting it just yet and have shifted it back in rotation while we fully investigate.
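
    For anyone who wants to watch for the same kernel message, here is a small sketch that picks the host, driver, and PCI address out of a syslog line like the one quoted above. The regular expression is written against that single line only, so treat the field layout as an assumption about your own syslog format; `parse_tx_hang` is a made-up helper name.

```python
import re

# Matches lines of the form:
#   "<Mon> <day> <HH:MM:SS> <host> kernel: <driver> <pci address>: Detected Tx Unit Hang"
TX_HANG = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: "
    r"(?P<driver>\S+) "
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Detected Tx Unit Hang"
)


def parse_tx_hang(line: str):
    """Return the parsed fields as a dict, or None if the line is not a Tx hang report."""
    match = TX_HANG.match(line)
    return match.groupdict() if match else None


sample = "Dec 12 14:22:13 or-rtlb01 kernel: igb 0000:07:00.1: Detected Tx Unit Hang"
fields = parse_tx_hang(sample)
print(fields)
```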

    • 5 months ago
  • Outage Nov. 8th, 2012 Update

    We worked with the current datacenter to find the cause of the outages after the last issue a half hour ago.  A default route was incorrect, and we believe this to be the root cause of the intermittent connectivity throughout the day - a lingering issue from this morning’s changes on the upstream side.

    PEAK Internet has now corrected the route and Stack Exchange should now be online and stable.  We apologize for the inconvenience caused by the interruptions.

    • 6 months ago
  • Outage Nov. 8th, 2012 Update

    Here’s the info from our upstream provider on the outage we experienced earlier this morning:

    I suspect that, when the router under maintenance came back online, there were a couple of minutes of routing inconsistency on our border router connecting to LS Networks (our upstream that feeds the Corvallis office directly).

    This was unexpected, and should not have happened. The fact that the connectivity issue only arose when the router-under-maintenance came back online (and not when we shut it off), and that the inbound traceroute died on one of our Corvallis border routers, suggests to me that the interaction between OSPF and BGP on the Corvallis routers may be the culprit, and I will perform a review of the local routing design.

    • 6 months ago
© 2012–2013 Stack Exchange Network Status