We were down for a few minutes this morning due to a failure on our Linux load balancer:
Dec 12 14:22:13 or-rtlb01 kernel: igb 0000:07:00.1: Detected Tx Unit Hang
This happened right after a rebuild of our web socket servers, which reconnects approximately 30,000 to 100,000 connections (depending on the time of day). The load balancer was very unhappy with this, though it’s never been a problem in previous builds.
We failed over to our secondary load balancer and back, which restored normal site operations. Even though the primary load balancer appears to be fine after the kick, we’re not trusting it just yet and have shifted it back in rotation while we fully investigate.