Stack Exchange Network Status

Here we'll post updates on outages and maintenance windows for the Stack Exchange Network. You can also get status updates by following @StackStatus.
  • Outage Postmortem - July 20, 2016

    Overview

    On July 20, 2016, we experienced a 34-minute outage starting at 14:44 UTC. It took 10 minutes to identify the cause, 14 minutes to write the code to fix it, and 10 minutes to roll out the fix to a point where Stack Overflow became available again.

    The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our web servers. The post was in the home page list, so the expensive regular expression was called on each home page view, and the home page stopped responding fast enough. Because the home page is what our load balancer uses for the health check, the load balancer took the servers out of rotation and the entire site became unavailable.

    Follow-up Actions

    • Audit our regular expressions and post validation workflow for any similar issues
    • Add controls to our load balancer to disable the health check, as we believe everything but the home page would have been accessible if it wasn’t for the health check
    • Create a “what to do during an outage” checklist since our StackStatus Twitter notification was later than we would have liked (and a few other outage workflow items we would like to be more consistent on).

    Technical Details

    The regular expression was: ^[\s\u200c]+|[\s\u200c]+$ which is intended to trim Unicode whitespace from the start and end of a line. A simplified version of the Regex that exposes the same issue would be \s+$ which to a human looks easy (“all the spaces at the end of the string”), but which means quite some work for a simple backtracking Regex engine. The malformed post contained roughly 20,000 consecutive characters of whitespace on a comment line that started with -- play happy sound for player to enjoy. For us, the sound was not happy.

    If the string to be matched against contains 20,000 space characters in a row, but not at the end, then the Regex engine will start at the first space, check that it belongs to the \s character class, move to the second space, make the same check, and so on. After the 20,000th space there is a different character, but the Regex engine expected either another space or the end of the string. Realizing it cannot match like this, it backtracks and tries matching \s+$ starting from the second space, checking 19,999 characters. The match fails again, and it backtracks to start at the third space, and so on.
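    The walk described above can be modelled in a few lines of Python. This is only a rough sketch: count_class_checks and closed_form are hypothetical helper names, the model counts only the successful "is this whitespace?" checks and ignores the engine's other bookkeeping, and Python's str.isspace stands in for the \s character class (the two do not cover exactly the same characters).

```python
def count_class_checks(text: str) -> int:
    # Simplified model of a backtracking search for \s+$ in `text`.
    # From every start position, consume whitespace greedily and count
    # each successful character-class check. A real engine does extra
    # work (failed checks, giving characters back) that is ignored here.
    checks = 0
    n = len(text)
    for start in range(n):
        pos = start
        while pos < n and text[pos].isspace():
            checks += 1
            pos += 1
        if pos == n and pos > start:
            # The whitespace run reached the end of the string: match found.
            return checks
    return checks


def closed_form(run_length: int) -> int:
    # Sum 1 + 2 + ... + run_length, the number of checks the model makes
    # for a whitespace run of that length followed by a non-space character.
    return run_length * (run_length + 1) // 2


# A run of 200 spaces followed by a non-space character never matches,
# so the model retries from every start position inside the run.
small_run = count_class_checks(" " * 200 + "x")
print(small_run, closed_form(200))
```

    Under this model, doubling the length of the whitespace run roughly quadruples the number of checks, which is the quadratic growth the failing matches produce.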

    So the Regex engine has to perform a “character belongs to a certain character class” check (plus some additional things) 20,000+19,999+19,998+…+3+2+1 = 200,010,000 times, and that takes a while. This is not classic catastrophic backtracking (talk on backtracking): performance is O(n²) in the length of the whitespace run, not exponential, but it was enough. This regular expression has been replaced with a substring function.
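    The post does not show the substring function that replaced the regular expression, so the following Python sketch is an assumption about what such a replacement could look like, not the code that was deployed. The names trim_line and is_trimmable are hypothetical, and str.isspace is used as an approximation of the \s class.

```python
ZERO_WIDTH_NON_JOINER = "\u200c"


def is_trimmable(ch: str) -> bool:
    # Approximates the character class [\s\u200c] from the original regex.
    return ch.isspace() or ch == ZERO_WIDTH_NON_JOINER


def trim_line(line: str) -> str:
    # Scan inward once from each end, then take a single slice.
    # Every character is examined at most once, so the cost is linear
    # in the length of the line no matter where the whitespace sits.
    start = 0
    end = len(line)
    while start < end and is_trimmable(line[start]):
        start += 1
    while end > start and is_trimmable(line[end - 1]):
        end -= 1
    return line[start:end]


# A long whitespace run in the middle of a line is never scanned at all,
# because both loops stop at the first non-trimmable character they meet.
line = "-- play" + " " * 20000 + "x"
print(trim_line(line) == line)
print(repr(trim_line("  \u200c hi \u200c ")))
```

    Because each loop stops at the first character that is not trimmable, interior whitespace costs nothing, which avoids the repeated rescans described above.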

    • July 20, 2016 (8:47 pm)
© 2012–2016 Stack Exchange Network Status