On March 28th, 2016, we lost question access across the network for approximately 12 minutes. The big change in play was moving the static content for every Q&A site to a different directory path. The move helps developers and designers work more efficiently and enables new functionality such as per-site mobile themes. It was also a tech-debt item: the current structure simply does not scale well to the hundreds of sites we now have.
The content move itself changed the following paths in the solution:
Old path: StackOverflow\Content\<sitename>
New path: StackOverflow\Content\Sites\<sitename>
The actual changes included moving about 5,600 files and updating many path references between them. On the static content side, everything worked in the initial deploy, but references on the sites to those resources could not be updated immediately due to the nature of a rolling build. To resolve this we opted for a simple route: don’t remove the old content. This worked fine in dev, but the new build parameter on our deployment PowerShell script that controls this behavior, -UpdatedStaticContentOnly, did not save correctly when applied to the production build. As a result, all of the old static content disappeared at the same time as the move. Mirroring the current build’s folder has always been the existing behavior, but we rarely remove things, so it had not been an issue until now.
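To make the difference between the two deployment behaviors concrete, here is a minimal sketch in Python. It is not our PowerShell deployment script. The function names, the `updated_only` parameter (a stand-in for the intent of -UpdatedStaticContentOnly), and the `all.css` file are assumptions made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def deploy(build_dir: Path, target_dir: Path, updated_only: bool) -> None:
    """Copy a build's static content into target_dir (hypothetical helper).

    updated_only=False mirrors the build: files in the target that are not
    part of the build are deleted. updated_only=True only adds or overwrites.
    """
    build_files = {p.relative_to(build_dir) for p in build_dir.rglob("*") if p.is_file()}

    if not updated_only:
        # Mirror mode: remove every target file the build no longer contains.
        # (Empty directories are left behind; that is fine for this sketch.)
        for existing in [p for p in target_dir.rglob("*") if p.is_file()]:
            if existing.relative_to(target_dir) not in build_files:
                existing.unlink()

    for rel in build_files:
        dest = target_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(build_dir / rel, dest)


def old_content_survives(updated_only: bool) -> bool:
    """Deploy the new layout over the old one and report whether the old file is still there."""
    with tempfile.TemporaryDirectory() as tmp:
        build, target = Path(tmp, "build"), Path(tmp, "target")
        old_file = target / "stackoverflow" / "all.css"           # old layout, already live
        new_file = build / "Sites" / "stackoverflow" / "all.css"  # new layout, in the build
        for f in (old_file, new_file):
            f.parent.mkdir(parents=True)
            f.write_text("body {}")
        deploy(build, target, updated_only)
        return old_file.exists()
```

With `updated_only=True` the old file stays in place next to the new one. With `updated_only=False` the old file is deleted during the deploy, which corresponds to what happened in production when the flag did not apply.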
The Image/Style Problem
Basics: the StackOverflow\Content folder maps to the root of the website sstatic.net and cdn.sstatic.net.
The net result of the instant content wipe was a race against our CDN caching: we had to fix references before people noticed that the content behind the CDN was gone, rather than having an open-ended transition to the new content paths. In other words: the expected behavior was to keep referencing old content for another few hours after the build and migrate the rest, using our traffic logs to see what was still getting hit in the old locations to ensure we got them all. The actual behavior was references to content that was now missing.
This resulted in styles and images rendering incorrectly in several dozen instances across the network. They were (for instance) pointing at sstatic.net/stackoverflow/<path> instead of sstatic.net/Sites/stackoverflow/<path>. We then had to fix these ASAP, since the content at those URLs would start to disappear as CDN caches decayed and expired. This led to the second set of downstream problems.
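The traffic-log check described above, finding requests that still hit the old locations, could look roughly like the following. This is a sketch under assumptions: the function name, the input format (a plain list of request URLs), and the example resource paths are hypothetical and do not reflect our actual log tooling.

```python
from collections import Counter
from urllib.parse import urlsplit


def old_location_hits(request_urls, site_names):
    """Count requests whose path still uses the old /<sitename>/... layout."""
    sites = {name.lower() for name in site_names}
    hits = Counter()
    for url in request_urls:
        path = urlsplit(url).path  # drops scheme, host, and query string
        segments = [s for s in path.split("/") if s]
        # Old layout: /<sitename>/...   New layout: /Sites/<sitename>/...
        if segments and segments[0].lower() in sites:
            hits[path] += 1
    return hits
```

Anything this reports is a reference that still needs to move to the /Sites/<sitename>/ form before the cached copy expires.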
The Questions Problem
The actual problem with questions was a result of the same move, but in a different way. The \Content folder is not only used for static content but also content used in the application. This includes question & answer templates for our editing. These templates are used for things like community ads. While the failure to load the content shouldn’t have prevented showing questions (this has now been fixed), a bug in the temporary multi-pathing fixes (supporting both old and new folders) didn’t properly correct where it was loading from and triggered a startup failure.
Here’s the chunk of code that failed, with an example variable for Stack Overflow:
The problem here is that Path.Combine, on Windows, uses the “\” separator for directories, not “/” as we’d expect in a URL. The check for “/Sites/” was intended to make /content/stackoverflow point to /content/Sites/stackoverflow, and that part worked. However, it didn’t prevent the /content/Sites/stackoverflow case from also being changed, so it became /content/Sites/Sites/stackoverflow. This is because the Path.Combine call outputs /Content\Sites\stackoverflow, with a “\”, so the “/Sites/” check never matches. When Stack Overflow (or any other Q&A site) used this bad path to load templates for questions and answers, a DirectoryNotFoundException was thrown.
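The separator mismatch is easy to reproduce outside of .NET. The sketch below is not the production code: it is a Python approximation in which `ntpath.join` stands in for Path.Combine on Windows, and the function names are made up for illustration.

```python
import ntpath  # Windows path rules, usable on any platform


def template_dir_buggy(content_root: str, *segments: str) -> str:
    """Approximation of the failing logic: a URL-style check on a Windows-style path."""
    path = ntpath.join(content_root, *segments)  # joins with "\", like Path.Combine on Windows
    if "/Sites/" not in path:  # never true for a "\"-joined path
        path = path.replace(content_root, content_root + "/Sites", 1)
    return path


def template_dir_separator_agnostic(content_root: str, *segments: str) -> str:
    """One possible alternative: compare path segments instead of raw substrings."""
    path = ntpath.join(content_root, *segments)
    parts = path.replace("\\", "/").split("/")
    if "Sites" not in parts:
        path = ntpath.join(content_root, "Sites", *segments)
    return path


print(template_dir_buggy("/Content", "stackoverflow"))
# /Content/Sites\stackoverflow        (old setting: rewritten as intended)
print(template_dir_buggy("/Content", "Sites", "stackoverflow"))
# /Content/Sites\Sites\stackoverflow  (new setting: "Sites" is doubled)
print(template_dir_separator_agnostic("/Content", "Sites", "stackoverflow"))
# /Content\Sites\stackoverflow
```

The second function only illustrates why comparing segments is safer than matching a substring with a hard-coded separator. It is not the fix we shipped.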
Why didn’t we see this immediately when we changed settings and tested on a few sites? The code in question runs only once, on application start. A site settings change takes effect immediately for most things, but this particular issue only manifests after a restart. When we fixed a few other image pathing issues and released a build, we inadvertently caused this code to start failing due to the restart associated with every build.
The fix was to remove the temporary if statement and stop the double replacement. We did so immediately and released another build to production.
How do we prevent this in the future?
Unfortunately, there are fundamental limits to testing locally vs. development and production when a CDN is involved. This precludes local testing for some elements, especially in cases where new content is needed and it’s not on the CDN yet. The root of the issue with old content disappearing was known. The script changes to deployment and the new flag that copies only deltas (and no longer mirrors) were a response to that. Unfortunately, that fix itself didn’t go live correctly, causing the rest of the downstream issues and creating urgency.
The loading of questions depending on something most sites don’t even have is simply unacceptable. That code path has been isolated and made optional. We’d definitely rather run with no templates and a background error than fail to serve question pages. We will also review all code that handles URLs and make sure it does not treat them as file paths anywhere. There are differences, and that’s what ultimately caused the worst of the issues here.
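As a rough illustration of what “isolated and made optional” means, here is a minimal Python sketch of a startup step that logs a missing template directory instead of failing. The function name, logger name, and return shape are assumptions for this example and not our actual code.

```python
import logging
from pathlib import Path

log = logging.getLogger("startup")


def load_templates(template_dir: str) -> dict:
    """Return {file name: template text}, or an empty dict if the directory is unusable."""
    try:
        return {
            p.name: p.read_text(encoding="utf-8")
            for p in Path(template_dir).iterdir()
            if p.is_file()
        }
    except OSError as exc:  # covers a missing directory or an unreadable file
        # Record the failure in the background and keep starting up.
        log.error("Templates unavailable at %s: %s", template_dir, exc)
        return {}
```

With this shape, a bad path such as the doubled Sites directory produces an error in the logs and an empty template set, while question pages keep being served.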