The Session Expired and logout errors began on September 19th at around 16:09 UTC, when we started deploying a new version of the Archive software. We quickly began hearing from both volunteers and users that they were receiving Session Expired errors, were being forcibly logged out, and in some cases were unable to log in at all.
We tried various troubleshooting steps from a Systems perspective to see if one of our caching layers was causing the issue, but none of these measures produced a positive result, so we decided to revert to the previous version of the Archive code. Because of this, we had to change the running code in a less graceful way than usual, which did result in some short downtimes.
Unfortunately, we continued to see Session Expired errors after the rollback, although at a lower rate. To rule out an erroneous caching issue, we disabled our Cloudflare cache, and we also bypassed our Cloudflare Worker code, which makes some caching decisions and repairs some session issues, in case it was causing problems.
Seeing no notable improvement with caching disabled, we re-enabled the cache at about 11:30 UTC, and we re-enabled our worker script at about 11:49 UTC.
For testing purposes, we disabled the worker again at 17:44 UTC before re-enabling it at 18:19 UTC. This confirmed that, although it did not completely prevent the errors, the worker script was doing its job and was helping at least some sessions remain valid.
After further investigation and debugging, we deployed a new version of our worker script at 16:28 UTC. The version previously in place only repaired requests that were fetching information, whereas the new version also repaired requests submitting information to the site. This may have had an additional positive impact, but it still did not fully resolve the issue.
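As a rough illustration of that change (a sketch only, with repairSession as a hypothetical stand-in for whatever cookie fix-up the worker actually performs), the update amounted to removing the method check that limited the repair to read-only requests:

```typescript
// Sketch of the worker change; repairSession is a hypothetical stand-in for
// the worker's actual session-repair logic, which is not shown here.
async function repairSession(request: Request): Promise<Request> {
  return request; // placeholder: the real repair rewrites problem session cookies
}

// Previous worker: only requests fetching information were repaired.
async function handleBefore(request: Request): Promise<Request> {
  return request.method === "GET" ? repairSession(request) : request;
}

// Updated worker: requests submitting information (POST, PUT, etc.) are
// repaired as well, so form submissions keep their session too.
async function handleAfter(request: Request): Promise<Request> {
  return repairSession(request);
}
```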
At 17:00 UTC, we deployed the new version of the Archive that had previously failed, but without the cookie-related changes we believed were responsible for the issue. Interestingly, this appeared to fully resolve the problem.
Although we are missing the data to confirm exactly what occurred originally, our overall analysis is that the multiple changes to how session cookies are handled in the newly deployed Archive version did not work well in production. First, the cookies were changed to an encrypted format, which caused a problem described below. Second, the hashing algorithm used for cookies was abruptly changed from SHA-1 to SHA-256 as part of a Rails update. Because our deployments roll out gradually, servers running the old code and servers running the new code were briefly live at the same time and could not validate each other's cookies, so users were logged out. This did not happen in our staging environment, which is smaller and therefore does not take as long to finish deploying.
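To illustrate the digest mismatch in the abstract (this is not Rails' actual cookie-signing code, and the secret and payload below are placeholders), signing the same session data with SHA-1 and SHA-256 produces incompatible signatures, so a server expecting one digest rejects a cookie produced with the other:

```typescript
import { createHmac } from "node:crypto";

// Illustration only: placeholder secret and session payload.
const secret = "application-secret";
const sessionPayload = "session-id=abc123";

function sign(digest: "sha1" | "sha256", payload: string): string {
  return createHmac(digest, secret).update(payload).digest("hex");
}

// A browser holding a cookie signed by an old (SHA-1) server presents it to a
// new (SHA-256) server during the rolling deploy; the signatures do not match,
// so the session is rejected and the user appears logged out.
const cookieFromOldServer = sign("sha1", sessionPayload);
const expectedByNewServer = sign("sha256", sessionPayload);

console.log(cookieFromOldServer === expectedByNewServer); // false -> "Session Expired"
```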
Additionally, our Cloudflare Worker script relies on being able to read the contents of cookies. Because it could not read the newly encrypted cookies, it assumed those requests were not cacheable and set a flag telling the Archive that the request’s user was signed in, even when they may not have been. The Archive application then saw a “logged in” user without a matching valid session, so it presented the user with the Session Expired error.
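The following is a minimal sketch of that failure mode in a Cloudflare Worker. The cookie name, the X-Logged-In header, and the parsing logic are all assumptions for illustration, not our actual worker code:

```typescript
// Minimal sketch of the failure mode; cookie name, header name, and parsing
// logic are assumptions, not our actual worker code.
const SESSION_COOKIE = "_otwarchive_session";

// Stand-in for the worker's application-specific cookie parsing. It could make
// sense of the old cookie format, but an encrypted value fails to parse.
function parseSessionCookie(cookieHeader: string | null): unknown {
  if (!cookieHeader) return null;
  const pair = cookieHeader
    .split(";")
    .map((part) => part.trim())
    .find((part) => part.startsWith(`${SESSION_COOKIE}=`));
  if (!pair) return null;
  try {
    return JSON.parse(atob(decodeURIComponent(pair.slice(SESSION_COOKIE.length + 1))));
  } catch {
    return null; // encrypted cookie values always end up here
  }
}

export default {
  async fetch(request: Request): Promise<Response> {
    const session = parseSessionCookie(request.headers.get("Cookie"));
    const headers = new Headers(request.headers);
    if (session === null) {
      // Unreadable cookie: treat the request as belonging to a signed-in user
      // and skip caching. Once every cookie became unreadable, this flag was
      // sent even for users without a valid session, and the origin answered
      // with the Session Expired error.
      headers.set("X-Logged-In", "true");
    }
    return fetch(new Request(request, { headers }));
  },
};
```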
When we made the decision to revert to the previous version of the Archive, we did so manually, outside of our typical deployment procedure. Presumably, something either remained from the new version or was not correctly restored to the old version, and this continued to cause the remaining Session Expired errors. Additionally, users who had received cookies in the new format needed to renew their sessions to receive cookies in the older format that we could read.
When we eventually deployed the new version of the Archive again (with the offending cookie changes reverted) through our normal deployment process, the clean deploy cleared up whatever inconsistency had remained and fixed the Session Expired errors.
Going forward, if a rollback is needed, we will make an effort to perform it through a subsequent automated deploy. We have also had some issues with our deployment process in general, and we are testing a new deployment process in one of our staging environments, which we believe will increase the reliability of code deployments.