The issue on September 17th actually began as a problem that was not on our end. We received internal reports of trouble accessing the site, but our own metrics only showed CPU load dropping off by about half.
After running some basic network troubleshooting, we found that the site was unreachable when we forced a connection over IPv4, but reachable when we forced a connection over IPv6. Given that the IP addresses in question belong to Cloudflare, there was little we could do other than report the issue. The postmortem later posted by Cloudflare confirms that we were early to identify & escalate the issue, as our tweet was posted 11 minutes before Cloudflare declared an incident 😉
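For anyone curious, the check is simple to reproduce: attempt the same connection once per address family. The sketch below uses Python's socket module with a placeholder hostname; it's an illustration of the idea, not the exact commands we ran during troubleshooting.

```python
import socket

def reachable(host: str, port: int = 443, family: int = socket.AF_INET,
              timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds for one address family."""
    try:
        # Resolve only addresses of the requested family (AF_INET = IPv4, AF_INET6 = IPv6).
        addresses = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    for fam, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True  # at least one address of this family answered
        except OSError:
            continue
    return False

# "example.org" is a placeholder for the affected hostname.
print("IPv4:", reachable("example.org", family=socket.AF_INET))
print("IPv6:", reachable("example.org", family=socket.AF_INET6))
```

During the incident, the IPv4 check failed while the IPv6 check succeeded, which pointed to the problem being upstream of our servers.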
After Cloudflare resolved the issue, the Archive briefly went down again, displaying 503 errors. The metrics showed a CPU dropoff similar to the one we saw on the 17th.
This confirmed our suspicion that the issue was socket related, as the flood of returning IPv4 connections would have opened a lot of sockets at once. To resolve this, we changed our firewall configuration to allow communication with the frontend servers over additional internal IP addresses, which increases the number of sockets available (this was the previously mentioned “difficult change”, although we ended up applying it to a different set of servers than we had originally planned). Since making this change, we have not seen any similar disruption.
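To illustrate why extra internal addresses help: each TCP connection is identified by its (source IP, source port, destination IP, destination port) tuple, so the ephemeral port range caps how many simultaneous connections can exist between a single pair of addresses, and every additional source address multiplies that limit. Our actual fix was a firewall configuration change rather than application code, but the hypothetical sketch below (with made-up addresses) shows the same principle of spreading connections across several source IPs.

```python
import itertools
import socket

# Hypothetical pool of internal addresses assigned to the connecting host;
# the real addresses and server roles are not described in this post.
LOCAL_ADDRESSES = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
_rotation = itertools.cycle(LOCAL_ADDRESSES)

def connect_to_frontend(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection whose source address rotates through LOCAL_ADDRESSES.

    With N source addresses, roughly N times as many simultaneous connections
    can exist to the same frontend before ephemeral ports are exhausted.
    """
    source_ip = next(_rotation)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.bind((source_ip, 0))  # port 0 lets the kernel pick an ephemeral port
    sock.connect((host, port))
    return sock
```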