AO3 unavailable on some networks

Postmortem

The issue on September 17th actually began as a problem that was not on our end. We received internal reports of issues accessing the site, but our metrics only showed CPU load dropping off by about half.

After running some basic network troubleshooting, we found that we were unable to reach the site when forcing a connection via IPv4, but forcing a connection via IPv6 did work. Given that the IP addresses in question belong to Cloudflare, there was little we could do other than report the issue. The postmortem later posted by Cloudflare confirms that we were early to identify and escalate the issue, as our tweet was posted 11 minutes before Cloudflare declared an incident 😉
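For readers who want to run a similar check, the sketch below attempts a TCP connection over one address family at a time. It is a minimal illustration, not the exact commands we ran: the function name `can_connect`, the 5-second timeout, and the host and port in the usage section are assumptions made for the example.

```python
import socket


def can_connect(host, port, family, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds over `family`.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        # Only resolve addresses of the requested family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # The host has no address of this family (or resolution failed).
        return False

    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Timeout, refusal, or unreachable network: try the next address.
            continue
    return False


if __name__ == "__main__":
    # Example host and port, chosen for illustration.
    host, port = "archiveofourown.org", 443
    print("IPv4 reachable:", can_connect(host, port, socket.AF_INET))
    print("IPv6 reachable:", can_connect(host, port, socket.AF_INET6))
```

During an incident like this one, a check of this kind would be expected to fail for IPv4 and succeed for IPv6, which points at the network path rather than at the servers behind it.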

After Cloudflare resolved the issue, the Archive briefly went down again, displaying 503 errors. The metrics showed a CPU dropoff similar to the one we saw on the 17th.

This confirmed to us that the issue appeared to be socket-related, as the flood of returning IPv4 connections would have opened a lot of sockets at once. To resolve this, we changed our firewall configuration so that the firewalls can communicate with the frontend servers over additional internal IP addresses, which increases the number of sockets available (this was the previously mentioned “difficult change”, although we ended up applying it to a different set of servers than we were originally planning to). Since making this change, we have not seen any similar disruption.
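As a rough illustration of why extra internal IP addresses help: a TCP connection is identified by its source IP, source port, destination IP, and destination port, so each additional address on either side multiplies the number of distinct connections that can be open at once. The sketch below works through that arithmetic. The function name is hypothetical, the port range is the common Linux default for `net.ipv4.ip_local_port_range` (an assumption, not our actual setting), and the address counts are made up for the example.

```python
def max_concurrent_connections(src_ips, dst_ips, dst_ports=1,
                               port_range=(32768, 60999)):
    """Upper bound on simultaneous TCP connections between two groups of hosts.

    Each connection needs a unique (src IP, src port, dst IP, dst port) tuple.
    `port_range` is the inclusive range of ephemeral source ports; the default
    here is the usual Linux default and is an assumption for this example.
    """
    low, high = port_range
    ephemeral_ports = high - low + 1
    return src_ips * ephemeral_ports * dst_ips * dst_ports


# One source address talking to one destination address and port.
print(max_concurrent_connections(src_ips=1, dst_ips=1))  # 28232

# Same source, but four destination addresses (illustrative count).
print(max_concurrent_connections(src_ips=1, dst_ips=4))  # 112928
```

These figures are theoretical ceilings. In practice the usable number is lower, since sockets lingering in states such as TIME_WAIT still occupy their tuple for a while.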

Posted Oct 08, 2024 - 00:03 UTC

Resolved

AO3 users on several ISPs were unable to access the Archive. ISPs that could fall back to IPv6 (primarily cell providers) were not impacted.
Posted Sep 17, 2024 - 17:50 UTC