AO3 database issues

Postmortem

September 2nd

During the elevated 502/503 errors incident, a more serious issue occurred. Two servers dropped out of our database cluster, which left a single server both serving traffic and acting as the source from which the other two servers resynced. The end result was extremely poor performance, which left the Archive mostly broken.

This loss of sync between the servers appears to be due to a bug in our database software, which we are investigating with its support team. It does not appear to be related to the app server CPU load issue. To get things back to a working state, we placed the Archive into maintenance mode, which allowed a second server to rejoin the cluster faster than it would have if we had tried to keep the Archive online.

September 3rd

At around 05:50 UTC, the second server successfully rejoined the cluster, and we brought the Archive out of maintenance mode with Under Attack Mode on to keep bot load down while database capacity was reduced.

At 14:18 UTC, the third server rejoined the cluster, which brought the cluster back to a healthy state. We turned Under Attack Mode back off at 16:15 UTC.

Posted Oct 07, 2024 - 23:55 UTC

Resolved

Due to database servers falling out of sync, the Archive was largely unusable for several hours between 2024-09-02 and 2024-09-03. To let the servers catch up faster, we enabled maintenance mode, during which time the Archive was totally unavailable.
Posted Sep 02, 2024 - 22:53 UTC