During the elevated 502/503 errors incident, a more serious issue occurred. Two servers dropped out of our database cluster, which left only one server trying both to serve traffic and to act as the source from which the other two servers resynced. The end result was extremely poor performance, which left the Archive mostly broken.
This loss of sync between the servers appears to be due to a bug in our database software, which we are investigating with the vendor's support team. It does not appear to be related to the app server CPU load issue. To get things back to a working state, we placed the Archive into maintenance mode, which allowed a second server to rejoin the cluster faster than it would have if we had tried to keep the Archive online.
At around 05:50 UTC, the second server successfully rejoined the cluster, and we brought the Archive out of maintenance mode with Under Attack Mode on to keep bot load down while database capacity was still reduced.
At 14:18 UTC, the third server rejoined the cluster, which brought the cluster back to a healthy state. We turned Under Attack Mode back off at 16:15 UTC.