AO3 elevated 502 and 503 errors

Postmortem

August 13th Elasticsearch Changes

On August 13th, 2024, we made some changes to our server pools to account for increased load on Elasticsearch, the software that powers our search & filtering systems. This increased load was causing intermittent but noticeable slowness and/or 502 errors when users attempted to access their bookmarks. We intend to resolve this issue more permanently by purchasing new Elasticsearch servers once AMD’s Epyc Turin processors become available. However, as a stopgap measure, we pulled a pair of our application servers over to Elasticsearch duties, which greatly helped the situation.

This is important context to keep in mind, as this change reduced the capacity available in the main application server pool.

August 24th

August 24th, 2024, at around 15:40 UTC is when we began to see the first signs of issues. Our Tag Wrangling committee began to report slowness on tag wrangling pages and the site in general, in addition to some 502 & 503 errors, and we noticed some very unusual CPU load patterns on our application servers. We initially responded by turning on Cloudflare’s Under Attack Mode (UAM) to shed any automated requests that might be contributing to the problem. After some investigation, we believed a third-party proxy to be the cause of the problem, took measures to block its traffic, and disabled UAM. In hindsight, we were wrong.

At around 17:45 UTC, the issue returned, and we re-enabled Under Attack Mode, which appeared to help. We did some further investigating and set some Cloudflare rules to challenge requests from certain Autonomous System Numbers (ASNs) that seemed to be sending a lot of search requests. After disabling Under Attack Mode, things seemed stable, so we continued to believe this traffic was the cause of the issue, which was also likely an incorrect assumption.

August 27th

At around 03:48 UTC, we saw a continuation of the slowness issues and noticed increased requests to the work search path yet again. We continued to treat this search traffic as the cause of the issues.

We noticed during this iteration of the issue that load was also peaking on our frontend servers, which are responsible for traffic shaping, load balancing across our application servers, and serving static files such as stylesheets.

We turned on Under Attack Mode once more, but this time it did not appear to help the issue. We tried doing a rolling restart of the Archive software on our application servers, which did help during this iteration. We again disabled Under Attack Mode after the load returned to normal.

At around 16:00 UTC, the issue began again, and our initial response was to lower some rate limits and block some proxies that we normally consider respectful, since it looked like a lot of search traffic was suddenly coming via them. There was some internal discussion about whether this was actually the case, and we ended up reverting the proxy-related changes.

At 21:40 UTC, we had another recurrence of the issue. We ultimately ended up reloading HAProxy (a load balancing software) on our frontend servers, which seemed to help the situation.

August 28th

We deployed some changes to our HAProxy configuration on our frontend servers to allow more connections between the frontend servers and the application servers, which we thought would resolve the problem.

September 1st

At around 20:00 UTC, the issue occurred yet again, but appeared to recover on its own. We believed this was a sign that our previous change had helped the situation.

September 2nd

At around 16:45 UTC, we again received internal reports of slowness issues.

Under Attack Mode was enabled again, which helped, and it was disabled about 45 minutes later.

A couple of hours later, the issue returned once more. We enabled Under Attack Mode and reloaded HAProxy on our frontend servers, which seemed to help, so we disabled Under Attack Mode once again.

September 3rd

At 20:40 UTC, the load-spiking issue returned. We attempted the frontend HAProxy reload that had worked before, but unfortunately it did not help this time. Under Attack Mode and a rolling restart of the Archive application service also did not appear to help. We briefly put the Archive into maintenance mode to see if that would help, but it helped only slightly.

At 22:32 UTC, we tried restarting HAProxy on our application servers based on a hunch that it was using high amounts of CPU. For context, we utilize HAProxy on those servers to load balance between our Elasticsearch servers and to provide a backup path to our database servers if our primary DB load balancing solution fails. The restart was successful and returned load to normal. We dropped Under Attack Mode a couple of minutes later, which did not have any noticeable negative effect.

This gave us a massive clue toward resolving the issue. While the CPU usage that prompted the restart did not turn out to be abnormal, the recovery clearly showed that the issue lay in HAProxy on our app servers.

September 6th

At 21:40 UTC, the issue recurred and was resolved with an app server HAProxy restart, further confirming to us that this was the area causing the problems. About an hour later, we applied a change to our HAProxy configuration to remove it from a part of the request flow where it was unnecessary.

Root Cause & Solution

Since applying the relevant changes, we have not seen any recurrence of that particular issue, so we believe we have the correct picture of what was occurring.

Prior to our changes, all requests went through HAProxy on each application server before being passed to the actual Archive software. The original reasoning for this was to allow some extra control over the ordering of requests when the servers are too busy; however, in practice we are not busy often enough to take advantage of this.

Due to the reduced application server capacity, we believe HAProxy on the individual servers was being overwhelmed by TCP port exhaustion, with sockets waiting longer than necessary to close. This caused a particularly significant slowdown for all requests, since database & Elasticsearch requests also route through HAProxy. The slowdowns would lead users to retry their requests, which likely explains the increase in requests we saw. These additional requests, combined with the original load, likely contributed to the CPU load spiking as requests stalled out and were subsequently retried.
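For the curious, port exhaustion like this can usually be spotted from the state of a host's TCP sockets: each proxied connection consumes a local port, and closed connections linger in the TIME_WAIT state before the port becomes reusable, so a large pile-up of TIME_WAIT sockets on a busy proxy host is the classic warning sign. The following Python sketch is a generic Linux check along those lines (not the exact tooling we used); it simply counts sockets by TCP state:

    #!/usr/bin/env python3
    # Rough diagnostic sketch: count sockets per TCP state by reading
    # /proc/net/tcp and /proc/net/tcp6 on Linux. An unusually large number
    # of TIME_WAIT entries on a busy proxy host is one classic symptom of
    # the port exhaustion described above.
    from collections import Counter

    # Hex state codes used by the kernel (see include/net/tcp_states.h).
    TCP_STATES = {
        "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
        "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
        "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
        "0A": "LISTEN", "0B": "CLOSING",
    }

    def count_socket_states() -> Counter:
        counts = Counter()
        for path in ("/proc/net/tcp", "/proc/net/tcp6"):
            try:
                with open(path) as f:
                    next(f)  # skip the header line
                    for line in f:
                        state = line.split()[3]  # the "st" column, a hex code
                        counts[TCP_STATES.get(state, state)] += 1
            except FileNotFoundError:
                pass  # e.g. no IPv6 table on this host
        return counts

    if __name__ == "__main__":
        for state, count in count_socket_states().most_common():
            print(f"{state:12} {count}")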

The solution we implemented was to remove HAProxy on the app servers from the main request flow and use it only for Elasticsearch/DB routing. Additionally, we increased the maximum number of connections allowed between our frontend servers and the application servers.
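To illustrate the connection limit side of the change: limits of this kind are expressed through HAProxy's maxconn settings, and their headroom can be watched through HAProxy's admin socket, whose "show stat" output reports the current number of sessions (scur) against the configured limit (slim) for every server. Below is a rough Python sketch of such a check; the socket path is only a placeholder, since it depends on the local "stats socket" configuration:

    #!/usr/bin/env python3
    # Rough sketch: query HAProxy's admin socket and report sessions in use
    # versus the configured limit for each backend server, one way to see
    # whether a proxy is running out of connections to the pool behind it.
    # The socket path below is an assumption and depends on local config.
    import csv
    import io
    import socket

    SOCKET_PATH = "/run/haproxy/admin.sock"  # placeholder path

    def show_stat(path: str = SOCKET_PATH) -> str:
        """Send 'show stat' to the HAProxy admin socket and return the raw CSV."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(path)
            sock.sendall(b"show stat\n")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode()

    def report_session_usage(raw_csv: str) -> None:
        """Print current sessions vs. the configured limit for each server row."""
        reader = csv.DictReader(io.StringIO(raw_csv.lstrip("# ")))
        for row in reader:
            if row.get("svname") in ("FRONTEND", "BACKEND"):
                continue  # aggregate rows; only individual servers matter here
            scur = row.get("scur") or "0"
            slim = row.get("slim") or "none"
            print(f"{row['pxname']}/{row['svname']}: {scur} sessions (limit: {slim})")

    if __name__ == "__main__":
        report_session_usage(show_stat())

If the current session count regularly sits at the limit for the application servers, requests are queueing at the proxy rather than being passed straight through, which surfaces to users as exactly the kind of slowdown described above.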

Posted Oct 07, 2024 - 23:50 UTC

Resolved

Ongoing wrangling slowness and increased incidence of 502 and 503 errors from 2024-08-24 to 2024-09-06.
Posted Aug 24, 2024 - 15:40 UTC