About our architecture
We have CloudFront in front, which sends image, video and static file requests directly to S3.
The rest of the traffic arrives at a load balancer and goes from there to a fleet of auto-scaling EC2 instances, our web servers.
The web servers connect to S3 for video / image work, and to a series of AWS Aurora MySQL database servers for the database work.
The web servers offload async work to a second fleet of EC2 instances, which handles video and image processing, email sending and similar background tasks.
We’ve been able to scale up the web and background fleets without issue. S3 and CloudFront have also scaled without issue.
Our issue has been with the database servers. They are now running on the largest instances that AWS can provide.
Although they mostly work fine, there is a failure mode that occurs with low probability. When traffic is high, it does occur, and it leads to a sudden pile-up of connections on the affected database servers until they hit their maximum connection limit. The problem then cascades to our other database servers. Once in this high-connection state, the servers do not recover quickly or easily while under load.
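As a rough illustration only of why a connection pile-up can be so sudden, here is a small sketch based on Little's law (connections in use is roughly query rate times how long each query holds a connection). The traffic figure, latencies and connection limit are made-up numbers chosen for the example, not measurements from our systems, and this is not a diagnosis of the actual failure.

```python
# Hypothetical numbers for illustration; not our real traffic or limits.
MAX_CONNECTIONS = 5000      # assumed database connection limit
QUERIES_PER_SECOND = 20000  # assumed steady query rate at peak


def connections_in_use(queries_per_second, latency_ms):
    """Little's law: average open connections = arrival rate x time held."""
    return queries_per_second * latency_ms / 1000


# Same traffic, but each query holds its connection for longer.
for latency_ms in (5, 50, 500):
    needed = connections_in_use(QUERIES_PER_SECOND, latency_ms)
    status = "over limit" if needed > MAX_CONNECTIONS else "ok"
    print(f"{latency_ms} ms per query: about {needed:.0f} connections ({status})")
```

Under these assumptions, the traffic level does not change at all: a slowdown in the queries alone is enough to push the connection count past the limit.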
What we’re doing
We are still working to diagnose this.
We continue to work through a range of things that we think could be the root cause and might fix the issue, so far without success.
Please know that we are devastated by these performance problems and we are putting all our efforts into resolving them as soon as possible.