Frequently Asked Questions
We are truly sorry for the technical issues that impacted you all between the 5th and 12th of January 2021 and for the delays with notifications, scheduled observations, and antivirus scans that some of you experienced during the following week.
We would like to thank you for your patience and understanding while we investigated the issue and made changes to help mitigate it.
This FAQ page details what the problem was and the changes we made. If there is anything you would like to know that isn’t covered on this page, please email firstname.lastname@example.org.
Page last updated: 13:15 GMT Tuesday 23rd February
What were the symptoms of the issue?
As users moved around Tapestry, they were taken to error pages. In some cases that was a 500 error page (a drawing of two dogs, a server, and one of our devs), and in some cases, a 504 error (a text page). Either way, whatever they were trying to do would not have worked and they would have needed to try again.
The issue affected some Tapestry accounts more severely than others. Most people experienced errors to some extent, but some people experienced a lot of them. The error rates were not consistent within accounts, meaning users may have experienced quite a lot of errors in a short burst, and then no more for a while.
To help reduce these errors, we turned off ‘background processes’ as soon as we could see that the load on our servers was high. What this meant in terms of symptoms was:
- Notifications stopped going out.
- Media uploaded successfully, provided an error page wasn’t shown, but was not ‘processed’ and therefore wasn’t visible.
- Documents uploaded successfully, provided an error page wasn’t shown, but the virus check we run on them was not started and therefore they were not downloadable.
- PDFs set to export did not become available to download.
- Scheduled observations were not published.
These things were not lost. They were put into a queue and went out once our servers were stable, usually overnight.
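For readers who like to see the idea in code, here is a minimal sketch of how pausing background work while keeping it in a queue can behave. It is purely illustrative and is not Tapestry's actual implementation: the class name `BackgroundQueue`, its methods, and the example jobs are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class BackgroundQueue:
    """Hypothetical queue: holds jobs while paused, runs them in order on resume."""

    paused: bool = False
    pending: Deque[Callable[[], str]] = field(default_factory=deque)
    completed: List[str] = field(default_factory=list)

    def submit(self, job: Callable[[], str]) -> None:
        # While paused, the job is kept for later rather than dropped.
        if self.paused:
            self.pending.append(job)
        else:
            self.completed.append(job())

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        # Run everything that was waiting, oldest first.
        self.paused = False
        while self.pending:
            job = self.pending.popleft()
            self.completed.append(job())


# Example: server load is high, so background work is paused.
queue = BackgroundQueue()
queue.pause()
queue.submit(lambda: "notification sent")
queue.submit(lambda: "media processed")
waiting_during_pause = len(queue.pending)  # both jobs are held, none lost

# Later, once the servers are stable, the queued work is released.
queue.resume()
```

In this sketch nothing submitted during the pause is discarded; it simply runs later, in the order it arrived.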
Did high traffic cause it?
Whilst the issue did occur because of very high traffic, it was not one of scalability, but rather a deeper technical problem. The exceptional level of activity caused a chain reaction on our database servers. Those with technical expertise may want to read our technical explanation.
What did you do to fix it?
As soon as we saw the servers starting to struggle, we turned off what we call ‘background processes’. This reduced the strain on our servers and meant users were more likely to be able to move around Tapestry, view existing posts, and add new posts successfully. However, this was not a long-term solution.
In terms of longer-term fixes, our first steps were to:
- Upgrade our database servers to the largest ones that our hosting company, AWS, provides.
- Set up systems to watch closely what was happening to our databases when these errors started – this is known as logging. That gave us a more detailed insight into the problem.
Across the following week we:
- Made background changes to Tapestry to share the load more evenly across our database servers.
- Took steps to reduce the chance of certain actions clashing with each other and slowing the system down.