Incident summary
All times in the document are recorded in UTC+2 (CEST).
On January 26, 2024, between 06:35 and 07:48, the platform was unavailable for all users. This incident was caused by a combination of factors that saturated our database connection pool. With no connections available, requests could not be served, causing a temporary disruption of our services.
Our team worked to address the issue promptly, and the problem has been resolved. The database connection pool has been optimised to ensure stability, and services are fully restored.
To prevent a recurrence of this incident, our engineering team conducted a thorough analysis to identify the root causes and has implemented corrective measures. We are committed to strengthening our infrastructure to enhance its resilience and prevent similar issues in the future.
Leadup
Around 02:00 CEST, our database encountered a surge in activity due to a high volume of write operations on our user table. This spike was primarily attributed to a customer continuously creating user accounts and to our nightly periodic jobs synchronising with Learning Management Systems.
Fault
As a consequence of this high write activity, the write operations caused lock contention on the user table, and database connections queued up waiting on one another, leading to significant delays. Eventually, the situation escalated to the point where the database ran out of available connections.
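To illustrate the failure mode, the short simulation below (a simplified sketch, not our production code; the pool size, lock, timings, and request count are all made up for the example) shows how writers that queue on a single lock each keep holding a pooled connection, until the pool is exhausted and new requests are turned away.

```python
# Minimal simulation of connection-pool saturation caused by lock contention.
# All numbers here are illustrative only.
import threading
import time

POOL_SIZE = 5                      # hypothetical connection-pool limit
pool = threading.BoundedSemaphore(POOL_SIZE)
table_lock = threading.Lock()      # stands in for locks on the user table

def write_user(i: int) -> None:
    # Try to take a connection from the pool; give up after a short wait.
    if not pool.acquire(timeout=1):
        print(f"request {i}: connection rejected (pool saturated)")
        return
    try:
        with table_lock:           # every write serialises on the same lock
            time.sleep(2)          # a slow write holds both the lock and the connection
        print(f"request {i}: write committed")
    finally:
        pool.release()

threads = [threading.Thread(target=write_user, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```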
Impact
Between 06:35 and 07:48 the platform was unavailable for all users. Prior to that, the platform had experienced degraded performance beginning at 03:04.
Detection
The incident was detected when our monitoring system, Pingdom, triggered an alert for possible downtime of our application.
We also received a notification from our database monitoring system that queries had been running for an unacceptably long time.
When we then attempted to connect to the database manually, our connection was refused, confirming that the connection pool was saturated and new connections were being rejected.
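For illustration, the sketch below shows the kind of long-running-query check such database monitoring typically performs. It assumes, for the sake of the example, a PostgreSQL database queried with psycopg2; the connection string, threshold, and monitoring role are placeholders, not our actual configuration.

```python
# Hedged sketch of a long-running-query check against pg_stat_activity.
import psycopg2

SLOW_QUERY_THRESHOLD = "5 minutes"   # illustrative threshold
DSN = "host=db.example.internal dbname=app user=monitor"  # hypothetical DSN

def find_long_running_queries(dsn: str = DSN):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT pid, now() - query_start AS duration, state, query
                FROM pg_stat_activity
                WHERE state <> 'idle'
                  AND now() - query_start > %s::interval
                ORDER BY duration DESC
                """,
                (SLOW_QUERY_THRESHOLD,),
            )
            return cur.fetchall()

if __name__ == "__main__":
    for pid, duration, state, query in find_long_running_queries():
        print(f"pid={pid} duration={duration} state={state} query={query[:80]}")
```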
Response
To address the issue promptly, we took the corrective measure of rebooting the database, which released the stuck connections.
Recovery
After the reboot, the stuck connections were freed and service was restored. Any requests that were pending during the downtime received a failure response to indicate that they had not gone through.
Timeline
All times are CEST
26th of January 2024
03:04 - The first short outage is reported
03:37 - Database monitoring sends an alert that queries are taking longer than usual
06:35 - The database connections are at a maximum and all new connections are rejected
07:30 - The engineer on call starts investigating the issue
07:46 - The engineer on call makes the decision to reboot the database
07:48 - The database is rebooted and service is restored
11:45 - The status page is updated by the support team with an explanation of what occurred
Reflection
We noticed that the status page was not updated automatically, which should have happened as soon as our uptime monitoring detected that the application was unavailable. We found that the integration between our status page and the uptime monitoring was no longer active. Because we had no monitoring on this integration, we were not aware that it was broken. The integration has been restored, and we have put a procedure in place to regularly check that it is active.
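As an illustration of that procedure, the following sketch runs a check that the integration reports itself as active and raises an alert otherwise. The endpoint, token handling, and alerting shown are hypothetical placeholders rather than our status page provider's actual API.

```python
# Hypothetical periodic check that the uptime-monitoring -> status-page
# integration is still active. URL and token are placeholders.
import requests

INTEGRATION_STATUS_URL = "https://statuspage.example.com/api/integrations/uptime"
API_TOKEN = "REPLACE_ME"  # in practice, loaded from a secrets manager

def integration_is_active() -> bool:
    resp = requests.get(
        INTEGRATION_STATUS_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("active", False)

if __name__ == "__main__":
    if not integration_is_active():
        # In production this would page the on-call engineer; here we just print.
        print("ALERT: status page integration with uptime monitoring is inactive")
```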
Moving forward, we are actively implementing measures to prevent a recurrence of this incident. We have already changed how we handle user updates, making those updates more efficient when faced with high traffic, and our team will continue to monitor the application closely to ensure its stability and performance.
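As a generic illustration of making user writes cheaper under bursts of traffic (one common technique, not necessarily the exact change we shipped), the sketch below buffers incoming user updates and flushes them in batches, so that each database transaction stays short and holds locks for less time.

```python
# Illustrative batching of user writes; the flush function is a stand-in.
import time
from typing import Callable

class UserWriteBatcher:
    """Buffer user writes and flush them in batches to keep transactions short."""

    def __init__(self, flush_fn: Callable[[list], None],
                 max_batch: int = 100, max_wait_s: float = 1.0):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._buffer: list = []
        self._last_flush = time.monotonic()

    def add(self, user_row: dict) -> None:
        self._buffer.append(user_row)
        if (len(self._buffer) >= self.max_batch
                or time.monotonic() - self._last_flush >= self.max_wait_s):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.flush_fn(self._buffer)   # e.g. a single multi-row INSERT/UPDATE
            self._buffer = []
        self._last_flush = time.monotonic()

# Example usage with a stand-in flush function:
if __name__ == "__main__":
    batcher = UserWriteBatcher(lambda rows: print(f"flushing {len(rows)} user rows"))
    for i in range(250):
        batcher.add({"id": i, "name": f"user-{i}"})
    batcher.flush()   # flush any remainder
```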