Incident summary
Our auto scaling scaled too aggressively, causing our database to run out of memory. The database rebooted itself, which resolved the issue.
The issue was noticed by our performance and error monitoring tool on Wednesday the 2nd of November 2022 at 13:39. The technical team started an investigation right away, and the cause was identified at 13:43.
Leadup
The release of October 16th 2022 introduced a new auto scaling algorithm that automatically scales our application and background servers based on the expected usage of our platform. The algorithm scaled too aggressively, opening over 1000 connections to the database.
In addition to the number of connections, a change in the codebase caused certain database queries to use more memory than required while a digital test was taking place. The database server restarted itself due to the increased number of database connections and the lack of freeable memory.
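To illustrate the failure mode: each server that the auto scaling adds opens its own pool of database connections, so an unchecked scaling decision multiplies into the total number of connections the database has to serve. The sketch below shows the kind of connection-budget guard that keeps this in check; it is purely illustrative, and all numbers and names are hypothetical rather than taken from our actual configuration.

    # Illustrative only: cap the number of application servers so the combined
    # connection pools stay under the database connection limit.
    DB_MAX_CONNECTIONS = 1000      # hypothetical database limit
    POOL_SIZE_PER_SERVER = 25      # hypothetical connections per server
    HEADROOM = 0.8                 # keep 20% of the connections free

    def max_servers_allowed() -> int:
        budget = int(DB_MAX_CONNECTIONS * HEADROOM)
        return budget // POOL_SIZE_PER_SERVER

    def clamp_scaling_decision(desired_servers: int) -> int:
        # The auto scaling algorithm proposes a server count based on expected
        # usage; this clamp keeps it within the database connection budget.
        return min(desired_servers, max_servers_allowed())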
Impact
The database server rebooted itself within one minute of going down. During the reboot of the database server, all requests accessing the database were affected by the issue, for example Ans attempting to save an answer given by a single participant in a digital test. Between 13:39 and 13:40, around 2975 of the 7162 requests failed (41%). The schools that were conducting digital tests were informed and received a list of the assignments that were ongoing and impacted by this issue. The requests of participants taking a digital test were retried once the database server was back up.
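One common way to implement such retries is a retry loop with exponential backoff around the failing request. The sketch below is illustrative only; the endpoint, payload, and function names are hypothetical and not part of the Ans platform.

    import time
    import urllib.error
    import urllib.request

    def save_answer_with_retry(url: str, payload: bytes, attempts: int = 5) -> bytes:
        # Retry the request with exponential backoff while the database
        # (and therefore the API) is unavailable.
        delay = 1.0
        for attempt in range(attempts):
            try:
                request = urllib.request.Request(url, data=payload, method="POST")
                with urllib.request.urlopen(request, timeout=10) as response:
                    return response.read()
            except urllib.error.URLError:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)
                delay *= 2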
Response
The Development Team responded to the incident with an investigation into the issue. The issue was identified at 13:43. After identifying the issue, the Development Team started looking into possible solutions. At 14:18, the Development Team started working on several fixes. The first solution was to improve the database queries that were changed with the release of October 16th: an index was added, which speeds up the queries and reduces the amount of memory they need. The second solution was to increase the amount of memory of the database server. The first solution was deployed as a hotfix at 21:53 and the second solution at 22:36.
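As an illustration of the first fix: an index lets the database locate the relevant rows directly instead of scanning, and buffering, the whole table, which is what reduces both query time and memory usage. The sketch below shows what adding such an index looks like; the table, column, and index names are hypothetical, and SQLite is used only so the example is self-contained.

    import sqlite3

    # Hypothetical example of indexing the column used to look up answers
    # during a digital test. Names are illustrative, not our actual schema.
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE answers (id INTEGER PRIMARY KEY, assignment_id INTEGER, body TEXT)"
    )
    connection.execute(
        "CREATE INDEX idx_answers_assignment_id ON answers (assignment_id)"
    )

    # With the index in place, this lookup no longer scans the whole table.
    connection.execute("SELECT body FROM answers WHERE assignment_id = ?", (42,))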
Recovery
The Development Team has implemented a few measures to prevent this from happening again:
- Increased the memory size of the database server.
- Made plans (targeted for Q4 2022) to improve the configuration of the database server settings, which will improve the database server's memory usage.
- Improved the database queries that affect performance while taking a digital test.
- Improved the monitoring of database memory usage, so we are notified earlier in case of low freeable memory (see the sketch after this list).
- The performance improvements and the increase of memory size were hotfixed the same day.
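As a sketch of the monitoring improvement mentioned above: assuming the database runs on AWS RDS and is monitored through CloudWatch (an assumption of this example, not a detail of this report), a low-freeable-memory alert could be configured as follows. The instance identifier, threshold, and notification topic are hypothetical.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Assumption: the database is an AWS RDS instance reporting the standard
    # FreeableMemory metric. All identifiers and thresholds are hypothetical.
    cloudwatch.put_metric_alarm(
        AlarmName="rds-low-freeable-memory",
        Namespace="AWS/RDS",
        MetricName="FreeableMemory",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "production-db"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=5,
        Threshold=2 * 1024 ** 3,  # alert below 2 GiB of freeable memory
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:dev-team-alerts"],
    )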
Timeline
All times are CET.
2nd of November 2022
13:39 - Performance monitoring reports an outage with a large number of errors.
13:40 - The issue resolves itself through a reboot of the database server.
13:40 - Development Team starts an investigation of the issue.
13:43 - Development Team has identified the cause of the issue.
13:44 - Development Team starts looking into possible solutions for the issue.
14:16 - First notification from a customer to Ans about error messages.
14:48 - Development Team starts working on the fixes for the possible solutions.
20:50 - The first hotfix, which improves the performance of taking a digital test, is ready.
21:42 - The second hotfix, which increases the memory size of the database server, is ready.
21:53 - The first hotfix is deployed.
22:36 - The second hotfix is deployed.