This article gives an overview of all post mortem reports. A post mortem report is an incident report written by the Development Team of Ans after an incident has occurred and been resolved. Post mortem reports are written for priority 1 incidents, meaning incidents where both the impact and the urgency were high. Examples of a priority 1 incident are an outage of the whole platform or a malfunction of a business critical part of the platform for all users, such as the mark calculation of assignments. We work to reduce the number of priority 1 incidents, because we understand how disruptive outages and malfunctions of business critical parts of Ans are for our users.
Incident summary
Our auto scaling scaled too aggressively, causing our database to run out of memory. The database rebooted itself, which resolved the issue.
The issue was noticed by our performance and error monitoring tool on Wednesday the 2nd of November at 13:39. The technical team started an investigation right away and at 13:43 the cause was identified.
Leadup
The release of October 16th 2022 introduced a new auto scaling algorithm to better scale our application and background servers based on the expected usage of our platform in an automated way. The algorithm scaled too aggressively causing over 1000 connections to the database.
In addition to the number of connections, a change in the codebase caused certain database queries to use more database memory than required while a digital test was taking place. The database server restarted itself due to the increase in database connections and the lack of freeable memory.
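One way to keep an auto scaler from exhausting database connections is to cap the instance count by a connection budget. The sketch below is a minimal illustration, not the algorithm Ans uses: the function names, the pool size per instance and the reserved connection count are all assumptions made for the example.

```python
def max_instances(db_max_connections: int, pool_size_per_instance: int, reserved: int = 0) -> int:
    """Largest instance count whose combined connection pools fit the database limit.

    All parameters are hypothetical example values, not Ans configuration.
    """
    if pool_size_per_instance <= 0:
        raise ValueError("pool_size_per_instance must be positive")
    usable = db_max_connections - reserved  # keep some connections for admin tasks
    return max(0, usable // pool_size_per_instance)


def clamp_scale_target(desired: int, db_max_connections: int,
                       pool_size_per_instance: int, reserved: int = 0) -> int:
    """Limit the scaler's desired instance count to the connection budget."""
    return min(desired, max_instances(db_max_connections, pool_size_per_instance, reserved))


if __name__ == "__main__":
    # With a limit of 1000 connections, 50 reserved and 25 per instance,
    # a request for 60 instances is clamped to 38.
    print(clamp_scale_target(60, 1000, 25, reserved=50))
```

With such a cap, the scaler can still react to expected usage, but the total number of connections it can open stays below the database limit.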
Impact
The database server rebooted itself within one minute after it went down. During the reboot, all requests accessing the database were affected. An example of such a request is Ans attempting to save an answer given by a participant in a digital test. Between 13:39 and 13:40, around 2975 of the 7162 requests failed (roughly 41.5%). The schools that were conducting digital tests have been sent a list of the assignments that were ongoing and impacted by this issue. The requests of participants who were taking a digital test were retried once the database server was up again.
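Retrying a failed request once the database is reachable again can be sketched as a retry loop with exponential backoff. This is an illustrative sketch only: the `DatabaseUnavailable` exception, the `retry` helper and its default values are hypothetical and do not describe how Ans implements retries.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error raised while the database server is rebooting."""


def retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds, waiting longer after each failure.

    The delay doubles after every failed attempt (0.5 s, 1 s, 2 s, ...).
    The last failure is re-raised so the caller can report it.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except DatabaseUnavailable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = {"count": 0}

    def save_answer():
        # Fails twice to simulate the reboot window, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise DatabaseUnavailable()
        return "saved"

    print(retry(save_answer, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.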
Response
The Development Team responded to the incident with an investigation into the issue. The issue was identified at 13:43. After identifying the issue, the Development Team started looking into possible solutions. At 14:18, the Development Team started working on several fixes for the issue. The first solution was to improve the database queries that were changed with the release of October 16th. An index was added, which speeds up these queries and reduces the amount of memory they need. The second solution was to increase the amount of memory of the database server. The first solution was deployed as a hotfix at 21:53 and the second solution was deployed as a hotfix at 22:36.
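The effect of adding an index can be illustrated with a small, self-contained example. The sketch below uses SQLite from the Python standard library purely for demonstration; the table, column and index names are invented, and the database engine and queries used by Ans are not described in this report.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan SQLite chooses for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (id INTEGER PRIMARY KEY, participant_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO answers (participant_id, body) VALUES (?, ?)",
    [(i % 100, "answer") for i in range(1000)],
)

sql = "SELECT * FROM answers WHERE participant_id = ?"

# Without an index the engine has to scan every row of the table.
plan_before = query_plan(conn, sql, (42,))

# With an index it can look up the matching rows directly.
conn.execute("CREATE INDEX idx_answers_participant ON answers (participant_id)")
plan_after = query_plan(conn, sql, (42,))

print(plan_before)
print(plan_after)
```

The first plan reports a full table scan, while the second reports a search that uses `idx_answers_participant`. Reading fewer rows per query is what reduces both query time and memory pressure.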
Recovery
The Development Team has implemented a few measures to prevent this from happening again:
- Increased the memory size of the database server.
- Made future plans (targeted for Q4 2022) to improve the configuration of the database server settings, which will improve the database server memory usage.
- Made improvements to the database queries that affect performance while taking a digital test.
- Improved the monitoring on database memory usage, so we are notified earlier in case of low freeable memory.
- Hotfixed the performance improvements and the increase in memory size on the same day.
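The earlier notification on low freeable memory mentioned in the list above can be pictured as a threshold check. This is a minimal sketch under assumed values: the function name and the 25% and 10% thresholds are hypothetical and are not the settings of the monitoring tool Ans uses.

```python
def memory_alert_level(freeable_bytes: int, total_bytes: int,
                       warn_ratio: float = 0.25, critical_ratio: float = 0.10) -> str:
    """Classify database memory headroom as 'ok', 'warning' or 'critical'.

    The thresholds are example values; a warning level well above the
    critical level gives the team time to react before memory runs out.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = freeable_bytes / total_bytes
    if ratio < critical_ratio:
        return "critical"
    if ratio < warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    GIB = 1024 ** 3
    for freeable in (8 * GIB, 3 * GIB, 1 * GIB):
        print(freeable // GIB, "GiB free:", memory_alert_level(freeable, 16 * GIB))
```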
Timeline
All times are CET.
2nd of November 2022
13:39 - Performance monitoring reports an outage with a lot of errors.
13:40 - Issue resolved itself by a reboot of the database.
13:40 - Development Team starts an investigation of the issue.
13:43 - Development Team has identified the cause of the issue.
13:44 - Development Team starts looking into possible solutions for the issue.
14:16 - First notification of a customer towards Ans of error messages.
14:48 - Development Team starts working on the fixes of the possible solutions.
20:50 - The first hotfix which improves the performance of taking a digital test is ready.
21:42 - The second hotfix which increases the memory size of the database server is ready.
21:53 - The first hotfix is deployed.
22:36 - The second hotfix is deployed.
Incident summary
All times in this document are recorded in UTC+2, except for times mentioned in the root cause analysis of Cloudflare. On the 21st of June 2022 at 08:36 it was reported that Ans and several of the services it uses were unavailable. This event was triggered by an issue with Cloudflare services. Ans relies on Cloudflare for its DNS management. The other services besides Ans that were impacted by this incident include:
- Zendesk
- SorryApp
- Mailerlite
This incident is identified as a priority 1, which is a major incident with a significant impact. Users were unable to contact Ans during this incident due to the unavailability of Zendesk and Mailerlite. This incident affected 100% of users.
Leadup
Cloudflare first reported on the issue in an incident report. This incident report can be found here: https://www.cloudflarestatus.com/incidents/xvs51y9qs9dj. After the issue was fixed, they also reported on the cause and impact of the issue. This report can be found here: https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/.
Impact
The issue was discovered at 08:36 on the 21st of June, when the Customer Team noticed the unavailability of Ans and several of its services. These services also make use of Cloudflare and include Mailerlite, SorryApp and Zendesk. Mailerlite is used by Ans to contact its users through email. SorryApp is used to inform users about the status of Ans and any issues with it via status.ans.app. Zendesk is used to respond to support tickets and to answer emergency phone calls. For 33 minutes, between 08:36 and 09:09 on the 21st of June, Ans was unavailable alongside these services. Users were unable to contact Ans support due to the unavailability of these services. Until 09:04 on the 21st of June, the status page incorrectly displayed that all systems were operational. The status could not be updated due to the unavailability of SorryApp.
Response
The Customer Team informed the Development Team about the issue through Slack on the 21st of June at 08:41. An investigation was opened by the technical team on the 21st of June at 08:45. Ans was unable to communicate properly with customers during this incident: it was not possible to respond to tickets or phone calls as Zendesk was unavailable, to update the status page as SorryApp was unavailable, or to send emails through support as Mailerlite was unavailable. Because these platforms were unavailable, Ans was not able to share any response following the regular communication procedures.
Recovery
- The Development Team began investigating the issue by locating what caused Ans to be unavailable.
- Once it had been identified that the issue was due to an outage in Cloudflare, the Development Team investigated the impact, by verifying which services Ans uses that are dependent on Cloudflare.
- During the investigation, the Cloudflare status page was being monitored.
- Once a fix had been implemented by Cloudflare, the results of this fix were being monitored.
- Cloudflare officially confirmed that the issue had been resolved.
- The customers were informed that the issue had been resolved.
We regret that this issue has occurred and we will look for ways to improve our internal processes. This specific incident was caused by an issue at one of the services we use, so solving it was outside of our control. However, this incident did show us that all of the services we use can be affected at the same time, and that we need to account for this.
Backup processes
First of all, we will look into ways for us to improve our back-up process. This incident caused many of our services to be down at the same time. This resulted in us not being able to reach critical information on the processes in place for these situations (i.e. contact information and email services). We will look into developing a process to create and maintain backups to account for this.
Status page
Second of all, we will reflect on how we can improve the updates on our status page so that they align with possible downtime of our key services. We will do two things: look into the possibility of manually updating the status page in another way, and check the possibilities of automatically updating the status page by connecting it to our services.
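Automatically deriving the status page state from the health of individual services could, for instance, look like the sketch below. It is a hypothetical illustration: the function name, the status labels and the example check results are assumptions, and it says nothing about the SorryApp API. A real checker would also need to run somewhere that does not depend on the same provider as the services it monitors.

```python
from typing import Mapping


def overall_status(checks: Mapping[str, bool]) -> str:
    """Summarise per-service health checks into one status label.

    `checks` maps a service name to True (healthy) or False (failing).
    The labels are example values, not those of an existing status page.
    """
    if not checks:
        return "unknown"
    failed = [name for name, healthy in checks.items() if not healthy]
    if not failed:
        return "operational"
    if len(failed) == len(checks):
        return "major outage"
    return "degraded"


if __name__ == "__main__":
    # Example results of health checks during an outage of a shared provider.
    print(overall_status({"platform": False, "support": False, "email": False}))
    print(overall_status({"platform": True, "support": False, "email": True}))
```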
Technical improvements
Thirdly, we will research possibilities for the platform to stay live, even if this happens again to one of the services we use. We will take a critical look at our dependence on these services to see if we can make any improvements in that area.
Timeline
21st of June 2022
08:36 - The Customer Team noticed the unavailability of Ans and several of its services.
08:41 - The Customer Team informed the technical team that Ans and several of its services were unavailable.
08:45 - An investigation has been opened by the Development team.
08:59 - The Development team discovers that Cloudflare, which is used by Ans and several of its services, suffered an outage.
09:09 - A fix was deployed by Cloudflare.
09:18 - The customers have been informed that the issues have been resolved.
Incident summary
All times in this document are recorded in UTC+2. On the 3rd of May 2022 at 12:49 and 17:45, users reported an issue where multiple choice submissions were improperly recognised for the first form in written exams, causing students to be incorrectly graded. The event was triggered by adding the possibility of shuffling the answer options in the preview of an assignment on April 10th. This addition contained a change to the randomisation of written assignments, causing the seed used for the randomisation to not be saved for the first form. The event came to light through two support tickets sent to the Customer Team. The Development Team started working on the event by investigating what caused the incident. This incident is identified as severity level 3, which is a minor incident with a low impact. However, for users that were impacted by this incident, it was identified as severity level 1, which is a major incident with a significant impact. This event affected 0.003% of our users.
Leadup
A change was introduced in the release of April 10th 2022, adding the possibility to shuffle the answer options in the preview of an assignment. The issue was discovered at 12:49 on the 3rd of May 2022, when the Customer Team received a ticket describing that Ans does not register that the answer options have been shuffled. The change resulted in a bug for written assignments where the option 'shuffle the answer options' was enabled. The first form did not contain the seed used for the randomisation and was not recognised as shuffled. Because of this, multiple choice submissions were improperly recognised, causing students to be graded incorrectly.
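Why a missing seed leads to wrong grades can be shown with a small sketch of seeded shuffling. It is only an illustration of the general technique: the function names and the use of Python's `random` module are assumptions and do not reflect the actual Ans code.

```python
import random


def shuffled_options(options, seed):
    """Return the answer options in the order determined by the seed."""
    rng = random.Random(seed)  # the same seed always gives the same order
    order = list(options)
    rng.shuffle(order)
    return order


def option_at(options, seed, marked_position):
    """Translate a marked position on a shuffled form back to the option it shows."""
    return shuffled_options(options, seed)[marked_position]


if __name__ == "__main__":
    options = ["A", "B", "C", "D"]
    seed = 2022  # hypothetical seed stored together with the generated form

    printed = shuffled_options(options, seed)
    print("Printed order:", printed)

    # With the stored seed, the box marked by the student maps to the right option.
    print("Recognised with seed:", option_at(options, seed, 0))

    # Without the seed, the form is treated as unshuffled, so position 0 is read
    # as the first original option, which may not be the option the student saw.
    print("Recognised without seed:", options[0])
```

Because the order can only be reproduced from the seed, a form that is stored without it cannot be mapped back to the original answer options.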
Impact
Ans has reviewed all assignments that were generated and uploaded between April 11th and May 4th to determine which results were affected. The scope could be limited to assignments which had shuffling enabled, contained a multiple choice question and where the preview (001.pdf) was used by a student. We identified six results of which two were reported to the Customer team. All impacted schools were informed.
Response
A side conversation was opened in Slack by the Customer Team on the 3rd of May at 15:22, informing the Development Team of the issue. An investigation was opened by the Development Team on the 3rd of May at 17:45. The Customer Team responded to the ticket on the 4th of May at 09:16, informing the user that the Development Team would be investigating the issue.
Recovery
First, the Development Team began investigating the issue by locating the change that introduced this issue. Second, a hotfix was made to ensure this issue does not persist. Third, once the breaking change had been found, the Development Team investigated the impact this change had, by looking at which assignments could have been influenced between the period of April 11th 2022 and May 4th 2022. Fourth, the exams that were impacted during this incident were reprocessed and verified, by confirming that the submissions in the assignments are correctly graded. Finally, the customers were informed that the issue had been resolved.
We regret that this issue has occurred and have reviewed our testing procedures to see where we can improve. Ans has three stages of testing: code review, automated tests and a manual test in the interface of our testing environment. This issue should have been found during manual testing, but in this case only the generation was tested, not the impact when uploading the generated form. We have expanded the steps in manual user testing to look more closely at the consequences of a change at a later stage, and we have added automated tests that detect this issue.
Timeline
3rd of May 2022
12:49 - The Customer Team received a ticket detailing that Ans does not recognise answers that have been shuffled.
15:22 - A side conversation was opened by the customer team asking the Development Team for more information.
17:39 - The Customer Team received a second ticket detailing that Ans is showing different answer options for an assignment.
17:45 - An investigation has been opened by the Development Team.
18:38 - The Development Team discovers the change that was made which created this issue.
19:01 - The Development Team issues a hotfix, making sure the issue does not persist and starts investigating the impact.
4th of May 2022
09:16 - The issuer of the first ticket was informed that the Development Team is investigating the issue.
09:16 - The issuer of the second ticket was informed that the Development Team is investigating the issue.
10:45 - The Development Team has discovered which users and results were affected by this incident.
11:30 - The Development Team has reprocessed the assignments that were impacted by this incident.
11:50 - The Development Team has confirmed that the issue has been fixed, verified that the assignments are correctly graded and that the issue has been resolved.
5th of May 2022
11:00 - The customers have been informed that the issues have been resolved.
Incident summary
A bug in the code caused grades to be outdated if the points of a multiple choice answer were changed after the test had been taken. It only affects assignments that used the guess correction and only results where the student did not choose the multiple choice answer that was changed.
The bug was noticed by a customer, who sent in a ticket on Friday February 25th at 09:48. The message was shared with the Development Team and an investigation was launched. At 15:05, the cause was identified and at 16:19, the customer was notified. The grades were recalculated for that specific assignment to fix the issue.
Leadup
The bug has been present ever since the guess correction option was introduced in Ans. The impact of changing the points of a multiple choice answer option on the results of users who didn’t choose that specific option had never been taken into account.
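The sketch below illustrates why changing the points of one answer option can affect students who did not choose it. It assumes a common form of guess correction, in which the score expected from random guessing is subtracted from the earned score. This formula, the function names and the example values are assumptions for illustration only; the formula Ans uses is not described in this report.

```python
def guess_score(questions):
    """Expected score from guessing: the mean of the option points per question."""
    return sum(sum(points) / len(points) for points in questions)


def corrected_score(earned, questions):
    """Rescale the earned score so that pure guessing is worth zero (assumed formula)."""
    max_total = sum(max(points) for points in questions)
    guess = guess_score(questions)
    if max_total == guess:
        raise ValueError("guess correction is undefined when guessing yields full marks")
    return max(0.0, (earned - guess) / (max_total - guess) * max_total)


if __name__ == "__main__":
    # Two questions with four options each; each list holds the points per option.
    before = [[1, 0, 0, 0], [1, 0, 0, 0]]
    # Afterwards the second option of question 2 is also awarded 1 point.
    after = [[1, 0, 0, 0], [1, 1, 0, 0]]

    # The student answered question 1 correctly and chose the third option of
    # question 2, so the earned score (1 point) is the same in both cases.
    print(corrected_score(1.0, before))
    print(corrected_score(1.0, after))
```

Under this assumed formula the earned points stay the same, but the corrected score drops because guessing has become more rewarding. A result that is not recalculated after such a change therefore keeps a grade that is too high.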
Impact
After the cause of the incident had been identified, the Development Team continued the investigation to determine the impact on previous assignments. The bug caused nine assignments to have incorrect grades. All affected results had a grade that was too high, so no students were disadvantaged.
Response
The Development Team responded to the incident with an investigation into the issue. The issue was identified on Friday February 25th at 15:21. After identifying the issue, the Development Team started working on a fix for the issue. The fix was deployed with a hotfix on Saturday February 26th at 14:56.
Recovery
A hotfix has been deployed and the affected customer has been notified.
Timeline
All times are CET.
25th of February 2022
09:48 - Support ticket sent in by Erasmus University Rotterdam.
12:25 - Ticket forwarded to Development Team for investigation.
14:29 - Development Team starts investigation of the issue.
15:05 - Development Team has identified the cause of the issue and starts working on a fix.
20:42 - Hotfix is ready.
26th of February 2022
14:56 - Hotfix is deployed.