Incident Summary
All times in this document are recorded in UTC+2 (CEST).
On September 7th, 2023, at 15:36, the support team received a ticket stating that the formula described in the Help Center for calculating the P-value did not align with the value displayed in the question insights. The ticket was forwarded to the technical team on September 8th at 12:04. The technical team confirmed that the formula described in the Help Center was incorrect and notified the support team on September 13th at 14:27 to update the article.
While validating the formula, the technical team discovered a discrepancy between the values presented in the interface and the values produced by the existing formula in the codebase. An investigation was then started to determine the cause and the impact of the issue.
Lead-up
The support team received a ticket on September 7th at 15:36 stating that the formula described in the Help Center for calculating the P-value did not align with the value displayed in the question insights.
Fault
During the investigation, the technical team discovered three edge cases in which the recalculation of the P-, Rit- and Rir-values was not triggered:
- When a participant submitted an unanswered question, the P-value was not recalculated.
- When a participant submitted an assignment for which they received zero points in total, the Rit- and Rir-values were not recalculated.
- When a result was removed, the P-, Rit- and Rir-values were not recalculated.
Note that whenever the grade of any participant in the assignment changed, all Rit- and Rir-values were recalculated. The P-value was also recalculated when another participant answered the question, and that recalculation took the unanswered questions into account.
The investigation showed that the issue had been present since July 23rd, 2022.
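To illustrate why a skipped recalculation leaves stale values behind, here is a minimal sketch that uses common textbook definitions of the three statistics: the P-value as the mean item score divided by the maximum item score, the Rit-value as the correlation between item score and total score, and the Rir-value as the correlation between item score and the total score without that item. These definitions, the function names, and the sample scores are assumptions for illustration only; they are not taken from the Ans codebase or the Help Center article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def item_statistics(item_scores, total_scores, max_item_score):
    """Return (P, Rit, Rir) for one question under the assumed definitions."""
    p = sum(item_scores) / (len(item_scores) * max_item_score)
    rest_scores = [t - i for i, t in zip(item_scores, total_scores)]
    rit = pearson(item_scores, total_scores)
    rir = pearson(item_scores, rest_scores)
    return p, rit, rir


# Hypothetical results for a 1-point question. The last participant left the
# question unanswered (scored 0) and received zero points for the assignment.
item_scores = [1, 1, 0, 0]
total_scores = [5, 4, 2, 0]

# Values when every result is taken into account.
p_all, rit_all, rir_all = item_statistics(item_scores, total_scores, 1)

# Values when the last result is left out, as a stale calculation would do.
p_stale, rit_stale, rir_stale = item_statistics(item_scores[:-1], total_scores[:-1], 1)

print(p_all, rit_all, rir_all)
print(p_stale, rit_stale, rir_stale)
```

In this sketch, leaving out a single result shifts all three values, which is the kind of discrepancy that persists until the next event triggers a recalculation.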
Impact
The technical team performed an initial impact analysis on a subset of 35,307 questions, comparing the existing analysis values to the recalculated values. The subset consisted of all questions that were used in assignments in the past 6 months and had more than 100 results. The platform itself states that Ans cannot be certain the insights are reliable when a question has fewer than 100 results. A subset was chosen because of the potential severity of the impact: verifying all analysis values would have taken significantly longer.
The results show that out of 35,307 questions:
- 1,754 questions (5%) contained an incorrect P-value, with a mean deviation of 0.02
- 871 questions (2%) contained an incorrect Rit-value, with a mean deviation of 0.03
- 971 questions (3%) contained an incorrect Rir-value, with a mean deviation of 0.03
The mean deviation was calculated by taking the mean of the absolute differences between the existing and the recalculated analysis values of the impacted questions.
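The comparison described above can be sketched as follows. This is a minimal illustration, not the script used by the technical team: the function name `impact_summary`, the tolerance, and the sample values are hypothetical. Only the counts in the final lines (1,754, 871 and 971 out of 35,307) come from this report.

```python
from statistics import mean


def impact_summary(stored, recalculated, tolerance=1e-9):
    """Compare stored and recalculated values for the same list of questions.

    Returns (number of impacted questions, share of impacted questions,
    mean absolute deviation over the impacted questions only).
    """
    diffs = [abs(s - r) for s, r in zip(stored, recalculated)]
    impacted = [d for d in diffs if d > tolerance]
    share = len(impacted) / len(diffs) if diffs else 0.0
    mean_deviation = mean(impacted) if impacted else 0.0
    return len(impacted), share, mean_deviation


# Made-up P-values for four questions; two of them differ after recalculation.
stored = [0.50, 0.80, 0.25, 0.60]
recalculated = [0.50, 0.75, 0.25, 0.63]
count, share, deviation = impact_summary(stored, recalculated)
print(count, share, round(deviation, 2))

# The percentages in the list above follow from the reported counts.
for incorrect in (1754, 871, 971):
    print(incorrect, f"{incorrect / 35307:.1%}")
```

Note that the deviation is averaged over the impacted questions only, so questions whose values were already correct do not lower the mean.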
Detection
The incident was detected when the support team received a ticket stating that the formula for calculating the P-value was incorrect. While investigating this report, the technical team discovered additional issues with the analysis values.
Response
The support team informed the technical team about the initial issue of the incorrect formula on September 8th at 12:04. The technical team started an investigation and confirmed that the formula was incorrectly documented. During this investigation, the technical team discovered other issues with the analysis values. On September 13th at 14:27, the technical team informed the support team that the formula should be updated in the Help Center and that it would further investigate the discrepancy in the analysis values. On September 15th at 16:24, a possible cause of the discrepancy in the P-values was found. On September 19th, a possible cause of the discrepancy in the Rit- and Rir-values was found. Because of the large number of possibly impacted questions, a subset of samples was used to determine the initial impact. While the impact analysis was running and being investigated, a hotfix was deployed on September 21st at 16:15 to prevent this issue from recurring. Furthermore, a task was created to rectify the existing incorrect values. An additional task was created to write a post-mortem informing our users about the issue and the next steps in resolving it. These tasks are scheduled to be completed within a week.
Recovery
On September 21st at 16:55, a hotfix was deployed to prevent this issue from occurring again. The hotfix applies only to newly submitted assignments. Additionally, a patch release email listing the issues that have been resolved in the past week will be sent to all administrators. A task has been created to recalculate the impacted values and is scheduled for September 23rd. We estimate that the recalculation will be completed in the week of September 25th.
Update:
The recalculation of the P-, Rit- and Rir-values was completed on September 24th at 17:01.
Timeline
7th of September 2023
- 15:36 - Initial ticket received about the formula in the Help Center being incorrect
- 17:02 - Support team sent a reply to the customer that the issue would be investigated
8th of September 2023
- 12:04 - Issue is shared with the technical team
13th of September 2023
- 14:27 - Technical team confirms that the formula is incorrect and reports that the calculated analysis values differ from the values displayed in the interface
15th of September 2023
- 16:24 - Possible cause for the issue of incorrect P-values has been discovered and shared with the support team
- 16:49 - Task created to fix the issue of the incorrect P-values.
19th of September 2023
- 16:44 - Discovery of the incorrect Rit- and Rir-values, alongside the incorrect P-values
20th of September 2023
- 14:45 - Technical team performs the impact analysis of the subset
- 14:55 - Technical team has started the development of the hotfix
21st of September 2023
- 15:10 - A task was created to update impacted analysis values
- 21:19 - Hotfix to ensure the issue does not persist has been deployed
23rd of September 2023
- 20:00 - Initiated the recalculation of the P-, Rit- and Rir-values
24th of September 2023
- 17:01 - Completed the recalculation of the P-, Rit- and Rir-values
Reflection
The issue remained unnoticed because there were few occurrences and no errors were directly visible to the user or appeared in Ans’ internal monitoring system. Alongside the hotfix, we have added an internal test that runs in our continuous integration process to prevent the issue from occurring again. We will also investigate how to improve our internal testing procedures so that edge cases are detected earlier during development and testing.