All times in this document are recorded in UTC+2 (CEST).
Summary
On the 18th of April, 2024, at 13:03, the support team received a ticket in which the sender reported that the recognition of multiple response questions in written assignments was not always correct. The support team requested more information and received it on the 24th of April, 2024, at 15:07. The support team forwarded the ticket to the technical team on the 25th of April, 2024, at 09:32.
The technical team was initially unable to reproduce the issue, which is why the investigation took longer than expected. The technical team successfully reproduced the issue on the 30th of April, 2024, and created a fix to ensure that multiple response questions with multiple recognition states remain ungraded. This is an initial fix; a task has been created to investigate improving the recognition of multiple response questions. The fix was included in the patch release that became available on production on the 4th of May, 2024, at 20:45.
Lead-up
The support team received a ticket on the 18th of April, 2024, at 13:03. In this ticket, the sender reported that the recognition of multiple response questions in written assignments was not always correct. The support team responded on the 18th of April, 2024, at 13:11, that more information was required, such as documentation and screenshots. The user provided this information on the 18th of April, 2024, at 14:04. The support team then asked the sender for links to the affected results on the 18th of April, 2024, at 14:40, to which the sender responded on the 24th of April, 2024, at 15:07. The ticket was forwarded to the technical team on the 25th of April, 2024, at 09:32, after which the investigation continued.
Fault
Multiple response questions use a neural network to detect selected choices, and this detection is more complex than for regular multiple choice questions because there are more recognition states to distinguish, such as corrections where a choice has been crossed out.
The neural network is trained on thousands of checkboxes completed in various states and produces a prediction when given a question answered by a student. Even though thousands of different cases are taken into account, there is still a very small chance that the neural network miscategorises a checkbox, as the difference between states can sometimes be subtle.
Multiple response answers are categorised into four different states:
0: The box has been left empty.
1: The box has a cross.
2: The box is fully coloured.
3: The box has been corrected because there is a cross outside of the fully coloured box.
Previously, if an answer only had boxes classified as state 1, those were selected. If an answer had boxes classified in states 1 and 2, only those in state 2 were selected. Boxes in state 3 were always ignored, as they are seen as being crossed out.
This strategy meant that if three boxes were checked, with two of them classified as category 2 and the third as category 1 (for example, if that box was less completely filled in, but still intended to be “fully coloured”), then only the two category-2 boxes would be counted as selected. This way of grading was introduced in the release of September 2nd, 2022.
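For illustration, here is a minimal sketch of the previous selection strategy in Python. The state values follow the list above; the names and function are hypothetical and not Ans's actual implementation.

```python
from enum import IntEnum

class BoxState(IntEnum):
    EMPTY = 0      # the box has been left empty
    CROSS = 1      # the box has a cross
    FILLED = 2     # the box is fully coloured
    CORRECTED = 3  # a cross outside the fully coloured box (crossed out)

def selected_choices_old(states: list[BoxState]) -> list[int]:
    """Pre-fix strategy (sketch): if any box is fully coloured (state 2),
    only those boxes count and plain crosses (state 1) are ignored;
    otherwise all crossed boxes count. State 3 is always ignored."""
    if any(s == BoxState.FILLED for s in states):
        return [i for i, s in enumerate(states) if s == BoxState.FILLED]
    return [i for i, s in enumerate(states) if s == BoxState.CROSS]

# The incident scenario: two boxes recognised as fully coloured and a third,
# less completely coloured box recognised as a cross. Only the first two are
# counted, even though the student intended to select all three.
print(selected_choices_old([BoxState.FILLED, BoxState.FILLED, BoxState.CROSS]))
# -> [0, 1]
```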
Impact
The impacted results did not always have their choices correctly recognised when the choices were marked using multiple corrections.
An impact analysis was performed by the technical team in order to find the results that were affected. When a PDF is uploaded to Ans, we split that PDF into separate jpg images and combine these pages into one PDF per participant. The prediction of the choice recognition is stored alongside the jpg images. Since the individual jpg images are only used temporarily to score the choices, Ans removes them after 90 days. Because of this, the impact analysis only covers the last 90 days.
The analysis was performed on the scans uploaded between the 7th of February, 2024, and the 7th of May, 2024, and was scoped to questions that contained multiple corrections and were not altered by a reviewer. The results were then manually filtered to only include those where responses were incorrectly identified.
The investigation resulted in 54 affected assignments, with 774 impacted submissions. This represents 2.19% of all multiple response submissions over that period.
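As an illustration, the scoping filter described above could look like the following sketch. The record fields are hypothetical and do not reflect Ans's actual schema.

```python
from datetime import date

# Illustrative records only; field names are hypothetical.
results = [
    {"uploaded_at": date(2024, 3, 1), "has_mixed_states": True, "altered_by_reviewer": False},
    {"uploaded_at": date(2024, 3, 2), "has_mixed_states": True, "altered_by_reviewer": True},
    {"uploaded_at": date(2024, 1, 5), "has_mixed_states": True, "altered_by_reviewer": False},
]

def in_scope(r: dict) -> bool:
    """Keep scans from the analysis window whose questions contained
    multiple corrections and were not altered by a reviewer."""
    return (
        date(2024, 2, 7) <= r["uploaded_at"] <= date(2024, 5, 7)
        and r["has_mixed_states"]
        and not r["altered_by_reviewer"]
    )

candidates = [r for r in results if in_scope(r)]  # only the first record remains
# These candidates were then manually filtered to keep only results where
# responses were actually misidentified.
```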
Detection
The findings of the user who sent the ticket were forwarded to the technical team for further investigation. The technical team was initially unable to reproduce the issue, which resulted in the investigation taking longer than expected.
Response
The technical team started an investigation after being notified by the support team. Once the issue was verified, a fix was created and included in the patch release deployed to production on the 4th of May, 2024.
Recovery
To prevent multiple response questions with ambiguous answers from being incorrectly graded, we have made changes to the way they are recognised.
With this update, we will no longer automatically grade the multiple response question if we identify different states of answers. Instead, the question will remain ungraded and will require manual review. This change is a safety precaution to prevent multiple response choices from being graded incorrectly.
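A minimal sketch of the new behaviour, assuming the state categories listed in the Fault section; the exact ambiguity criterion and the names used here are illustrative, not Ans's actual implementation.

```python
from typing import Optional

def selected_choices_new(states: list[int]) -> Optional[list[int]]:
    """Post-fix strategy (sketch), using the 0-3 state categories above.

    If the marked boxes mix more than one recognition state, the answer is
    treated as ambiguous and the question is left ungraded (None) so that a
    human reviews it; otherwise grading proceeds as before."""
    marked = {s for s in states if s != 0}  # ignore empty boxes
    if len(marked) > 1:
        return None  # ambiguous mix of states: leave ungraded for manual review
    return [i for i, s in enumerate(states) if s in (1, 2)]

print(selected_choices_new([2, 2, 1]))  # -> None: the incident case now goes to review
print(selected_choices_new([1, 1, 0]))  # -> [0, 1]
```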
Timeline
18th of April, 2024
- 13:03 - Support receives a ticket indicating that the grading of multiple response questions is not always correct.
- 13:11 - Support requests more information from the ticket sender, such as screenshots.
- 14:04 - User responds to support with requested information.
- 14:40 - Support team asks the user for links to the affected results.
24th of April, 2024
- 15:07 - The customer responds with the results.
25th of April, 2024
- 09:32 - Support team forwards the issue to the technical team for further investigation.
30th of April, 2024
- 14:24 - Technical team reproduces the issue.
2nd of May, 2024
- 11:41 - Technical team determines that the root cause of the issue is the imperfect tuning of the choice detection (when distinguishing between the different ambiguous recognition categories), and that the safest fix is to force ambiguous answers to be manually graded.
4th of May, 2024
- 20:45 - Fix is deployed on the production environment.
Reflection
When we introduced the grading system for multiple response questions in September 2022, we should have prioritised the reliability of the recognition rather than solely focusing on the ability to grade automatically. We recognised that the system was not 100% accurate, which can be attributed partly to the imperfect nature of trained classifier models and partly to variations in the level of care with which students completed checkboxes. However, we failed to clearly communicate this to our end users. We now realise that better communication was necessary, and that the previous level of error should not have been tolerated. Consequently, we have implemented additional safeguards to prevent ambiguous answers from being automatically graded. Furthermore, we are actively exploring measures to enhance the accuracy of our classification system for answers that still undergo automatic grading.
However, it is important to state that even with the newly added safeguards, which will greatly decrease the chance of incorrect automatic grading, it is impossible to guarantee 100% correctness for all response classification. For this reason, it is always a good idea to manually verify automatic grading, especially for multiple response questions, so that any potential issues can be found. More details of our grading process can be found in our help center article.
Following a meeting with the customer who initially reported this incident, we have become aware of the need to evaluate our reporting interval. While we typically update our customers as soon as we have sufficient information to provide a detailed explanation of the incident, we recognise that this process can sometimes take too long for our customers, who must also manage the expectations of their internal stakeholders. Therefore, we are currently investigating the tools available within Zendesk to establish reporting interval deadlines for high-priority incidents.
Actions
☐ Improve accuracy of classification system
☑ Improve reporting interval deadlines in Zendesk
Version | Date | Information
v1.0 | 10-05-2024 | Initial version
v1.1 | 17-05-2024 | Updated scope and impact in Impact section
v1.2 | 24-05-2024 | Added version table and further details in Reflection section