Note: The following information only applies to the new scan page.
With the release on January 5th, 2025, we introduced a new recognition model for automatically grading multiple-choice and multiple-response questions in written assignments.
Along with this, we introduced a new instruction for filling in multiple-choice and multiple-response questions.
Terminology
To clarify the terminology used in the rest of this article, the new recognition model recognises the following four states:
State 0: The circle/box has been left empty.
State 1: The circle/box has a cross inside it.
State 2: The circle/box is fully coloured.
State 3: The circle/box is fully coloured with a cross outside it, indicating the answer has been corrected.
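To make these states concrete, the sketch below shows one way to represent them in code. This is a minimal illustration only; the enum and its names are hypothetical and not part of Ans.

```python
from enum import Enum

class CircleState(Enum):
    """Hypothetical representation of the four recognition states."""
    EMPTY = 0      # State 0: the circle/box has been left empty
    CROSSED = 1    # State 1: the circle/box has a cross inside it
    FILLED = 2     # State 2: the circle/box is fully coloured
    CORRECTED = 3  # State 3: fully coloured with a cross outside it
```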
Changes to recognition
The new recognition model was trained to recognise these states, but unfortunately, it mistakenly identified fully coloured circles with a cross outside the circle (state 3) as fully coloured circles (state 2) in multiple-choice questions. More details about this incident can be found here.
To address this, on February 7th, 2025, we temporarily disabled automatic grading of multiple-choice questions when more than one answer option is selected.
Apart from this incident, we received feedback that the amount of manual review increased significantly compared to the previous recognition model. Before the January 5th, 2025, release, around 3% of multiple-choice questions were left for manual review, while the new recognition model leaves around 10% open for manual review. This difference was due to the stricter confidence threshold of the new recognition model, which only automatically graded a multiple-choice question if it was more than 97% certain about the selected answer.
Since we were temporarily only grading multiple-choice questions where a single answer option was selected, we lowered the recognition model's confidence threshold to 90% with the March 2nd, 2025, release. In that case, the recognition model only needs to distinguish between empty and selected circles.
Recent improvements (Released on April 27th, 2025)
Over the past few weeks, we have improved the recognition model for multiple-choice questions, and we would like to share the details of these changes, which will be released on April 27th, 2025. These updates only apply to the new scan page.
To enhance multiple-choice recognition accuracy, we have:
- Significantly increased the amount of training data
We increased the training data from 50,000 to 240,000 circles, taken from written assignments processed between December 2024 and March 2025.
- Conducted extensive manual testing
We manually filled out forms to mimic student behaviour and uploaded them to our test environment to verify that the recognition model functions as expected.
- Expanded our automated testing
We have strengthened our continuous integration test suite to validate future changes using real-world examples. Our continuous integration tests cover different assignment types, paper types, and scanner types. Per scenario, we cover all answer states, and with every change we make to the model or to the processing of its output, our integration tests check the impact of that change.
- Introduced variable confidence thresholds for different states
The recognition model easily distinguishes between empty, crossed, and fully coloured circles. However, differentiating between fully coloured circles (state 2) and fully coloured circles with a cross outside the circle (state 3) is more challenging. The resulting thresholds, illustrated in the sketch after this list, are:
- For states 2 and 3, the recognition model must be at least 97% confident before automatically grading state 2.
- For states 0 and 1, the recognition model must be at least 93% confident before automatically grading.
- Required manual review for state 3 (fully coloured with a cross outside the circle)
Our manual review of responses during the incident investigation revealed that student intent can be ambiguous when a cross is placed outside the circle.
Since determining intent requires analysing the full question, we will never automatically grade answers classified as state 3. This change also applies to multiple-response questions.
- Re-enabled automatic grading of single-response multiple-choice questions when multiple answer options are selected
In response to the February incident, where the recognition model mistakenly interpreted fully coloured circles with a cross outside the circle (state 3) as fully coloured circles (state 2), we temporarily disabled automatic grading for single-response multiple-choice questions when multiple answers were marked.
Now that the model has been significantly improved through increased training data, refined confidence thresholds, and enhanced testing, we are confident in the accuracy of state detection. As a result, this functionality has been re-enabled.
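To make the threshold and state 3 rules above concrete, here is a minimal sketch of the grading decision as described in this article. The function, names, and structure are hypothetical illustrations, not Ans's actual implementation.

```python
from enum import Enum

class CircleState(Enum):
    # Re-declared here so the sketch is self-contained.
    EMPTY = 0      # left empty
    CROSSED = 1    # cross inside the circle/box
    FILLED = 2     # fully coloured
    CORRECTED = 3  # fully coloured with a cross outside it

# Hypothetical thresholds matching the values quoted above:
# states 0 and 1 require at least 93% confidence; state 2 requires
# at least 97% because it must be distinguished from state 3.
THRESHOLDS = {
    CircleState.EMPTY: 0.93,
    CircleState.CROSSED: 0.93,
    CircleState.FILLED: 0.97,
}

def can_auto_grade(state: CircleState, confidence: float) -> bool:
    """Decide whether a single circle classification may be graded automatically."""
    # State 3 is never graded automatically: a cross outside the circle makes
    # student intent ambiguous, so it always goes to manual review.
    if state is CircleState.CORRECTED:
        return False
    return confidence >= THRESHOLDS[state]
```

For example, a circle classified as fully coloured with 95% confidence would be left for manual review under these thresholds, while the same classification at 98% confidence would be graded automatically.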
Handling state 3 in your workflow
Some customers have expressed that they do not wish to use state 3 and prefer only using states 0, 1, and 2. While we understand this preference, we believe that including state 3 results in more accurate grading. Without state 3, the recognition model selects the darkest bubble, which may not reflect the student's intent. By leaving state 3 for manual review, reviewers can ensure students receive the correct grades.
There is always the option to disable the written assignment instructions provided by Ans (per individual assignment) and add your own. Some customers require students to use a new form if they wish to make a second correction. This approach remains fully compatible with the new recognition model, as it ensures only states 0, 1, and 2 are present, which will be graded automatically if the confidence threshold is met.
Balancing accuracy and manual review
The new confidence thresholds were determined through recognition model validation, resulting in zero misclassifications on a validation set of 80,000 circles. That said, no recognition model is perfect, and there will always be a small chance of misclassification. We estimate that the chance of misclassification is less than 0.003% with the new recognition model. This estimate is based on slightly lowering the confidence thresholds, which resulted in 2 misclassifications out of the 80,000-circle validation set (~0.0025%).
Based on the validation set of 80,000 circles, we estimate that around 6% of multiple-choice questions will require manual review, a trade-off between reducing manual effort and maintaining accuracy. We believe this strikes a good balance between automation and reliability.
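As a quick sanity check, the estimated misclassification rate follows directly from the validation numbers quoted above; nothing here is new data.

```python
# Reproduce the misclassification estimate quoted above.
misclassified = 2
validation_size = 80_000

rate = misclassified / validation_size
print(f"{rate:.4%}")  # 0.0025%, below the stated 0.003% bound
```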
Please note that our model is trained using the specific states outlined in the front-page instructions. Any deviations from these instructions, such as circling a checkbox or marking it with an off-centre or unconventional cross, may not be recognised by the model as valid selections.
Given our current confidence levels, such instances are highly likely to require manual review.