All times in this document are recorded in CEST (UTC+2).
Summary
On April 7th, 2025 at 13:59, our technical team identified multiple instances of an unusual error in our error logging channel. The error stated that the authentication with the object store failed. The object store is a service used to store data online, in this case files uploaded to Ans.
At 14:36, the support team updated the status page to notify our users that we were experiencing issues with uploading and downloading files.
At 17:20, a hotfix was deployed to the production environment that solved the authentication issue with the object store and allowed the uploading and downloading of files again.
After the hotfix was deployed, the technical team started to investigate ways to recover files that were indefinitely stuck. By 18:18, the technical team finished recovering all possible files.
Lead-up
On April 7th, 2025, our technical team identified that there were multiple errors in our error logging channel, stating the authentication failed with the object store.
Fault
During the afternoon of April 7th, 2025, our object store provider performed unscheduled maintenance, after which authentication between our application and the object store failed.
Impact
For almost 4 hours, between 13:30 and 17:20, it was not possible to upload or export files in parts of our platform that rely on the object store, such as:
- Uploading exercise attachments
- Uploading test instructions
- Copying and converting of assignments with attachments
- Copying exercises with attachments
- Import and export of files that are processed by a background job. These can be found at https://ans.app/background_jobs
On the new scan page, files could still be uploaded but could not be fully processed. The final stage of processing, where separate pages are combined into PDFs, would fail, and attempting to view the created results would show the “Something went wrong” error page. The old scan page and the taking of other assignment types were not affected by this issue.
Recovery
At 17:20, a hotfix was deployed that allowed us to connect to the object store again. The technical team then started investigating the recovery process of the files and were able to recover the following:
- Files that were uploaded on the new scan page
- Copied and converted assignments with attachments
- Export of files processed by a background job
Timeline
7th of April
- 13:59 - An engineer notices multiple errors stating that authentication with the object store provider failed.
- 14:01 - The technical team confirms it is not possible to upload files in any environment.
- 14:15 - The technical team informs the support team of the ongoing issue with the uploading and downloading of files.
- 14:34 - The technical team emails the object store provider explaining the current situation and asking if they are experiencing an outage.
- 14:36 - Support updates the status page notifying users that we are experiencing issues with the uploading and downloading of files.
- 15:22 - The object store provider responds, stating they performed background maintenance.
- 16:38 - The technical team informs the support team that they have identified the cause of the issue and are verifying a solution.
- 17:20 - Hotfix is deployed to the production environment which lets the application connect with the object store again.
- 17:39 - Support updates the status page to confirm that a hotfix has been deployed that should resolve the issue and that the technical team is still actively monitoring the situation.
- 18:18 - The technical team fixes uploads that could be recovered.
- 18:21 - Support updates the status page to confirm that the incident has been resolved.
Reflection
This incident showed that, even with monitoring in place, third-party service interruptions can still cause major disruptions in the platform. In this case, our object store provider performed unscheduled maintenance, which caused access failures.
Although our monitoring alerts were successfully triggered and we were able to quickly identify the issue, the platform’s dependence on continuous third-party availability still led to downtime across several areas within the platform.
Moving forward:
- We are currently in contact with our object store provider and are actively exploring solutions to prevent similar downtime from occurring in the future.
- We are improving our existing processes to better handle third-party unavailability and exploring alternative solutions to implement more robust fail-safes.
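As an illustration only, one possible fail-safe of this kind is to park uploads that fail while the object store is unreachable and retry them once it is available again. The Python sketch below is not the platform's actual implementation: `DeferredUploader`, `ObjectStoreUnavailable`, and the `put` callable are hypothetical names, and the pending queue is kept in memory for brevity, whereas a real system would need durable storage.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class ObjectStoreUnavailable(Exception):
    """Hypothetical error raised when authentication or the connection to the store fails."""


class DeferredUploader:
    """Try uploads immediately; park failed ones so they can be retried later."""

    def __init__(self, put: Callable[[str, bytes], None]) -> None:
        # `put` is a hypothetical function that writes one object to the store.
        self._put = put
        # In-memory queue for brevity (assumption); not durable across restarts.
        self._pending: Deque[Tuple[str, bytes]] = deque()

    def upload(self, key: str, data: bytes) -> bool:
        """Return True if stored now, False if parked for a later retry."""
        try:
            self._put(key, data)
            return True
        except ObjectStoreUnavailable:
            self._pending.append((key, data))
            return False

    def retry_pending(self) -> int:
        """Retry parked uploads in order, stopping at the first failure.

        Returns the number of uploads that were recovered.
        """
        recovered = 0
        while self._pending:
            key, data = self._pending[0]
            try:
                self._put(key, data)
            except ObjectStoreUnavailable:
                break  # store is still unavailable; keep the rest parked
            self._pending.popleft()
            recovered += 1
        return recovered

    @property
    def pending_count(self) -> int:
        """Number of uploads still waiting for the store to come back."""
        return len(self._pending)
```

With a wrapper along these lines, a caller could tell the user that a file was accepted but is still pending, and a periodic job could call `retry_pending()` once the store responds again.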