Incident summary
All times in this document are in UTC+2, except for the times mentioned in Cloudflare's root cause analysis. On the 21st of June 2022 at 08:36, it was reported that Ans and several of its services were unavailable. This event was triggered by an issue with Cloudflare, which Ans relies on for its DNS management. The other services besides Ans that were impacted by this incident include:
- Zendesk
- SorryApp
- Mailerlite
This incident is classified as priority 1, which is a major incident with a significant impact. Users were unable to contact Ans during this incident due to the unavailability of Zendesk and Mailerlite. This incident affected 100% of users.
Leadup
Cloudflare first reported the issue in an incident report, which can be found here: https://www.cloudflarestatus.com/incidents/xvs51y9qs9dj. After the issue was fixed, Cloudflare also published a report on its cause and impact, which can be found here: https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/.
Impact
The issue was discovered at 08:36 on the 21st of June, when the Customer Team noticed the unavailability of Ans and several of its services. These services also make use of Cloudflare and include Mailerlite, SorryApp and Zendesk. Mailerlite is used by Ans to contact its users through email. SorryApp is used to inform users about the status of Ans and any issues with it via status.ans.app. Zendesk is used to respond to support tickets and to answer emergency phone calls. For 33 minutes, between 08:36 and 09:09 on the 21st of June, Ans and these services were unavailable, so users were unable to contact Ans support. Until 09:04 on the 21st of June, the status page incorrectly displayed that all systems were operational. The status could not be updated due to the unavailability of SorryApp.
Response
The Customer Team informed the Development Team about the issue through Slack on the 21st of June at 08:41. The Development Team opened an investigation on the 21st of June at 08:45. Ans was unable to communicate properly with its customers during this incident:
- It was not possible to respond to tickets or phone calls, as Zendesk was unavailable.
- It was not possible to update the status page, as SorryApp was unavailable.
- It was not possible to send emails through support, as Mailerlite was unavailable.
Because these platforms were unavailable, Ans was not able to share any response through its regular communication procedures.
Recovery
- The Development Team began the investigation by locating what caused Ans to be unavailable.
- Once it had been identified that the issue was due to an outage at Cloudflare, the Development Team investigated the impact by verifying which of the services Ans uses depend on Cloudflare.
- During the investigation, the Cloudflare status page was monitored.
- Once Cloudflare had implemented a fix, the results of this fix were monitored.
- Cloudflare officially confirmed that the issue had been resolved.
- The customers were informed that the issue had been resolved.
We regret that this issue occurred and we will look for ways to improve our internal processes. This specific incident was caused by an issue at one of the services we use, so solving it was outside of our control. However, this incident did show us that all of the services we use can be affected at the same time, and that we need to account for this.
Backup processes
First of all, we will look into ways to improve our backup process. This incident caused many of our services to be down at the same time. As a result, we were not able to reach critical information on the processes in place for these situations (i.e. contact information and email services). We will look into developing a process to create and maintain backups to account for this.
Status page
Secondly, we will reflect on how we can improve the updates on our status page so that they reflect possible downtime of our key services. We will do two things: look into the possibility of manually updating the status page in another way, and investigate the possibilities of automatically updating the status page by connecting it to our services.
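As an illustration of the automatic option, the sketch below derives one overall status from per-service health checks. It is a minimal sketch under stated assumptions, not the process Ans has adopted: the `Status` levels, the function names `is_reachable` and `overall_status`, and the rule that maps failures to a status are all hypothetical. Publishing the result to a status page is left out, because that step depends on the status provider.

```python
import urllib.error
import urllib.request
from enum import Enum


class Status(Enum):
    # Hypothetical status levels, chosen for illustration only.
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    MAJOR_OUTAGE = "major outage"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx response in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts and HTTP errors
        # all count as "not reachable".
        return False


def overall_status(checks: dict[str, bool]) -> Status:
    """Combine per-service results (True = reachable) into one status."""
    if not checks:
        raise ValueError("at least one check result is required")
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return Status.OPERATIONAL
    if len(failed) == len(checks):
        return Status.MAJOR_OUTAGE
    return Status.DEGRADED


# Example with made-up results; in practice each value would come from
# is_reachable() called against the service in question.
print(overall_status({"ans": False, "zendesk": False, "mailerlite": True}).value)
# prints "degraded"
```

A check like this only helps if it runs from, and publishes to, infrastructure that does not share the failing dependency. During this incident the status could not be updated because SorryApp itself was unavailable.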
Technical improvements
Thirdly, we will research possibilities for the platform to stay live, even if one of the services we use suffers an outage like this again. We will take a critical look at our dependence on these services to see if we can make any improvements in that area.
Timeline
21st of June 2022
08:36 - The Customer Team noticed the unavailability of Ans and several of its services.
08:41 - The Customer Team informed the Development Team that Ans and several of its services were unavailable.
08:45 - The Development Team opened an investigation.
08:59 - The Development Team discovered that Cloudflare, which is used by Ans and several of its services, had suffered an outage.
09:09 - Cloudflare deployed a fix.
09:18 - The customers were informed that the issues had been resolved.
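The intervals between these events can be worked out with a short calculation. The sketch below uses only the times listed in the timeline; the event labels and the helper `minutes_between` are hypothetical names introduced for illustration.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"

# Times copied from the timeline above (UTC+2, 21st of June 2022).
events = {
    "noticed": "08:36",
    "development_team_informed": "08:41",
    "investigation_opened": "08:45",
    "cause_identified": "08:59",
    "fix_deployed": "09:09",
    "customers_informed": "09:18",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both on the same day."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


print(minutes_between(events["noticed"], events["fix_deployed"]))         # prints 33
print(minutes_between(events["noticed"], events["cause_identified"]))     # prints 23
print(minutes_between(events["fix_deployed"], events["customers_informed"]))  # prints 9
```

The first figure matches the 33 minutes of unavailability reported under Impact.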