Incident summary
- The root-cause investigation started at 19:02
- A DNS change applied at 19:15 resolved the incident
This incident was handled with our highest priority because it affected the availability of the platform for all users on the ans.app domain.
Leadup
At 18:58 on 12 June 2023, 2 minutes before the incident, the addition of a DNS TXT record to our DNS settings led to the unexpected removal of a DNS CNAME record.
This change left the DNS settings misconfigured, which made the platform unavailable for all users on the ans.app domain.
Fault
This DNS misconfiguration decoupled the domain ans.app from the application's load balancer. As a result, the platform became unavailable.
Impact
Between 19:00 and 19:22 the platform was unavailable for all users on the domain ans.app. Users on a custom domain were unaffected.
Detection
The incident was detected when our monitoring system triggered an alert for a possible downtime of our application.
At the time, the Cloudflare status page showed an unrelated incident, which gave the impression that our application was affected by that issue. This delayed finding the root cause of the problem.
A manual inspection of the DNS configuration then revealed a missing CNAME record, which was the root cause of this incident.
Response
After receiving an alert at 19:00, the engineer on call started the investigation at 19:02.
The engineer on call checked the status pages of Amazon Web Services and Cloudflare to see if the incident was caused by one of our vendors. The Cloudflare status page listed an issue affecting multiple components, which suggested that the issue was caused by Cloudflare. The engineer on call posted an update on our status page stating that the issue was caused by Cloudflare.
The engineer on call then received a call from a developer, who reported having made a DNS change by adding a TXT record. The engineer on call continued his investigation by checking whether other websites hosted on Cloudflare were affected, but they were not. This indicated that the problem was caused by the DNS change rather than by Cloudflare.
After checking the DNS records, it became clear that the CNAME record which links https://ans.app to our load balancer was missing. The engineer on call added the missing CNAME record and verified that the fix worked by flushing his local DNS cache. Pingdom also picked up the new DNS record and updated the status page to show that the issue was resolved.
Recovery
Once the misconfiguration in our DNS settings was found, the missing DNS record was added immediately. The application became available again for each user as soon as their DNS cache refreshed, so the exact moment of recovery differed per user.
Timeline
All times are CEST
12th of June 2023
18:58 - A DNS TXT record was added.
18:58 - A DNS CNAME record was automatically removed.
19:00 - Pingdom reported downtime.
19:02 - Engineer on call started the investigation.
19:05 - A possible relation with an open Cloudflare incident was identified.
19:06 - Engineer on call received a call from a developer, who informed the engineer about a DNS addition made 2 minutes prior to the incident.
19:10 - An update was posted on the status page mentioning the relation with an open Cloudflare incident.
19:11 - Engineer on call continued the investigation by checking all DNS records.
19:15 - The missing DNS CNAME record was added.
19:17 - Engineer on call confirmed that the DNS record had been added.
19:22 - Pingdom reported that the application had recovered, which also resolved the incident on the status page.
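The intervals implied by this timeline can be worked out with a short calculation. The sketch below is illustrative only: the timestamps are copied from the timeline, while the labels "time to detect", "time to fix" and "reported downtime" are our own names for those intervals, not terms used by the monitoring tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline above (all CEST, 12 June 2023).
events = {
    "change_made": datetime.strptime("2023-06-12 18:58", FMT),
    "alert": datetime.strptime("2023-06-12 19:00", FMT),
    "fix_applied": datetime.strptime("2023-06-12 19:15", FMT),
    "recovery_reported": datetime.strptime("2023-06-12 19:22", FMT),
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two named timeline events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


# Illustrative labels (assumptions, not official metrics).
time_to_detect = minutes_between("change_made", "alert")          # 2 minutes
time_to_fix = minutes_between("alert", "fix_applied")             # 15 minutes
reported_downtime = minutes_between("alert", "recovery_reported")  # 22 minutes

print(time_to_detect, time_to_fix, reported_downtime)
```

The 7 minutes between the fix at 19:15 and the recovery report at 19:22 reflect the time it took for the restored record to be picked up by the monitoring check.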
Reflection
Adding TXT records to our DNS had been deemed safe, because an addition should never affect existing DNS records. In this incident, however, adding a new record caused an unforeseen misconfiguration of our existing settings. Therefore, any change to DNS settings will from now on follow a fixed procedure and be carried out in a maintenance window.
It took some time to identify the missing DNS CNAME record. To improve the response time for DNS issues, we have set up additional DNS configuration checks, so that the root cause of a DNS misconfiguration can be found faster.
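As an illustration of what such a check could look like, the sketch below compares a snapshot of the DNS records taken before a change with a snapshot taken after it, and reports every difference that was not part of the intended change. It is a minimal sketch and not the check we actually deployed: the `Record` type, both function names, the load balancer hostname `lb.example.net` and the TXT record shown are hypothetical, and fetching the snapshots from the DNS provider is left out.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """One DNS record in a zone snapshot (hypothetical representation)."""
    name: str
    rtype: str
    value: str


def diff_records(before: Iterable[Record], after: Iterable[Record]) -> dict:
    """Return the records removed and added between two snapshots."""
    before_set, after_set = set(before), set(after)
    return {
        "removed": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


def unintended_changes(
    before: Iterable[Record],
    after: Iterable[Record],
    intended_additions: Iterable[Record],
) -> List[Record]:
    """List every removal, plus every addition that was not intended."""
    delta = diff_records(before, after)
    intended = set(intended_additions)
    unexpected_added = [r for r in delta["added"] if r not in intended]
    return delta["removed"] + unexpected_added


# Example values are placeholders, apart from the domain ans.app.
cname = Record("ans.app", "CNAME", "lb.example.net")
txt = Record("_verify.ans.app", "TXT", "example-verification=abc")

snapshot_before = [cname]
snapshot_after = [txt]  # the TXT record was added, the CNAME disappeared

problems = unintended_changes(snapshot_before, snapshot_after, [txt])
for record in problems:
    print(f"Unintended change: {record.rtype} record for {record.name}")
```

Run right after a DNS change, a comparison like this would flag the removed CNAME record directly instead of leaving it to a manual inspection of all records.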
We have also identified an improvement in how we communicate during downtime. Several customers called the emergency phone number, but during an outage we don’t have the capacity to answer all phone calls. In that case, we communicate exclusively via our status page, so that all users are informed simultaneously. We will set up an automated message on our emergency phone number for use during a full outage, acknowledging the issue and referring callers to our status page.