Around midnight (UTC) on Tuesday, September 26th, 2023, we received reports from users that their preview sites on *.hlx.page or *.hlx.live, the Adobe Experience Manager homepage on www.hlx.live, or even the status page on status.hlx.live were unavailable. These reports were scattered, inconsistent, and could not be reproduced by the on-call engineers. Over the next 24 hours it became clear that they were related to a wider Domain Name System (DNS) issue that affected the DNS provider for our services.
Our DNS provider declared the issue resolved, which led us to post a status update telling our customers the same. Over the next few days, as reports of DNS resolution issues continued to trickle in, it became clear that this was not the case, and a new incident was raised. Our monitoring showed elevated DNS resolution times of about 600 milliseconds (ms) that, while well below Tuesday’s peak of 2500 ms, were still significantly above the prior baseline of 200 ms.
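To make the resolution-time figures above concrete, here is a minimal sketch of how such a measurement could be taken and compared against the 200 ms baseline. It is not the monitoring we actually run: the names `resolve_time_ms` and `classify` and the "2x baseline" threshold are assumptions made for illustration, and timings taken through the operating system resolver can be masked by local caches.

```python
import socket
import time
from typing import Callable, Optional

BASELINE_MS = 200.0    # pre-incident baseline quoted in the text
DEGRADED_FACTOR = 2.0  # hypothetical threshold: flag anything above 2x baseline


def resolve_time_ms(hostname: str,
                    resolver: Optional[Callable[[str], object]] = None) -> float:
    """Return the wall-clock time, in milliseconds, taken to resolve hostname.

    By default this goes through the OS resolver via socket.getaddrinfo,
    so a locally cached answer will look much faster than a fresh lookup.
    A custom resolver callable can be passed in, e.g. for testing.
    """
    if resolver is None:
        resolver = lambda host: socket.getaddrinfo(host, 443)
    start = time.perf_counter()
    resolver(hostname)
    return (time.perf_counter() - start) * 1000.0


def classify(elapsed_ms: float,
             baseline_ms: float = BASELINE_MS,
             factor: float = DEGRADED_FACTOR) -> str:
    """Label a measurement 'degraded' if it exceeds factor * baseline."""
    return "degraded" if elapsed_ms > baseline_ms * factor else "ok"
```

With these assumed thresholds, both the 600 ms readings and Tuesday’s 2500 ms peak would be classified as degraded, while a 200 ms reading would not. Calling `classify(resolve_time_ms("www.hlx.live"))` from several regions would approximate a global check.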
This continuing incident was ultimately resolved on September 28th, around 6pm UTC, when the DNS provider for our services was replaced entirely with new, globally distributed DNS infrastructure. Because changes to the start of authority (SOA) record take a while to propagate, we declared the incident resolved a few hours later, after verifying that our services could be resolved and reached globally.
To deliver customer sites, we rely on a second layer of customer-managed content delivery network (CDN) infrastructure. These CDNs have more robust DNS resolvers that can tolerate longer DNS resolution times than most resolvers used by Internet Service Providers (ISPs). This, combined with the high cache efficiency of AEM resources, led to a very limited impact on the availability of customer sites. Despite this limited impact, we treat this DNS disruption as a major incident and are taking a number of fundamental steps to prevent a recurrence: