On March 15, 2025, between 7:00 AM and 11:21 AM UTC, approximately 90% of customers experienced issues with the delivery of RUM (Real User Monitoring) scripts due to availability problems with the unpkg.com backend. The incident was caused by unpkg.com experiencing degraded performance, combined with our system's inability to properly handle first-byte timeout errors in the backend fallback mechanism. The issue was identified and resolved by adjusting timeout configurations and routing traffic through Fastly, which provided better control over backend timeouts.
Although monitoring alerts were triggered and approximately 90% of customers were affected, there was no end-user impact. Fewer than one percent of RUM script requests failed during this period, and because the RUM script is loaded as a non-blocking module, a failed load does not affect page rendering. Zero percent of page views were affected: all pages continued to load normally, and end users experienced no degradation in service, performance, or functionality throughout the incident.
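A minimal sketch illustrates why a failed script fetch cannot block rendering. It assumes a loader broadly similar to ours; the function name, package URL, and error handling are illustrative, not our actual snippet:

```typescript
// Minimal sketch of a non-blocking script load; the function name, URL, and
// error handling are illustrative and not the actual RUM snippet.
function loadRumScript(src: string): void {
  const script = document.createElement("script");
  script.type = "module"; // module scripts never block HTML parsing
  script.async = true;    // fetch in the background, execute whenever ready
  script.src = src;
  // If the CDN request fails, the page has already rendered; only telemetry is lost.
  script.onerror = () => console.warn("RUM script failed to load; page unaffected");
  document.head.appendChild(script);
}

loadRumScript("https://unpkg.com/example-rum/dist/rum.js"); // hypothetical package path
```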
In week-to-week comparisons of collected RUM data, the incident is not visible and falls within the normal sampling error. The monitoring alerts that were triggered can be considered false positives from a page-delivery perspective: they require every resource on a page to be delivered and cannot distinguish between blocking and optional resources.
The root cause of this incident was the degraded performance of the unpkg.com backend, combined with a limitation in our fallback mechanism. Our system has built-in redundancy and is designed to switch backends automatically when one fails. However, the fallback mechanism did not treat first-byte timeouts as a backend failure, and that was the specific failure mode in this incident.
The unpkg.com backend had shown intermittent reliability issues in the past, which compounded the problem. When unpkg.com began experiencing degraded performance, our system continued attempting to use it rather than switching to an alternative backend in a timely manner.
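The missing behavior can be illustrated with a small sketch: a fallback loop in which a request that has not produced its first byte within the timeout is treated like any other backend failure, so traffic moves to the next backend. This is an assumption-laden illustration (the backend list, function name, and use of fetch are hypothetical), not our edge configuration:

```typescript
// Hedged sketch of the behavior the fallback mechanism was missing: treat a
// first-byte timeout like any other backend failure and move on to the next
// backend. Backend URLs, the function name, and the timeout handling are
// hypothetical; this is not production routing code.
async function fetchWithFallback(
  path: string,
  backends: string[],
  firstByteTimeoutMs: number,
): Promise<Response> {
  for (const backend of backends) {
    const controller = new AbortController();
    // fetch() resolves once response headers arrive, so aborting before that
    // point approximates a first-byte timeout.
    const timer = setTimeout(() => controller.abort(), firstByteTimeoutMs);
    try {
      const response = await fetch(`${backend}${path}`, { signal: controller.signal });
      if (response.ok) return response; // healthy backend: serve this response
      // Non-2xx responses also fall through to the next backend.
    } catch {
      // Network errors and timeout aborts fall through to the next backend.
    } finally {
      clearTimeout(timer);
    }
  }
  throw new Error(`All backends failed for ${path}`);
}
```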
Our team implemented a two-part solution to resolve the issue:
* First, we set the unpkg.com backend timeout to 1500 ms, which reduced the impact but did not fully resolve the issue.
* We then reduced the timeout to 1 ms, which effectively disabled the unpkg.com backend and routed all traffic through our alternative backends (see the sketch after this list).
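In terms of the hypothetical fallback sketch above, the two steps amount to tightening, then effectively zeroing, unpkg.com's first-byte timeout. The configuration shape below is assumed purely for illustration; only the 1500 ms and 1 ms values come from the incident:

```typescript
// Hypothetical per-backend timeout settings; only the 1500 ms and 1 ms values
// are taken from the mitigation steps, everything else is illustrative.
interface BackendTimeouts {
  backend: string;
  firstByteTimeoutMs: number;
}

// Step 1: give slow unpkg.com responses a short window before failing over.
const step1: BackendTimeouts = { backend: "unpkg.com", firstByteTimeoutMs: 1500 };

// Step 2: a 1 ms first-byte timeout can never realistically be met, so every
// request to unpkg.com fails immediately and is served by alternative backends.
const step2: BackendTimeouts = { backend: "unpkg.com", firstByteTimeoutMs: 1 };
```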
These changes resolved the issue, and all monitoring alerts returned to normal shortly after implementation. After confirming that the system was stable, we gradually reverted to the original traffic split and backend timeout configuration while continuing to monitor system performance closely.
Our monitoring systems detected the issue quickly, allowing a rapid response. A war room was established within minutes of the initial alert, and the team implemented mitigating measures within approximately one hour of detection.
While the monitoring system alerted us to the issue as intended, the alerts were overly cautious: they notified customers of an issue that had no end-user impact, which created unnecessary concern among customers.
We have identified the following key action items to prevent similar incidents and improve our response capabilities: