On April 6, 2021, American video game and software developer, Epic Games, experienced a massive outage that resulted in the failure of multiple internal systems, logins, live services, and their game launcher. Root cause analysis exposed the culprit as an expired wildcard TLS certificate used for an internal DNS zone. The expiry, although internal, triggered a chain of events which culminated in both internal and external systems being impacted.
Epic Games rose to the situation. Instead of being cagey about the outage, they did a surprisingly candid self-exposé so that others might learn from their mistakes and avoid similar outages. Here’s what they said (you can read the full report here)-
On April 6, 2021, we had a wildcard TLS certificate unexpectedly expire. It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems. If you or your organization are using certificate monitoring, this may be a good reminder to check for gaps in those systems.
The certificate that expired was used across many internal-facing services at Epic. Too many, in fact. Despite our best efforts to monitor our certificates for expiration, we didn’t fully cover every area where certificates were being used. Following the expiration and renewal of the certificate, a series of unexpected events unfolded that extended the outage.
“Fully covering every area where certificates are used,” or in other words, thorough certificate discovery, is never an easy task. Certificate monitoring tools seldom have full-fledged discovery capabilities, and some certificates always slip through the net. The solution is to have an IP-based scanning tool that integrates with the organization’s DNS and does a lookup to discover certificates. Some certificate lifecycle management tools have this capability built-in, and additionally, integrate with dedicated third-party scanning tools to fulfill this function.
The undiscovered TLS certificate was only half the issue – after discovery, the team had a tough time renewing the expired certificate. Epic Games used AWS ACM (AWS Certificate Manager) to issue and manage certificates, but internal certificates lacked automated renewals. Like many others, Epic Games believed that certificate monitoring would save them from outages, but they learned that this was a fallacy the hard way. A statement made by them goes thus-
Automatic renewals were not enabled for this internal certificate, and the work needed to accomplish this had not been prioritized when identified earlier this year. We have the appropriate systems and services in place to facilitate automatic renewal, but migrating to use these features was not completed in advance of this incident. With our existing monitoring systems, we believed that we were more protected than we actually were from the dangers of certificate expiration. We will be working on moving this certificate and others to automated renewals. In the interim, we have completed a manual audit of all of our certificates.
Automating certificate renewals is one of, if not the most, important part of certificate management. A certificate monitoring tool only alerts on impending expirations (which also it failed to do in this case), and renewing the expiring certificate is usually left to the security team to do manually. Certificates that are not renewed before their expiration always cause outages of some scale. In a situation such as this, where the damage grew exponentially with every passing second of the outage, the criticality of on-time renewals becomes even more pronounced. Some certificate lifecycle management tools have automation built-in; based on the certificate policy, they automatically get a certificate from the CA by generating a new CSR or using an old CSR and install it on the endpoint. Automating certificate renewals is a foolproof way of preventing unplanned expirations and outages.
If there’s anything we can learn from Epic Games’ experience, it is that just certificate monitoring doesn’t cut it, and certificate lifecycle management tools are a must-have to prevent outages in the cloud era. Next-Gen CLM tools like AppViewX CERT+ provide DNS-based smart discovery that discovers certificates deep inside the network and also on edge, mobile, and IoT endpoints. CERT+ has a context-aware workflow automation engine built-in, which does CA-agnostic automated certificate renewals using protocols like ACME, EST, SCEP, and CMP, and also through REST API.