Between February 15th and February 16th, 2021, Google Voice, Google’s telephone service, experienced a worldwide outage that prevented its users from making and receiving Voice over Internet Protocol (VoIP) calls for over four hours. On February 28th, 2021, Google published an incident report identifying the cause of the outage: an expired TLS certificate.
Google’s root cause analysis reads as follows:
Google Voice uses the Session Initiation Protocol (SIP) to control voice calls over Internet Protocol. During normal operation, Google Voice client devices aim to maintain continuous SIP connection to Google Voice services. When a connection breaks, the client immediately attempts to restore connectivity. All Google Voice SIP traffic is encrypted using Transport Layer Security (TLS). The TLS certificates and certificate configurations used by Google Voice frontend systems are rotated regularly.
Due to an issue with updating certificate configurations, the active certificate in Google Voice frontend systems inadvertently expired at 2021-02-15 23:51:00, triggering the issue. During the impact period, any clients attempting to establish or reestablish an SIP connection were unable to do so. These clients were unable to initiate or receive VoIP calls during the impact period. Client devices with an SIP connection that was established before the incident and not interrupted during the incident were unaffected.
Where do certificates feature here? As Google stated, Google Voice SIP traffic is encrypted using TLS, and TLS relies on certificates to authenticate an endpoint before a session begins. In this case, the Google Voice frontend systems present a certificate that each client validates during the TLS handshake. Once that certificate expired, clients could no longer validate it, so the handshake failed and no new sessions could be established. Connections established before the expiry kept working because the certificate is checked only when a connection is set up.
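To make the failure mode concrete, here is a minimal Python sketch of the validity-window check that a TLS client applies to a certificate when a connection is set up. The `is_within_validity` helper and the `NOT_BEFORE` date are assumptions for illustration; `NOT_AFTER` reuses the expiry time from Google’s report, with the time zone assumed to be GMT. Real TLS libraries perform this check internally, alongside chain and hostname validation.

```python
import ssl
from datetime import datetime, timezone


def is_within_validity(not_before: str, not_after: str, now: datetime) -> bool:
    """Return True if `now` falls inside the certificate's validity window.

    Timestamps use the text format found in certificate dicts returned by
    ssl.SSLSocket.getpeercert(), e.g. "Feb 15 23:51:00 2021 GMT".
    """
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    return start <= now.timestamp() <= end


# Hypothetical certificate: the issue date is invented, and the expiry time
# from the incident report is assumed to be GMT for illustration.
NOT_BEFORE = "Nov 15 23:51:00 2020 GMT"
NOT_AFTER = "Feb 15 23:51:00 2021 GMT"

one_minute_before = datetime(2021, 2, 15, 23, 50, 0, tzinfo=timezone.utc)
one_minute_after = datetime(2021, 2, 15, 23, 52, 0, tzinfo=timezone.utc)

print(is_within_validity(NOT_BEFORE, NOT_AFTER, one_minute_before))  # True
print(is_within_validity(NOT_BEFORE, NOT_AFTER, one_minute_after))   # False
```

A client that reconnects at the second timestamp would reject the certificate, which matches the behavior described in the report: new connections fail while untouched existing ones carry on.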
Expired certificates have been the culprit behind a long line of high-profile outages and data breaches, such as those at Equifax, LinkedIn, and Ericsson. Certificate expiry isn’t a problem in itself – certificates have a set validity period beyond which they are unsafe for use due to the risk of keys getting compromised. The problem arises when organizations fail to renew certificates on time. This typically happens for one of three reasons:
- The alerting system is faulty – it does not send repeated reminders when a certificate is approaching its expiration date.
- The system sends reminders, but the details are grossly inadequate – they do not include the certificate’s location, type, issuing CA, etc. Manually finding out these details and renewing the certificate is a long process.
- The certificate renewal process is manual. In this case, the alerting system works just fine, but on receiving the alert, security engineers need to raise a CSR manually, download the certificate, and provision it to the endpoints. This process can easily take a few hours, during which time the application stays down.
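
The first two failure modes above come down to when reminders fire and what they contain. The following Python sketch shows one way an alert could repeat as expiry approaches and carry the location, type, and issuing CA. The `CertRecord` type, the reminder schedule, and the sample values are all assumptions for illustration, not any particular product’s behavior.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

REMINDER_DAYS = (30, 14)  # assumed milestone reminders, in days before expiry
DAILY_WINDOW = 7          # assumed: remind every day during the final week


@dataclass
class CertRecord:
    common_name: str
    location: str     # host or device where the certificate is installed
    cert_type: str
    issuing_ca: str
    expires_on: date


def build_alert(cert: CertRecord, today: date) -> Optional[str]:
    """Return an alert message if one is due today, otherwise None."""
    days_left = (cert.expires_on - today).days
    if days_left < 0:
        status = f"EXPIRED {-days_left} day(s) ago"
    elif days_left in REMINDER_DAYS or days_left <= DAILY_WINDOW:
        status = f"expires in {days_left} day(s)"
    else:
        return None
    # Include everything an engineer needs to act without further digging.
    return (
        f"[{status}] {cert.common_name} | type={cert.cert_type} | "
        f"location={cert.location} | CA={cert.issuing_ca} | "
        f"notAfter={cert.expires_on.isoformat()}"
    )


# Hypothetical inventory entry.
cert = CertRecord(
    common_name="voice.example.test",
    location="frontend-lb-01",
    cert_type="TLS server",
    issuing_ca="Example CA",
    expires_on=date(2021, 2, 15),
)

print(build_alert(cert, date(2021, 1, 16)))  # 30-day milestone reminder
print(build_alert(cert, date(2021, 1, 20)))  # None: no reminder scheduled
print(build_alert(cert, date(2021, 2, 16)))  # expired yesterday
```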
With endpoints and applications seeing exponential growth on one hand and certificate life spans getting shorter by the year on the other, it’s curious that even large enterprises such as Google are still grappling with the age-old problem of unforeseen certificate expirations.
What can organizations do to eliminate outages due to expired certificates? The solution is to improve visibility into the certificate landscape and automate certificate lifecycle management. That, of course, is easier said than done, as an overwhelming number of organizations still manage their certificates on spreadsheets or use primitive, home-grown tools. Some organizations do use dedicated certificate management tools that provide advanced certificate monitoring and alerting, but these tools lack automation, which is a major shortcoming.
Below are three key steps organizations can take to proactively eliminate certificate expiry-related outages:
Ramp up visibility: The only way to make sure no certificate expires unnoticed is to have a thorough scanning process that brings every certificate to light. The scanning should go both broad and deep – exposing certificates of applications and devices outside the network perimeter as well as certificates buried deep within the network. The scanning tool should also provide details about each certificate, such as its location, CA, and expiration date.
Embrace automation: Go beyond alerting and automate the response to approaching expirations. Automating certificate lifecycle management – enrollment, provisioning, renewal, and revocation – keeps certificates up to date and effectively eliminates expiry-related outages. Processes such as policy management can also be automated for better security. Automation also enables cryptographic agility: organizations can keep pace with protocol and algorithm upgrades and maintain strong protection as standards change.
Enable self-servicing: Allow application and network teams to self-service certificate provisioning, renewal, and revocation so they can move fast in emergencies. PKI teams may be wary of letting other teams handle something as sensitive as certificates, but role-based access controls and privileges can limit what those teams can see and do, keeping certificates and keys well protected.
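The visibility and automation steps above can be sketched together in Python: a scan that records where each certificate lives, who issued it, and when it expires, followed by a loop that renews anything close to expiry. This is a rough illustration under stated assumptions, not a full discovery or lifecycle tool. The `summarize_cert`, `scan_endpoint`, and `renew_due` names are hypothetical, the 30-day window is arbitrary, and `renew` is a placeholder callback standing in for CSR generation, issuance, and provisioning.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone
from typing import Callable, List


def summarize_cert(cert: dict, location: str) -> dict:
    """Flatten the dict returned by SSLSocket.getpeercert() into an inventory row."""
    # "issuer" and "subject" are tuples of RDNs, each a tuple of (key, value) pairs.
    issuer = {k: v for rdn in cert.get("issuer", ()) for k, v in rdn}
    subject = {k: v for rdn in cert.get("subject", ()) for k, v in rdn}
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return {
        "location": location,
        "subject": subject.get("commonName", "unknown"),
        "issuing_ca": issuer.get("organizationName", "unknown"),
        "expires": expires,
    }


def scan_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Visibility: fetch the certificate an endpoint presents and summarize it."""
    location = f"{host}:{port}"
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return summarize_cert(tls.getpeercert(), location)
    except ssl.SSLCertVerificationError as exc:
        # An expired or untrusted certificate fails the handshake; record why.
        return {"location": location, "error": exc.verify_message}


def renew_due(
    rows: List[dict],
    now: datetime,
    renew: Callable[[dict], datetime],
    window: timedelta = timedelta(days=30),
) -> List[str]:
    """Automation: renew every certificate that expires within `window`.

    `renew` is a hypothetical callback that performs the renewal and
    returns the new expiry time.
    """
    renewed = []
    for row in rows:
        if "expires" in row and row["expires"] - now <= window:
            row["expires"] = renew(row)
            renewed.append(row["location"])
    return renewed
```

In use, the rows would come from something like `[scan_endpoint(host) for host in hosts]`, where `hosts` is the organization’s own endpoint list, and `renew_due` would run on a schedule. Rows that carry an `error` instead of an expiry date are skipped by the renewal loop and would need separate handling, since a verification failure usually means the certificate is already expired or untrusted.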
AppViewX CERT+ protects your applications and endpoints by automating certificate lifecycle management. CERT+ gives you complete visibility into your encryption key infrastructure, enabling you to predict and prevent outages. It tracks certificates in real time, provides a unified view of their statuses, endpoint locations, and respective CAs, and sends periodic alerts when a certificate nears its expiry.