Imagine you inherit a library. The previous librarian retired. Before her, the one before her also retired. Before that, two more. Nobody left a complete catalog. There are books on shelves you haven't found yet. There are books out on loan to people who left the city years ago. There are entire rooms locked, because nobody remembers where the keys are. You are now the librarian.
And the library is not even static. There is a Library Forum - a body you did not elect and cannot ignore - that periodically rewrites the rules for everyone. This year it decides every book must be re-bound into 90-page batches, so volumes that used to sit untouched for years now come back for rework every few months. It re-arranges the labelling system, so spines that scanned fine yesterday no longer match the catalog. Now and then it strikes a publisher off the approved list entirely, and overnight every book from that publisher has to come off the shelves - whether or not anything was wrong with the copy in your hands. And the emergencies arrive without notice: a set of master keys is found copied, so a whole wing has to be re-locked at once; a binding glue turns out to be unsafe, and every book that used it is recalled the same week. None of this is your doing. All of it lands on you. That is the CA/B Forum shortening certificate lifetimes, browser vendors reshaping their trust stores, a CA being distrusted, keys compromised, and mass revocations - the external machinery of PKI and web security, moving on its own schedule, not yours.
That is what managing TLS certificates at scale feels like - once you cross a few hundred, and even more so once you are into the thousands.
It is not difficult in the way that, say, distributed consensus is difficult. It is hard in a more boring way: there is too much of it, the pieces are scattered, the people who put them there are gone, and the small mistakes compound into expensive ones.
Here's a framework that works. Not glamorous. Not novel. But it survives contact with reality, which is more than most certificate management diagrams can claim.
The four hard parts
When teams say "managing certificates is hard," they're usually mixing up four separate problems. It helps to pull them apart.
One: inventory. You don't know what you have. Not really. You think you do. You don't.
Two: ownership. For every cert you do know about, who's responsible when it goes wrong at 2am? Often the answer is "Bob - but Bob left in 2023."
Three: renewal coordination. Many certs renew automatically. Many of them fail silently. Some renew but get installed in the wrong place. Some renew but the web server doesn't reload. Automation is necessary but not sufficient.
Four: posture, not just expiry. A cert can be unexpired and the endpoint serving it can still be broken - weak ciphers on the server, a hostname the cert does not cover, an incomplete chain. Expiry is the noisy failure mode. The others are silent, and they bite when an auditor or an attacker notices before you do.
Solve all four. Solving any three leaves you exposed.
Building the inventory you don't have
Most cert inventories are wrong. Specifically, they're incomplete in three predictable ways.
The spreadsheet drift problem. The spreadsheet was right when someone built it. Then microservices spun up new subdomains. Then marketing launched a campaign domain. Then a contractor put up a status page. Each of these added certs. Nobody updated the sheet.
The internal PKI shadow. Your external cert count is usually the smaller number. Certs issued by internal CAs - the one for your service mesh, the one for your VPN, the one for the development environment - often outnumber external certs ten to one.
The inherited acquisitions. When you bought that startup last year, you bought their cert mess too. Their renewal calendar didn't transfer. Their cert email distribution list points to inactive mailboxes. Their staging certs are still being renewed by an automation nobody can find the source code for.
The fix is not "a better spreadsheet." It's continuous discovery from multiple angles:
- Crawl all known domains and find subdomains via Certificate Transparency logs (a free public stream of every cert issued by trusted public CAs).
- Pull from your cloud accounts (AWS ACM, Azure Key Vault, GCP Certificate Manager) on a schedule.
- Pull from your CDN (Cloudflare, Fastly) and load balancers.
- Pull from your internal CAs (Vault PKI, ADCS, EJBCA, step-ca).
Reconcile. The list you actually own is the union of all of these, not any one source.
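The reconcile step is where most of the value sits, so here is a minimal sketch of it. It assumes each source has already been exported to a list of records with a fingerprint and a set of names; the source labels ("ct", "aws_acm") and the record shape are made up for illustration, not the output format of any real tool.

```python
def normalise_fingerprint(raw):
    """Strip colons and case so 'AA:BB' and 'aabb' compare equal."""
    return raw.replace(":", "").strip().lower()


def reconcile(sources):
    """Union cert records from several sources, keyed by fingerprint.

    `sources` maps a source label to a list of dicts with "fingerprint"
    and "names" keys. Labels and record shape are assumptions.
    """
    merged = {}
    for label, records in sources.items():
        for record in records:
            key = normalise_fingerprint(record["fingerprint"])
            entry = merged.setdefault(key, {"names": set(), "seen_in": set()})
            entry["names"].update(name.lower() for name in record["names"])
            entry["seen_in"].add(label)
    return merged


def seen_only_in(merged, label):
    """Fingerprints that exactly one source knows about."""
    return sorted(fp for fp, entry in merged.items() if entry["seen_in"] == {label})
```

A cert that shows up only in the Certificate Transparency feed and in none of your own accounts is the interesting case: somebody issued it, and nothing you manage is holding it. `seen_only_in(merged, "ct")` is one way to list those.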
Once a quarter, walk this list with a real human and ask one question: "do we still need this cert?" Half of them, you don't. Sunset the dead ones. A smaller, accurate inventory beats a bigger, hopeful one every single time.
Ownership without single points of failure
Every cert needs an owner. This sentence sounds obvious and turns out to be the hardest part of the whole problem.
The bad pattern: a cert is "owned by" a person. Their personal email gets the renewal notices. Their calendar holds the reminder. When they leave, the cert orphans. Six months later it expires. Surprise outage. Surprise blame meeting. Surprise post-mortem.
The good pattern: a cert is owned by a team, represented by a shared mailbox or a Slack channel that survives turnover. The renewal notice goes to the team. Two or three people see it. Someone takes the ticket. Even if that someone is on holiday, the next person picks it up.
There's a third pattern that looks good but isn't: putting every cert under one central "PKI team" that owns everything. This works at the smallest scale and breaks at the largest. The PKI team becomes the bottleneck, starts saying no to fast-moving product teams, and certs get spun up outside the official process. You're back to shadow inventory, but now with politics.
The right answer is federation. Each product team owns its own certs. The PKI team owns the policy, the monitoring, and the audit trail. Like the difference between owning a car and maintaining the road network - different responsibilities, same goal of nobody crashing.
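One way the PKI team can enforce "owned by a team, not a person" is a policy check over the inventory. The sketch below assumes inventory records carry a "name" and an "owner" field and that there is a registry of approved team contacts; both are hypothetical shapes, not something the framework prescribes.

```python
def ownership_gaps(inventory, team_contacts):
    """Return (cert name, reason) pairs for certs without a team owner.

    `inventory` is a list of dicts with "name" and optional "owner".
    `team_contacts` is a set of shared mailboxes or channels that
    survive turnover. Both shapes are assumptions for this sketch.
    """
    gaps = []
    for cert in inventory:
        owner = cert.get("owner")
        if not owner:
            gaps.append((cert["name"], "no owner recorded"))
        elif owner not in team_contacts:
            # Anything outside the registry is treated as a personal
            # address, which is the pattern that orphans certs.
            gaps.append((cert["name"], "owner is not a registered team contact"))
    return gaps
```

Run on a schedule, a check like this turns "Bob left in 2023" from a 2am discovery into a routine ticket.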
Renewals - the operational reality
You will automate as much as you can. This is correct. But automation alone is not a strategy. Three things matter on top.
Independent monitoring. Your renewal script needs an outside checker. If the script runs but the cert never gets installed, the script will report success and the production endpoint will quietly serve an expired cert. The only way to catch this is a check that doesn't trust the renewal pipeline - something that talks to your endpoint from the outside, the way a real user would.
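A minimal sketch of such an outside check, using only Python's standard library. It connects the way a client would, reads the expiry of the cert that is actually being served, and works out the days left. The function names and the five-second timeout are choices made for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=5.0):
    """Handshake with the live endpoint and return the served cert's notAfter.

    The default context verifies chain and hostname, so an expired,
    mismatched, or untrusted cert raises ssl.SSLError here. Treat that
    exception as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after, now=None):
    """Whole days between `now` and a notAfter string like 'Jun  1 12:00:00 2030 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```

The point of the design is that nothing here reads the renewal pipeline's logs or state. Something like `days_remaining(fetch_not_after("example.com"))` only knows what the endpoint serves, which is exactly what a user's browser knows.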
Renewal windows, not deadlines. A cert that expires next Tuesday is not a Monday-evening problem. It's a "renew it three weeks ahead, observe the new cert in production for a few days, then let the old one expire" problem. Treat the expiry date as the back wall, not the target.
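As a rough illustration of window-based thinking, the sketch below maps time-to-expiry onto a state. The three-week renewal lead comes from the paragraph above; the seven-day escalation threshold and the state names are assumptions picked for the example.

```python
from datetime import datetime, timedelta, timezone


def renewal_state(not_after, now,
                  renew_ahead=timedelta(days=21),
                  escalate_ahead=timedelta(days=7)):
    """Classify a cert by how close it is to the back wall.

    "renew"    -> inside the normal renewal window, automation should act
    "escalate" -> renewal should already have happened, a human looks
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= escalate_ahead:
        return "escalate"
    if remaining <= renew_ahead:
        return "renew"
    return "ok"
```

The gap between "renew" and "escalate" is the observation period: time for the new cert to sit in production while the old one is still valid.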
Graceful failure paths. When automation fails - and it will - what happens? Does anyone know? Does the alert go to a team? Is there a runbook? "Re-run the script" is not a runbook.
Beyond expiry - the silent failure modes
If you only monitor expiry, you are missing the failure modes that cost the most.
Chain breaks. An intermediate CA changes, or the server stops sending the full chain. Your cert still validates in browsers that can fetch or already hold the missing intermediate, and fails in clients that cannot. The error rate creeps up. Customer service tickets blame "the internet."
Weak ciphers. A cipher gets deprecated. Your cert is still valid; your server still offers only the old cipher suites; modern clients refuse to negotiate. Mobile users start failing first because their TLS libraries update faster than your servers.
Hostname mismatches. A new subdomain points to a wildcard cert that doesn't cover it. Or a SAN list got truncated during a renewal. The cert is fine. The match isn't.
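The wildcard case trips people up because a wildcard covers exactly one label. Here is a simplified sketch of that rule. It handles only a full-label wildcard in the leftmost position and ignores internationalised names and IP addresses, so it is an illustration of the idea rather than a complete validator.

```python
def name_matches(pattern, hostname):
    """True if a SAN entry such as '*.example.com' covers `hostname`."""
    p = pattern.lower().rstrip(".").split(".")
    h = hostname.lower().rstrip(".").split(".")
    if len(p) != len(h):
        # A wildcard stands in for one label, never zero and never two.
        return False
    if p[0] == "*":
        return p[1:] == h[1:]
    return p == h


def covered(hostname, san_list):
    """True if any SAN entry on the cert covers `hostname`."""
    return any(name_matches(pattern, hostname) for pattern in san_list)
```

With a SAN list of `["*.example.com"]`, `api.example.com` is covered, while `v2.api.example.com` and the bare `example.com` are not. Comparing every hostname in your inventory against the SAN list of the cert it actually serves catches both the new-subdomain case and the truncated-SAN case.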
Misissuance. A CA issues a cert for one of your domains to someone who is not you. Certificate Transparency tells you about it, but only if you're watching the logs.
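Watching the logs can start as a simple filter: take the CT entries for your domains and flag any whose issuer is not one you use. The sketch below assumes the entries have already been fetched and look like dicts with "name_value" and "issuer_name" fields; that shape and the allowlist of issuer markers are assumptions for illustration.

```python
def unexpected_issuance(entries, expected_issuers):
    """Return CT entries whose issuer matches none of the expected markers.

    `entries` is a list of dicts with "issuer_name" (a distinguished
    name string) and "name_value" (the DNS names on the cert).
    `expected_issuers` is a list of substrings such as an organisation
    name. Both shapes are assumed, not taken from a specific CT API.
    """
    flagged = []
    for entry in entries:
        issuer = entry["issuer_name"]
        if not any(marker in issuer for marker in expected_issuers):
            flagged.append(entry)
    return flagged
```

A flagged entry is not proof of misissuance. It may be a team that picked a different CA without telling anyone, which is still worth knowing about.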
The expiry date is the loud part of the problem. Everything else is quiet, and quiet is what gets through.
What good looks like
When you've built this right, three things are true.
You can answer "how many TLS certificates do we have, and where are they?" in five minutes. Not days. Not after an emergency Slack thread. Five minutes.
When a cert is going to expire, three different people on the right team know about it ten days in advance. Not one person. Not a calendar invite in someone's deactivated account.
When the upstream goes wrong - a CA has a bad day, a chain changes, an intermediate gets revoked - your monitoring catches it before your customers do.
This is not glamorous. It is mostly plumbing. But it is the difference between cert management as a quiet background process and cert management as a recurring source of 2am alerts.
TLS Radar is built around this framework: external monitoring, complete inventory from multiple sources, team-based alerts, and checks that go beyond expiry to chain, cipher, hostname, and revocation. The free tier covers three domains, which is enough to test whether the model fits. The paid tiers scale to organisations where the math on building this yourself stops adding up.
One small ask
If you cannot answer "where are all our certs?" in five minutes, that is your starting point. Everything else builds on it.