Twitter has an internal root CA problem (opens in new tab)

(izzodlaw.com)

117 pointsloriverkutya3y ago76 comments

76 comments

50 comments · 14 top-level

DrScientist3y ago· 12 in thread

If this is true - who knows - then it reflects rather badly on the people who were fired - as they didn't implement safeguard for a 'run over by a bus' scenario when they were in charge.

rsynnott3y ago

It's normal to plan for scenarios where you abruptly lose some people. It's... less normal to plan for scenarios where you abruptly lose basically everybody; in most cases where that happens the company is basically dead anyway, so they're arguably not worth planning for.

Say you're planning, well, _anything_, and someone says "but in five years, a weird billionaire might buy the company and mismanage it to such an extent that your contingency plans don't work". There's a good argument that the proper response is that (a) that is largely the weird billionaire's problem and (b) that it is impossible to defend against an arbitrarily incompetent speculative future weird billionaire.

If someone takes a hammer to an electricity distribution board and electrocutes themselves, the normal response is not "well, that's the electrician's fault; they should have thought of that".

If true, this would "reflect rather badly" on exactly one person. But, y'know, it'll need to join a rather long queue of poorly reflecting things.

zer0tonin3y ago

There's "run over by a bus" and "90% of the company got ran over by a bus" scenarios. The second one is rarely worth implementing.

a28002763y ago

Not to forget the "90% get fired my an egotistical maniac who expressed his distain by quite publicly calling them lazy useless pieces of shit" scenario. That scenario is also seldom considered.

walrus013y ago

There's also possibility of: "If this person hadn't been fired, they could use some other form of credentials within twitter's internal systems plus a passphrase they have memorized to login to the private-key-repository system where the credentials for the root CA are stored and retrieve them. But as they were fired abruptly they are not inclined to help Musk. And nobody has asked them".

1 more reply

nindalf3y ago

It reflects rather badly on you that you're talking mad shit without knowing their circumstances. What's the bus factor on your systems? Can they handle literally every person being fired overnight?

fredoralive3y ago

Or they had a "run over by a bus" scenario that assumed that the entire team wasn't going to be run over by a bus all at the same time?

tedivm3y ago

I've worked at companies where we've had policies about how much of the company can travel together (ie, how many people can be on the same plane). If your entire team is fired then the company is the problem, not the team.

delfinom3y ago

It's management job to plan for run over by a bus scenario, not the people whose job is to actually implement things.

dragonwriter3y ago

There’s “someone got run over by a bus outside of our control” and “The people in charge direct a bus to run over everyone covering a key function”. You don’t really plan for the latter scenario when you are in charge, instead, you just don’t direct a bus to do that. If your successor decides to do that, that’s…on them.

raizer883y ago

Maybe building it right cost 5x, and you have a budget for 1x. Sometime money is not unlimited even at FAANG

DoctorOW3y ago

To add on, people forget that Twitter was never really FAANG. It not only wasn't profitable but had no monetization plan for years. I'm sure it paid off for all the investors who got Elon's money but even as a Facebook competitor they don't have Facebook money.

2 more replies

ilyt3y ago

We have 7 racks and 3 people working in ops and built Puppet setup "right". It's not hard. And their setup was probably right too.

Just that nobody plans for "bus hit our entire ops team"

1 more reply

8organicbits3y ago· 8 in thread

I'll take the rumor with a grain of salt, but can anyone unpack what the recovery plan would be for something like this? It would obviously be a big problem, but where would you even start?

jon-wood3y ago

Assuming they’ve still got access to the servers themselves via SSH, you’d start by issuing a new root CA cert for the Puppetmaster and putting that in place, then you’ve got to issue a new cert for every client and distributing those. It’s not impossible, but it’s also going to be a pain in the backside to do.

threeseed3y ago

If you read through the guide [1] it requires you to have sudo access to bounce the puppet process on the client nodes.

This is because the whole idea is that you have inaccessible, locked down Production servers that only Puppet (which is driven from a central, governed configuration management source) has authority to configure i.e. no SSH and no root access.

Thus leaving the only option being to physically visit each server at the datacenter and issue the commands.

[1] https://www.puppet.com/docs/puppet/5.5/ssl_regenerate_certif...

justsomeadvice03y ago

Been there before, we did exactly this; except over OOB+reboot-into-single-user (because SELinux). Took us a few days (~5k servers) but managed to get out of it with no public-facing downtime. The other way would have just been to rekick the world one box at a time. A number of integration tests were added after that disaster :)

1 more reply

hayst4ck3y ago

It's nearly impossible to predict recovery without understanding the system. You would probably need to know how ssh is configured, how secrets are managed, and how files are distributed, both before and after puppet.

Circular dependencies can absolutely wreck you. For example, puppet could configure sudoers, and without puppet config being applied people who would normally expect access might not have it. So now you have to find a privileged ssh key for un-configured machines.

I would be surprised if twitter did not have a physical vault with a USB drive with a root SSH key on it. With that you can do just about everything.

I would be most terrified of machine churn. Auto-remediation systems or elastic capacity systems can result in lost capacity that can't come back until the configuration problem is resolved.

ilyt3y ago

Create new root CA, ssh to machine, remove old certs, re-add machine to Puppet, sign the new CSR on Puppet master, then it will download new root.

Very simple operation... if you have working SSH access with root. If they don't, well...

walrus013y ago

If you don't have ssh access with root, hopefully you have access to something like the underlying hypervisor, to do the equivalent of "sudo xl console vmname" on a xen dom0 to get what is logically the same as a physical serial tty (or local vga+keyboard) console on the domU machine.

Or the VMware esxi emulated graphical console, etc.

Or if it's a bunch of bare metal machines, hopefully someone old-school in the organization thought to deploy 48/96-port rs232 console serial concentrators and wire them up to the db9 serial port on each physical server. And you didn't disable all local serial tty in your operating config.

2 more replies

yuppie_scum3y ago

If they're in the cloud, it's pretty straightforward to re-mount the drive somewhere else and replace the SSH keys.

1 more reply

yuppie_scum3y ago

Depends how disposable the individual servers are. I don't know specifics of the Twitter infra, but I would probably just issue a new cert and begin shooting and replacing the old servers. Hopefully the services are abstracted from the puppet cert and things like Redis and whatnot will safely reprovision and find their quorums.

parasense3y ago· 7 in thread

> Musk fired everyone with access to the private key to their internal root CA,

The way forward is to generate a new CA root certificate.

> and they can no longer run puppet because the puppet master's CA cert expired

They can reconfigure internal tools to use the new CA root certificate, or rather one of the signed intermediate certificates.

> and they can't get a new one because no one has access.

They can simply generate new CA root certificates, and sign or create new intermediate certificates.

> They no longer can mint certs.

Yes, they, can...

> My limited understanding in this area is that this is...very bad

No, it, is, not...

There are two immediate issues that come to mind.

* Twitter was so awful before, that it relied on people to safeguard the keys to the kingdom. This is very bad practice, and one of the many things Musk will no doubt be fixing. For any mission critical assets, and especially certificates, but also passwords... current modern day corporate practice is to have a secure ledger of these that can be accessed by the board of directors, the executive managers, and designated maintainers. At no point ever should the password be entrusted to anybody, but rather a "role" that functions as the one who has access. Say for example, the CIO/CTO and their subordinates.

* The Second issue is the one everyone is fixating upon, and that's firing important people who put the company at risk. This is a big issue, and certainly Musk could have done a better job of scoping out who represents a single-point-of- failure at twitter, eliminate that risk, and then proceed with the culling. In a modern enterprise no single person should be capable of putting the entire operation at risk. It's just that simple. So in a way, Musk accelerated what was probably inevitable at Twitter already. They were probably precariously close to destruction already, and now they can learn the hard way of not repeating these mistakes.

spaced-out3y ago

>Twitter was so awful before, that it relied on people to safeguard the keys to the kingdom. This is very bad practice...that can be accessed by the board of directors, the executive managers, and designated maintainers.

LOL, you realize all the PEOPLE you list as the PEOPLE who should be able to manage the keys to the kingdom are PEOPLE? Board of directors - fired on day one of Musk takeover, executive managers - many fired one day one by Musk as well, designated maintainers - for all we know they could have been fired in the purge or quit when Musk offered the 3 month severance.

All system require people to run.

MichaelZuo3y ago

According to the latest publicly available sources there are still many hundreds of folks on the active payroll at Twitter, do you know of any evidence to the contrary?

Volundr3y ago

Significant edits for clarity.

Serious question... How do I build a system that grants access to a company role not a person? In other words, the CIO is fired, how does this system ensure that the new CIO can access it, and the old one no longer can?

If we tie it to the HR system, whoever admins that effectively has the keys to the kingdom. Same for Active Directory or any other technical solution.

caltheon3y ago

Something like the nuclear football is probably the only answer. Something very obvious that is transferred with the role

1 more reply

ryan_lane3y ago

> For any mission critical assets, and especially certificates, but also passwords... current modern day corporate practice is to have a secure ledger of these that can be accessed by the board of directors, the executive managers, and designated maintainers. At no point ever should the password be entrusted to anybody, but rather a "role" that functions as the one who has access. Say for example, the CIO/CTO and their subordinates.

Maybe in hacker movies. In real life, you try your best to avoid anyone having access to keys or passwords, and rely on HSMs, cloud KMS, secret services, etc. Access to those things is controlled by your security team, with multi-factor authentication, often stored in safes, with alerts being fired when they are used (because they should never be used). The audit logs that trigger these alerts should be written in WORM storage, so you can track access back down to individuals, and so that you know when you need to rotate secrets accessed by humans. Ideally your CA infrastructure automatically rotates and distributes.

There's absolutely no way in hell you should allow your board to have access to these things.

Most companies slowly work their way towards full automation, and until that happens, your security team usually owns manual rotations of critical systems like this. Only a fucking moron would fire all of these people.

foolfoolz3y ago

when building secure systems one of the key principles is assume someone will leak the private key. this is how we get to hsm

maybe another one is assume you will lose access to the hsm. sure spinning up a new trust chain is annoying but it wouldn’t take that long to do. totally agree this post is overblown

anecdotal13y ago

spinning up a new trust chain is not so hard, but deploying that trust chain to thousands of servers around the world when your automation tool isn't available to do it with is really, really hard.

1 more reply

donohoe3y ago· 4 in thread

Taking it with a pinch of salt, but this stuff does happen.

I've received calls from past employers, usually when they migrate a site I worked on to a new CMS or platform. There is some critical service (AWS, CDN credentials, domain related) etc. that no one knows who has access... Happily those appear to get resolved... but this... yikes (if true)

plorg3y ago

In a possibly more pedestrian example, my organization needed a re-mailer service set up and found out that the IT worker previously tasked with administration for that service had the MFA set up for his personal phone. I think they eventually got a hold of him to coordinate transfer of credentials, but knowing him, there was a 50% chance he could have left the company on bad terms and would have made things quite a bit more difficult.

tedivm3y ago

I had something similar happen when I left a company, only I'm fairly consistent on deleting credentials to systems I'm not supposed to have access to. Fortunately it was for an internal service and nothing customer facing, so they were able to wipe and redeploy.

Volundr3y ago

One of the first things I do when leaving a company is remove all credentials from my password manager. Sure they should disable my accounts, but on the off chance they don't I still want it clear I don't have access.

It doesn't have to be a departure on bad terms, if they needed my TOTP codes I can't help them. That secret is already gone.

ilyt3y ago

Funnily enough putting it in configuration management (like Puppet) can make it nice and automatic.

But, well, if you fuck up your CM...

hayst4ck3y ago· 3 in thread

I think people who work in reliability see this type of thing as the real existential threat to twitter. It's unrealistic that a large infrastructure would fall over overnight, but what is very realistic is small problems being neglected until they become big problems, or multiple problems happening at the same time.

This alone is probably manageable, it might even be simple but painful to handle for 2-15 of twitters employees (pre-firing) with specialized knowledge. If 3 people knew the disaster recovery plan and they all got fired because they were so busy maintaining things and fighting fires that they failed to get good reviews by building things, well I wouldn't be surprised. Likewise the employees trusted with extreme disaster recovery mechanisms are not the poor souls on H1Bs who don't have the option of leaving easily, so the people trusted with access might have already jumped ship since they aren't being coerced into staying on board with a mad man.

The real existential threat is another problem compounding on top of this or a disastrous recovery effort. Auto-remediation systems could do something awful. A master database could fall over and a replica be promoted, but if that happens twice, 4 times? Without puppet to configure replacement machines appropriately, there could be a very real problem very quickly. Similarly, extremely powerful tools, like a root ssh key, might be taken out, but those keys do not have seat-belts and one command typed wrong could be catastrophic. Sometimes bigger disasters are made trying to fix smaller ones.

Puppet can be in the critical path of both recovery (via config change) and capacity.

ep1033y ago

That's okay, Musk tweeted that Twitter needs a complete, green-field rewrite 5 days ago, I'm sure that will solve the problem.

zamnos3y ago

How many engineers would that take Michael? 4, hardcore, over the weekend?

karmakaze3y ago

Whenever I hear specifics about likely ways things could fail, I always see a plan. "Hey this all makes sense, lets focus on having these areas covered before they come to pass."

Same goes when someone lists all the reasons why a proposal isn't viable. "Great, so we'll address those and be golden then?" Often they list them as fact without considering (or the ability to imagine) that they could be made viable with additional effort.

ilyt3y ago· 1 in thread

Oh, I know that problem, we did change Puppet root CA due to mishap of one of the admins during updating to sha256 certs. But IIRC (it was long time ago) Puppet CA cert by default are issued for like 10-20 years, would be a bit weird if true. Also, old versions didn't had trust chain "just" root CA so puppet master would have to have key for that on disk anyway, proper "root CA + leaf CA for puppetmasters" have been a thing for just few years in Puppet.

It would only be really problematic if they also lost SSH access to those machines using Puppet. If you have root access the fix is not exactly hard.

But then they fired people that did had access so that might also be a problem

We made sure all of our machines can be accesses both by Puppet and by SSH kinda for that reason; we had both accidents of someone fucking up Puppet, and someone fucking up SSH config rendering machines un-loggable (the lessons were learned and etched in stone).

So really, depending on who has access to what, it can be anything from "just pipe list of hosts to few ssh commands fixing it" to "get access manually to the server and change stuff, or redeploy machine from scratch". Again, assuming muski boy didn't fire wrong people

brazzy3y ago

> It would only be really problematic if they also lost SSH access to those machines using Puppet. If you have root access the fix is not exactly hard.

> But then they fired people that did had access so that might also be a problem

Oh my, wouldn't that be delicious...

Gotta wonder how you'd go about fixing that, though. Assuming that those people's access was also tied to their employment and irrevocably voided when they were fired: I guess it would depend on how well those machines are secured against attackers with access to the hardware.

walrus013y ago· 1 in thread

very possibly bullshit but huge if true: https://twitter.com/davidgerard/status/1634633886712954881

nunez3y ago

wouldn't put it past him since he only wanted "builders" and I bet he doesn't consider platform ops "builders" (even though they build tons of stuff; twitter's platform was basically a product in and of itself)

jtvjan3y ago

The certificate for Twitter's hidden service expired a full week ago and they still haven't exchanged it.

castillar763y ago

The really interesting part of this is what else is tied to that CA. If it’s just Puppet, it’s bad enough; internal PKIs have a habit of metastasizing into lots of other places, though, precisely because everything internal trusts them. Worst-case here is that some piece of the internals of the Twitter app relies on things from that CA—-for instance, it relies on packages to do app config changes or updates and the packages have to be signed from that chain or served from something with a cert from it. In that case they’d be hosed: you’d have to replace every copy of the Twitter app. Fairly unlikely, but wouldn’t be the first time I’ve seen it happen.

Beyond that, though: Internal build systems? Data encryption? User client auth to critical services? Internal app mTLS for data exchanges? The list of possibilities goes on and on…

gorjusborg3y ago

https://archive.is/vCoD6

manv13y ago

It's pretty amazing that a formerly public company like Twitter had such shitty documentation/processes/infrastructure.

I thought SOX mandated this sort of internal controls - after all, Twitter basically seems to be full of infrastructure risks that would (and have) negatively impacted them financially in a material way.

No key access? Why didn't they print it out and stick it in a safe deposit box, which is what a couple of startups I've been with have done...along with a couple of other key pieces of paper. Physical backup.

dpkirchner3y ago

Wonder if their servers all share a common NTP server/pool (that they control).

mariusor3y ago

Since about 3-4 days I had issues with using tweetdeck, which doesn't load in Firefox with a key pinning failure. I'm not sure it's related, but it seems like to large of a coincidence not to.

yuppie_scum3y ago

That Mastodon server has a load time problem.. took a solid 30 seconds for me to load

j / k navigate · click thread line to collapse

76 comments

50 comments · 14 top-level

DrScientist3y ago· 12 in thread

If this is true - who knows - then it reflects rather badly on the people who were fired - as they didn't implement safeguard for a 'run over by a bus' scenario when they were in charge.

rsynnott3y ago

If someone takes a hammer to an electricity distribution board and electrocutes themselves, the normal response is not "well, that's the electrician's fault; they should have thought of that".

If true, this would "reflect rather badly" on exactly one person. But, y'know, it'll need to join a rather long queue of poorly reflecting things.

zer0tonin3y ago

There's "run over by a bus" and "90% of the company got ran over by a bus" scenarios. The second one is rarely worth implementing.

a28002763y ago

Not to forget the "90% get fired my an egotistical maniac who expressed his distain by quite publicly calling them lazy useless pieces of shit" scenario. That scenario is also seldom considered.

walrus013y ago

1 more reply

nindalf3y ago

It reflects rather badly on you that you're talking mad shit without knowing their circumstances. What's the bus factor on your systems? Can they handle literally every person being fired overnight?

fredoralive3y ago

Or they had a "run over by a bus" scenario that assumed that the entire team wasn't going to be run over by a bus all at the same time?

tedivm3y ago

delfinom3y ago

It's management job to plan for run over by a bus scenario, not the people whose job is to actually implement things.

dragonwriter3y ago

raizer883y ago

Maybe building it right cost 5x, and you have a budget for 1x. Sometime money is not unlimited even at FAANG

DoctorOW3y ago

2 more replies

ilyt3y ago

We have 7 racks and 3 people working in ops and built Puppet setup "right". It's not hard. And their setup was probably right too.

Just that nobody plans for "bus hit our entire ops team"

1 more reply

8organicbits3y ago· 8 in thread

I'll take the rumor with a grain of salt, but can anyone unpack what the recovery plan would be for something like this? It would obviously be a big problem, but where would you even start?

jon-wood3y ago

threeseed3y ago

If you read through the guide [1] it requires you to have sudo access to bounce the puppet process on the client nodes.

Thus leaving the only option being to physically visit each server at the datacenter and issue the commands.

[1] https://www.puppet.com/docs/puppet/5.5/ssl_regenerate_certif...

justsomeadvice03y ago

1 more reply

hayst4ck3y ago

I would be surprised if twitter did not have a physical vault with a USB drive with a root SSH key on it. With that you can do just about everything.

I would be most terrified of machine churn. Auto-remediation systems or elastic capacity systems can result in lost capacity that can't come back until the configuration problem is resolved.

ilyt3y ago

Create new root CA, ssh to machine, remove old certs, re-add machine to Puppet, sign the new CSR on Puppet master, then it will download new root.

Very simple operation... if you have working SSH access with root. If they don't, well...

walrus013y ago

Or the VMware esxi emulated graphical console, etc.

2 more replies

yuppie_scum3y ago

If they're in the cloud, it's pretty straightforward to re-mount the drive somewhere else and replace the SSH keys.

1 more reply

yuppie_scum3y ago

parasense3y ago· 7 in thread

> Musk fired everyone with access to the private key to their internal root CA,

The way forward is to generate a new CA root certificate.

> and they can no longer run puppet because the puppet master's CA cert expired

They can reconfigure internal tools to use the new CA root certificate, or rather one of the signed intermediate certificates.

> and they can't get a new one because no one has access.

They can simply generate new CA root certificates, and sign or create new intermediate certificates.

> They no longer can mint certs.

Yes, they, can...

> My limited understanding in this area is that this is...very bad

No, it, is, not...

There are two immediate issues that come to mind.

spaced-out3y ago

All system require people to run.

MichaelZuo3y ago

According to the latest publicly available sources there are still many hundreds of folks on the active payroll at Twitter, do you know of any evidence to the contrary?

Volundr3y ago

Significant edits for clarity.

If we tie it to the HR system, whoever admins that effectively has the keys to the kingdom. Same for Active Directory or any other technical solution.

caltheon3y ago

Something like the nuclear football is probably the only answer. Something very obvious that is transferred with the role

1 more reply

ryan_lane3y ago

There's absolutely no way in hell you should allow your board to have access to these things.

foolfoolz3y ago

when building secure systems one of the key principles is assume someone will leak the private key. this is how we get to hsm

maybe another one is assume you will lose access to the hsm. sure spinning up a new trust chain is annoying but it wouldn’t take that long to do. totally agree this post is overblown

anecdotal13y ago

spinning up a new trust chain is not so hard, but deploying that trust chain to thousands of servers around the world when your automation tool isn't available to do it with is really, really hard.

1 more reply

donohoe3y ago· 4 in thread

Taking it with a pinch of salt, but this stuff does happen.

plorg3y ago

tedivm3y ago

Volundr3y ago

It doesn't have to be a departure on bad terms, if they needed my TOTP codes I can't help them. That secret is already gone.

ilyt3y ago

Funnily enough putting it in configuration management (like Puppet) can make it nice and automatic.

But, well, if you fuck up your CM...

hayst4ck3y ago· 3 in thread

Puppet can be in the critical path of both recovery (via config change) and capacity.

ep1033y ago

That's okay, Musk tweeted that Twitter needs a complete, green-field rewrite 5 days ago, I'm sure that will solve the problem.

zamnos3y ago

How many engineers would that take Michael? 4, hardcore, over the weekend?

karmakaze3y ago

Whenever I hear specifics about likely ways things could fail, I always see a plan. "Hey this all makes sense, lets focus on having these areas covered before they come to pass."

ilyt3y ago· 1 in thread

It would only be really problematic if they also lost SSH access to those machines using Puppet. If you have root access the fix is not exactly hard.

But then they fired people that did had access so that might also be a problem

brazzy3y ago

> It would only be really problematic if they also lost SSH access to those machines using Puppet. If you have root access the fix is not exactly hard.

> But then they fired people that did had access so that might also be a problem

Oh my, wouldn't that be delicious...

walrus013y ago· 1 in thread

very possibly bullshit but huge if true: https://twitter.com/davidgerard/status/1634633886712954881

nunez3y ago

jtvjan3y ago

The certificate for Twitter's hidden service expired a full week ago and they still haven't exchanged it.

castillar763y ago

Beyond that, though: Internal build systems? Data encryption? User client auth to critical services? Internal app mTLS for data exchanges? The list of possibilities goes on and on…

gorjusborg3y ago

https://archive.is/vCoD6

manv13y ago

It's pretty amazing that a formerly public company like Twitter had such shitty documentation/processes/infrastructure.

dpkirchner3y ago

Wonder if their servers all share a common NTP server/pool (that they control).

mariusor3y ago

Since about 3-4 days I had issues with using tweetdeck, which doesn't load in Firefox with a key pinning failure. I'm not sure it's related, but it seems like to large of a coincidence not to.

yuppie_scum3y ago

That Mastodon server has a load time problem.. took a solid 30 seconds for me to load

j / k navigate · click thread line to collapse