> "I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
This is an approach every one of us should internalize.
When something is failing, find a midpoint between where things are working and where the bug is manifesting. Do you see evidence of the bug there? If so, look earlier in the pipeline. If not, look later. Repeat.
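That midpoint search is just binary search over the pipeline's stages. A minimal sketch, where `stages` and `bug_visible` are hypothetical stand-ins for "checkpoints you can inspect" and "did the bad behaviour show up here yet":

```python
def find_failing_stage(stages, bug_visible):
    """Binary-search an ordered pipeline for the first stage at which
    the bug is visible. `bug_visible(stage)` is whatever inspection you
    can do at that point: a packet capture, a log grep, a DB query.
    Assumes the bug, once present, stays present downstream.
    """
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if bug_visible(stages[mid]):
            hi = mid        # bug already manifest here: look earlier
        else:
            lo = mid + 1    # still clean here: look later
    return stages[lo]       # first stage where the bug appears
```

With ten stages, this needs about four inspections instead of ten, and the gap widens fast as the pipeline grows.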
In my experience, this process is what primarily distinguishes the people who flail around looking for a root cause from the people who can rapidly come to an answer.
Mostly, though, I think people "flail" because they don't know the pipeline well enough to even do that. I know I've been in that position before, when approaching completely new (to me) systems. (Sometimes there isn't someone more knowledgeable you can ask!) That's where I find hypothesis -> test -> refine particularly useful. You're still wrong far, far more often than you're right, but it stops feeling like flailing, and more like making progress towards understanding the system well enough to apply other techniques (whatever they might be) more smartly.
Part of why it's so useful is that you hardly have to understand anything about the system internally. If you don't already have a working case, just reduce the complexity of what you're doing until it works, to find the lower bound.
That random guessing is like gambling - you hope for a big quick payout but when your hypothesis fails, you end up worse off than before. Wasted time and no closer to the solution.
I've wondered why this isn't second nature to engineers, junior or otherwise.
Maybe they don't really understand the pipeline? ("I enter the value in the web form and it just appears in the database.")
Or senior engineers
If you're on this site and haven't already internalized it...how do you debug?
More datacenter stakeholders kept joining the call, most of whom had nothing to do with our data product. Many times I heard people ask "have they found the problem yet" as though.. what? We were the best tech support they had for an entire data center going dark? After an hour somebody noticed that the clocks on servers in the datacenter didn't match up with their laptop; shortly after that I was able to extricate myself from the call... still watching the logs, their downloads started working again a short while later.
Possible. Some companies are mostly lacking in competent technical people, so anyone who knows what they're doing will quickly find themselves pulled into every possible task; I see no reason why this shouldn't include external parties.
I remember some time ago, one of our customers had trouble with our on-prem installation. Eventually it turned out that the database had been corrupted. At that point I could tell the poor guy on the other side was following what I was doing with my colleague on a couple of other, similar systems, and I noticed he was getting pretty nervous, so I figured as long as the clock runs, whatever. It's kinda what I do a lot at work, and I don't like leaving people out in the rain like that.
And eventually we could confirm that large numbers of VMDKs had been corrupted in various ways. It seemed that another vendor had let the SAN they were managing run full, or into some other catastrophic situation. And their backup appliance also didn't work.
I had dealt with these people for several years. In my company, I changed teams several times.
But I still got emails inviting me to internal meetings: they thought? assumed? hoped? ...that I was an internal consultant.
This was pursuant to a larger project and I was on the call for that one and pointed out that I was no longer on the implementation team but was thrilled to be there... because I was. That final / initial deployment resulted in WTFs and I said "woot!" and dropped the call.
long story short, Dell ship-of-theseus'd an entire machine looking for an issue that only happened on cloudy-hot days when the disks were under high load. It was an air conditioner out of phase with the rest of the system causing EMI that the power supply just let on through.
I was getting an awful buzzing/static sound in my speakers. I went through my chain one component at a time, unplugging each and seeing whether the noise went away.
As it turns out, my PC's video card was barfing electrical noise through every port... including the ethernet port. Unfortunately I didn't know the difference between unshielded and shielded twisted pair and had used shielded by mistake.
That shielded twisted pair allowed the noise to go out of my GPU, into the motherboard, through the ethernet port, then down to my ethernet switch. From there, the switch connected to the raspberry pi I used for streaming, where it helpfully forwarded that noise straight into the DAC and therefore the rest of the chain.
I tell you, that drove me nuts!
A time server with a defective clock seems to be a serious problem. Zimmie says the time server was an appliance; so someone is selling as an appliance a time server that can't tell the time.
It didn't take me long to figure out that the computers that weren't working had their clocks set well into the 21st century. The shell couldn't even display the year properly, I assumed a Y2K incompatibility, but after so many years now I can't remember exactly what I saw.
Anyway, easy fix, but I never did find out what caused such a weird glitch in their environment. It's small wonder that many people aren't fluent with computers: they misbehave in such a wide variety of ways.
Oh no.
Using that API avoids sudden jumps in time. The cost is that if a correction is required then the system time will be incorrect until the difference settles to zero. And you ideally need some PID control so that the system time settles quickly to match the "correct" external time.
For example you can spread a 1 second adjustment over an hour. Sometimes being up to one second out is less problem than a sudden jump of one second.
It is useful to have time monotonically increasing if you have software that depends on time differences (e.g. timestamps stored in logging systems).
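The monotonic-clock point can be made concrete. In Python, `time.monotonic()` is the clock to use for measuring intervals, since `time.time()` follows the wall clock and can be stepped backwards by NTP. (For scale: a 1-second correction slewed over an hour works out to roughly 278 µs of adjustment per second.)

```python
import time

# time.time() follows the wall clock and can jump backwards when NTP
# steps it; time.monotonic() is guaranteed never to go backwards, so
# durations computed from it (e.g. log-timestamp deltas) stay positive.
start = time.monotonic()
time.sleep(0.01)                    # stand-in for the work being timed
elapsed = time.monotonic() - start
assert elapsed > 0                  # holds even across an NTP step
```

The same distinction exists in most environments: `CLOCK_MONOTONIC` vs `CLOCK_REALTIME` on Linux, `QueryPerformanceCounter` vs the system time on Windows.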
Not sure if Microsoft gimped the API after XP - this note seems bad: "Currently, Windows Vista and Windows 7 machines will lose any time adjustments set less than 16." That makes it difficult to use the API to keep the time steadily and closely synchronised.
Well, it happens rarely enough and we have workaround, so the bug report still sits in the queue behind more urgent problems. Haven't read the source yet, which is what you should do on Linux of course...
A PKI provides a deeply technical solution to a hard problem you probably don't have. This technology is most often deployed when somebody has a different, easy problem, but they don't like the relatively easy non-technical solution.
Of course, the fact that the problems were overhyped but, importantly, FIXED by all that money doesn't come into it; it was a cry-wolf situation.
Every VM ran CentOS, and every one of them hit the default CentOS ntp servers. These are run by volunteers. The pool is generally good quality but using it the way we did was extremely stupid.
Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication, and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the oncall from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.
Whatever about diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us. At >4%, go critical. (Only critical alerts would page oncall outside sociable hours.)
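The aggregation rule described above (warn past 2% of the fleet skewed, go critical past 4%) is simple to sketch. Every name and threshold here is illustrative, not Sensu's actual API:

```python
def skew_alert_level(offsets, threshold_s=5.0, warn_frac=0.02, crit_frac=0.04):
    """Collapse per-host clock offsets into one fleet-level alert.

    offsets: clock skew in seconds for each host. The thresholds are
    hypothetical values mirroring the policy described above: warn when
    more than 2% of hosts are skewed, critical past 4%.
    """
    skewed = sum(1 for o in offsets if abs(o) > threshold_s)
    frac = skewed / len(offsets)
    if frac > crit_frac:
        return "critical"
    if frac > warn_frac:
        return "warning"
    return "ok"
```

One aggregated alert per fleet is what stops every team's oncall being paged for the same underlying NTP event.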
Long story short, we switched to chrony, because unlike ntpd we could convince it to "just fix the damn time", because ntpd would refuse to correct the time if the jump was too big, and would just drift off forever until manually fixed. (No amount of config hacking and reading 'man ntpd' got around this). We also chose two bare-metal servers in each data center to work as internal NTP servers, reducing the possibility of DOSing these volunteer NTP servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of our time skew across our fleet.
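For reference, the chrony behaviour described here ("just fix the damn time") comes from its `makestep` directive. A minimal config sketch, with placeholder server names for the internal bare-metal NTP servers:

```
# /etc/chrony.conf sketch (server names are hypothetical)
server ntp1.internal.example iburst
server ntp2.internal.example iburst

# Step the clock outright if the offset exceeds 1 second during the
# first 3 updates after startup, instead of slewing forever the way
# ntpd does by default when the jump is too big.
makestep 1.0 3
```

With a limit of `-1` instead of `3`, chrony will step at any time, not just at startup, which is the blunter "always fix it" variant.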
(Just turn off JavaScript to read it if you hit a paywall).
What? So there are no CRLs between 900B and 51KB, and the first one larger than 51KB just happened to be the median one??
Or they meant mean for the first one, I guess.
Edit: it's the former, from the paper:
> We immediately observe that half of all CRLs are under 900 B. However, this statistic is deceiving: if you select a certificate at random from the Leaf Set, it is unlikely to point to a tiny CRL, since the tiny CRLs cover very few certificates.