> "I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
This is an approach every one of us should internalize.
When something is failing, find a midpoint between where things are working and where the bug is manifesting. Do you see evidence of the bug there? If so, look earlier in the pipeline. If not, look later. Repeat.
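That midpoint search is just binary search over the pipeline's stages. A minimal sketch, where `stages` and `bug_visible` are hypothetical stand-ins for "checkpoints you can inspect" and "did the bad behaviour show up here yet":

```python
def find_failing_stage(stages, bug_visible):
    """Binary-search an ordered pipeline for the first stage at which
    the bug is visible. `bug_visible(stage)` is whatever inspection you
    can do at that point: a packet capture, a log grep, a DB query.
    Assumes the bug, once present, stays present downstream.
    """
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if bug_visible(stages[mid]):
            hi = mid        # bug already manifest here: look earlier
        else:
            lo = mid + 1    # still clean here: look later
    return stages[lo]       # first stage where the bug appears
```

With ten stages, this needs about four inspections instead of ten, and the gap widens fast as the pipeline grows.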
In my experience, this process is what primarily distinguishes the people who flail around looking for a root cause from the people who can rapidly come to an answer.
Mostly, though, I think people "flail" because they don't know the pipeline well enough to even do that. I know I've been in that position before, when approaching completely new (to me) systems. (Sometimes there isn't someone more knowledgeable you can ask!) That's where I find hypothesis -> test -> refine particularly useful. You're still wrong far, far more often than you're right, but it stops feeling like flailing, and more like making progress towards understanding the system well enough to apply other techniques (whatever they might be) more smartly.
Part of why it's so useful is that you hardly have to understand anything about the system internally. If you don't already have a working case, just reduce the complexity of what you're doing until it works, to find the lower bound.
That random guessing is like gambling - you hope for a big quick payout but when your hypothesis fails, you end up worse off than before. Wasted time and no closer to the solution.
I've wondered why this isn't second nature to engineers, junior or otherwise.
Maybe they don't really understand the pipeline? ("I enter the value in the web form and it just appears in the database.")
Or senior engineers
If you're on this site and haven't already internalized it...how do you debug?
More datacenter stakeholders kept joining the call, most of whom had nothing to do with our data product. Many times I heard people ask "have they found the problem yet" as though.. what? We were the best tech support they had for an entire data center going dark? After an hour somebody noticed that the clocks on servers in the datacenter didn't match up with their laptop; shortly after that I was able to extricate myself from the call... still watching the logs, their downloads started working again a short while later.
Possible. Some companies are mostly lacking in competent technical people, so anyone who knows what they're doing will quickly find themselves pulled into every possible task; I see no reason why this shouldn't include external parties.
I remember some time ago, one of our customers had trouble with our on-prem installation. Eventually it turned out that the database had been corrupted. At that point I could tell the poor guy on the other side was following what I was doing with my colleague on a couple of other, similar systems, and I noticed he was getting pretty nervous, so I figured as long as the clock runs, whatever. It's kinda what I do a lot at work, and I don't like leaving people out in the rain like that.
And eventually we could confirm that large numbers of VMDKs had been corrupted in various ways. It seemed that another vendor had let the SAN they were managing run full, or into some other catastrophic situation. And their backup appliance also didn't work.
I had dealt with these people for several years. In my company, I changed teams several times.
But I still got emails inviting me to internal meetings: they thought? assumed? hoped? ...that I was an internal consultant.
This was pursuant to a larger project and I was on the call for that one and pointed out that I was no longer on the implementation team but was thrilled to be there... because I was. That final / initial deployment resulted in WTFs and I said "woot!" and dropped the call.
long story short, Dell ship-of-theseus'd an entire machine looking for an issue that only happened on cloudy-hot days when the disks were under high load. It was an air conditioner out of phase with the rest of the system causing EMI that the power supply just let on through.
I was getting an awful buzzing/static sound in my speakers. I went through my chain one component at a time, unplugging each and seeing whether the noise went away.
As it turns out, my PC's video card was barfing electrical noise through every port... including the ethernet port. Unfortunately I didn't know the difference between unshielded and shielded twisted pair and had used shielded by mistake.
That shielded twisted pair allowed the noise to go out of my GPU, into the motherboard, through the ethernet port, then down to my ethernet switch. From there, the switch connected to the raspberry pi I used for streaming, where it helpfully forwarded that noise straight into the DAC and therefore the rest of the chain.
I tell you, that drove me nuts!
A time server with a defective clock seems to be a serious problem. Zimmie says the time server was an appliance; so someone is selling as an appliance a time server that can't tell the time.
It didn't take me long to figure out that the computers that weren't working had their clocks set well into the 21st century. The shell couldn't even display the year properly, I assumed a Y2K incompatibility, but after so many years now I can't remember exactly what I saw.
Anyway, easy fix, but I never did find out what caused such a weird glitch in their environment. It's small wonder that many people aren't fluent with computers: they misbehave in such a wide variety of ways.
Oh no.
Using that API avoids sudden jumps in time. The cost is that if a correction is required then the system time will be incorrect until the difference settles to zero. And you ideally need some PID control so that the system time settles quickly to match the "correct" external time.
For example you can spread a 1 second adjustment over an hour. Sometimes being up to one second out is less problem than a sudden jump of one second.
It is useful to have time monotonically increasing if you have software that depends on time differences (e.g. timestamps stored in logging systems).
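The monotonic-clock point can be made concrete. In Python, `time.monotonic()` is the clock to use for measuring intervals, since `time.time()` follows the wall clock and can be stepped backwards by NTP. (For scale: a 1-second correction slewed over an hour works out to roughly 278 µs of adjustment per second.)

```python
import time

# time.time() follows the wall clock and can jump backwards when NTP
# steps it; time.monotonic() is guaranteed never to go backwards, so
# durations computed from it (e.g. log-timestamp deltas) stay positive.
start = time.monotonic()
time.sleep(0.01)                    # stand-in for the work being timed
elapsed = time.monotonic() - start
assert elapsed > 0                  # holds even across an NTP step
```

The same distinction exists in most environments: `CLOCK_MONOTONIC` vs `CLOCK_REALTIME` on Linux, `QueryPerformanceCounter` vs the system time on Windows.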
Not sure if Microsoft gimped the API after XP - this note seems bad: "Currently, Windows Vista and Windows 7 machines will lose any time adjustments set less than 16." That makes it difficult to use the API to keep the time steadily and closely synchronised.
Well, it happens rarely enough and we have workaround, so the bug report still sits in the queue behind more urgent problems. Haven't read the source yet, which is what you should do on Linux of course...
A PKI provides a deeply technical solution to a hard problem you probably don't have. This technology is most often deployed when somebody has a different, easy problem, but they don't like the relatively easy non-technical solution.
Of course, the fact that the problems were overhyped but, importantly, FIXED by all that money doesn't come into it; it was a cry-wolf situation.
Every VM ran CentOS, and every one of them hit the default CentOS ntp servers. These are run by volunteers. The pool is generally good quality but using it the way we did was extremely stupid.
Every few weeks we'd have one of these "events" where hundreds of VMs in a data center would skew, causing havoc with authentication, replication, and clustering. We also had an alert that would notify the machine owner if drift exceeded some value. If that happened in the middle of the night, the oncall from every single team would get woken. And if they simply "acked" the alert and went back to sleep, the drift would continue, and by morning their service would almost certainly be suffering.
Whatever about diagnosing the cause, I started by writing a script that executed a time fix against a chosen internal server, just to resolve the immediate issue. I also converted the spam alert into one that Sensu (the monitoring/alerting system we used) would aggregate into a single alert to the fleet ops team. In other words, if >2% of machines were skewed by more than a few seconds, warn us. At >4%, go critical. (Only critical alerts would page oncall outside sociable hours.)
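The aggregation rule described above (warn past 2% of the fleet skewed, go critical past 4%) is simple to sketch. Every name and threshold here is illustrative, not Sensu's actual API:

```python
def skew_alert_level(offsets, threshold_s=5.0, warn_frac=0.02, crit_frac=0.04):
    """Collapse per-host clock offsets into one fleet-level alert.

    offsets: clock skew in seconds for each host. The thresholds are
    hypothetical values mirroring the policy described above: warn when
    more than 2% of hosts are skewed, critical past 4%.
    """
    skewed = sum(1 for o in offsets if abs(o) > threshold_s)
    frac = skewed / len(offsets)
    if frac > crit_frac:
        return "critical"
    if frac > warn_frac:
        return "warning"
    return "ok"
```

One aggregated alert per fleet is what stops every team's oncall being paged for the same underlying NTP event.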
Long story short, we switched to chrony, because unlike ntpd we could convince it to "just fix the damn time", because ntpd would refuse to correct the time if the jump was too big, and would just drift off forever until manually fixed. (No amount of config hacking and reading 'man ntpd' got around this). We also chose two bare-metal servers in each data center to work as internal NTP servers, reducing the possibility of DOSing these volunteer NTP servers and getting our IP range blacklisted or fed dud data. Problem solved right there, and we also ended up with better monitoring of our time skew across our fleet.
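For reference, the chrony behaviour described here ("just fix the damn time") comes from its `makestep` directive. A minimal config sketch, with placeholder server names for the internal bare-metal NTP servers:

```
# /etc/chrony.conf sketch (server names are hypothetical)
server ntp1.internal.example iburst
server ntp2.internal.example iburst

# Step the clock outright if the offset exceeds 1 second during the
# first 3 updates after startup, instead of slewing forever the way
# ntpd does by default when the jump is too big.
makestep 1.0 3
```

With a limit of `-1` instead of `3`, chrony will step at any time, not just at startup, which is the blunter "always fix it" variant.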
(Just turn off JavaScript to read it if you hit a paywall).
What? So there are no CRLs between 900B and 51KB, and the first one larger than 51KB just happened to be the median one??
Or they meant mean for the first one, I guess.
Edit: it's the former, from the paper:
> We immediately observe that half of all CRLs are under 900 B. However, this statistic is deceiving: if you select a certificate at random from the Leaf Set, it is unlikely to point to a tiny CRL, since the tiny CRLs cover very few certificates.