The symptom was that it started "typing" automatically.
My group got everything set up, built our program, and everything worked fine. We waited a few minutes for the TA to verify, but it failed. We changed a few things, it worked, but it failed again when he came over.
Another group looked over our code, no issues noticed.
Finally I realized I was standing when we were testing things. I sat down waiting for the TA to verify. My shadow blocked the sun from the photo eye. Wasted half the lab on an issue that was entirely dependent on our position in the room, but found the root cause.
I don't think it was entirely wasted time, though. You can't plan to teach that kind of lesson ("Look outside of your usual blinkered problem-solving-space"), it happens when it happens.
As this whole thread shows, most of us learn it during our careers at some point but you were lucky enough to learn it before you even started.
What clued you in to this possibility?
There's a period of about 10 days every spring and fall where, for up to 30 minutes every day, the sun transits 'behind' a satellite within the beamwidth of the dish and totally overwhelms the signal at the LNB.
It was not just that the server crashed, but the ventilation proved to be bad enough that one of the SCSI drives (which should have been in a RAID, but wasn't...) wouldn't start.
They ended up opening it to try to kickstart it manually (been there, done that myself; had a drive survive 6 months with me "helping" the motor spin it up every morning; yes I backed everything up very regularly during that period), finding the drive head had gotten stuck to whatever material covered the platter. They ended up putting the drive in an oven while connected and heating it until it spun up, at which point they dumped what data they could.
Later I worked with an IT manager who also confirmed it was such-and-such bank in Tønsberg. I have forgotten the name of the bank but I am still on friendly terms with him so I could ask next time I see him.
(My first draft of this post said that the first bloke had claimed to be in the room, but this is 15 - 16 years ago and the next story is also close to a decade ago so I might have mixed up who said what.)
There was an outlet in the hallway, right outside the glass window looking into the server room... you guessed it -- plug in a cleaning machine, blow the fuse, take down servers...
The duct would vibrate when the air was on, and the corner was pretty sharp, which caused the duct corner to 'saw' its way through the cable's insulation over time.
Took a while to isolate the problem to 'it's between this box and this box' but was a pretty quick find after that :)
A few months ago our garage door openers started working like normal again. That was great until we realized it was because the inverter had failed. When the inverter was replaced, our door problems started again.
(That's how I fixed mine.)
Somewhat infamously, "a rare alignment of sunlight on high-altitude clouds above North Dakota and the Molniya orbits of the satellites" the Soviets used for their nuclear attack early warning system triggered a false alarm, which, had it been treated as a real situation, could have led to nuclear war in the early 80s.
https://en.wikipedia.org/wiki/1983_Soviet_nuclear_false_alar...
Amazing story!
Reason for the mysterious network outage? Thermal contraction! The observatory was connected to the Internet via an optical link to a highrise building in the city that contracted ever so slightly due to the very low temperature, moving the laser beam of the optical link out of alignment and shutting down the connection.
also they don't do as well in rain as something that can adapt modulation.
If internet went out on any campus we would go and replace the lightbulb left on the transmitter. A couple minutes later all would be good.
I've set up a couple of monitoring systems at a couple of different companies and one thing I've heard some people saying is that they don't care about "fancy graphs", they just want a dashboard of what is red and what is green.
This might be a manager vs engineer perspective, because for me the graphs are the main point: it allows me to spot
- patterns (each night, each weekend, some weekends, more-or-less-randomly-except)
- and also trends: at this speed we are going to reach 80% utilization before November.
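For the "trends" point, a minimal sketch of the kind of extrapolation a graph lets you do by eye: fit a straight line to utilization samples and see when it crosses a threshold. The sample numbers and function names below are made up for illustration, and real capacity growth is rarely this linear.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def day_reaching(threshold, xs, ys):
    """Day on which the fitted line reaches `threshold`, or None if it never will."""
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None  # flat or shrinking: no crossing to predict
    return (threshold - intercept) / slope


# Hypothetical weekly samples: day number and percent utilization.
days = [0, 7, 14, 21, 28]
used = [50.0, 52.0, 54.0, 56.0, 58.0]

eta = day_reaching(80.0, days, used)  # roughly day 105 for this made-up data
```

A red/green dashboard only tells you once the 80% line has been crossed; the fitted slope is what tells you it is coming.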
I have often found it surprisingly difficult (in spite of being an Engineer myself) to read and interpret the various graphs in monitoring dashboards when I don't know what I am looking for. This is ten times harder for most "Manager" types.
Unfortunately we didn't have the hardware or enough control over the link (it took negotiating access with armed forces to work on either end) to try to implement any of their ideas.
Meh. Just the old 49.7 days cycle that it takes to overflow 32 bits when measuring milliseconds.
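For anyone who wants to check the arithmetic, here is a quick sketch. It assumes an unsigned 32-bit millisecond tick counter; the function names are mine and not from any particular system.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # 86,400,000 ms in a day
WRAP = 2 ** 32                     # an unsigned 32-bit counter wraps here

# 4,294,967,296 ms / 86,400,000 ms per day is about 49.71 days.
days_until_wrap = WRAP / MS_PER_DAY


def naive_elapsed_ms(start_tick, now_tick):
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now_tick - start_tick


def elapsed_ms(start_tick, now_tick):
    """Modular subtraction: correct as long as less than one full wrap has passed."""
    return (now_tick - start_tick) % WRAP


# A reading taken 100 ms before the wrap, and another 50 ms after it.
start, now = WRAP - 100, 50
broken = naive_elapsed_ms(start, now)   # large negative number
fixed = elapsed_ms(start, now)          # 150
```

The usual fix is the modular subtraction (or a 64-bit counter), which is why these bugs tend to show up only on machines with long uptimes.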
I was hoping for an "it works when I buy vanilla ice cream and doesn't when I buy other flavours".
> Just the old 49.7 days cycle...
I've encountered datetime bugs and learned to take preventative measures.
I generally add a virtual clock shim to my projects, eg wrapping System.currentTimeMillis() or equiv.
Then I write unit tests for anticipated edge cases. Like midnight, end/start of year, etc. To ensure reporting, rollups, logging, grooming, etc. are working correctly.
It also allows me to simulate elapsed time, so I can verify out-of-order event processing and so forth.
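A minimal sketch of what such a clock shim might look like, in Python rather than Java. All the class names here are hypothetical; `SessionTracker` is just a stand-in for whatever code needs the time.

```python
import time


class SystemClock:
    """Production clock: the only place that touches the real time source."""

    def now_ms(self) -> int:
        return int(time.time() * 1000)


class FakeClock:
    """Test clock: time only moves when the test says so."""

    def __init__(self, start_ms: int = 0) -> None:
        self._now_ms = start_ms

    def now_ms(self) -> int:
        return self._now_ms

    def advance(self, delta_ms: int) -> None:
        self._now_ms += delta_ms


class SessionTracker:
    """Hypothetical consumer: a session expires after `timeout_ms` of inactivity."""

    def __init__(self, clock, timeout_ms: int) -> None:
        self._clock = clock
        self._timeout_ms = timeout_ms
        self._last_seen_ms = clock.now_ms()

    def touch(self) -> None:
        self._last_seen_ms = self._clock.now_ms()

    def expired(self) -> bool:
        return self._clock.now_ms() - self._last_seen_ms >= self._timeout_ms


# In a unit test, simulate elapsed time instead of sleeping.
clock = FakeClock(start_ms=1_000)
session = SessionTracker(clock, timeout_ms=5_000)
clock.advance(4_999)
still_alive = not session.expired()   # one millisecond short of the timeout
clock.advance(1)
now_expired = session.expired()       # exactly at the timeout
```

Production code would be handed a `SystemClock`; tests pass a `FakeClock` started just before midnight, a year boundary, or whatever edge case is being checked.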
The quality of the scanned books was excellent, except for a weird distortion every so often where part of the page would be shifted partway through as if someone had shifted half the page in Photoshop. This was only noticed in books over a certain size so people were checking to see if there was some kind of mechanical problem with the scanner (these were robots with automatic page turners so it was plausible that there could be something which was only an issue past a certain position), and trying to figure out if the software had some kind of memory leak or other issue which would explain the long and inconsistent intervals.
Eventually they were on a long-distance phone call to Moscow and not turning up anything when there was a loud rumble in the background. “What was that?” led to the realization that the library's scan center was close to a subway tunnel. The vibration of a passing train was enough to cause a glitch but only if you happened to be scanning at the exact time it went by: the reason longer books were noticed was simply because having more pages meant that at any point in time a long book was more likely to be sitting in the scanner and the technicians running the scanner were apparently tuning out the trains as background noise. This was reportedly the first project they'd done with one of the scan robots which can process an entire book unattended so it was plausible that smaller past projects simply hadn't been scanning frequently enough to hit this problem or that some previous technician had noticed and immediately redone the page.
Nobody could figure it out so they called in an expert.
After lots of attempts and figuring, one day the person in question happens to look out the window at the time in question ... and sees a service truck park exactly in line-of-sight between the business and their internet-signal pickup broadcast point.
Ah ha!
why this would occur i'll never know. (probably old telephone wiring wrapped around old 110v wiring? maybe? or who knows what kind of weird leakage/ground loops may have existed)
Finally I made it to tier three, with someone who seemed obviously competent. Within about a minute, he checked the power usage on my modem and then historically, and knew immediately that if I moved my modem to another outlet, it would work.
It did. Never had that type of connection issue again.
Such a weird thing to troubleshoot when you have a few people living in the same house.
Much troubleshooting later, it turns out that when they'd been doing some maintenance in the lift shaft (which was also used to drop the inter-building 10Base-5 yellow snake), they'd managed to shoot a nail through the Ethernet cable and we now had a nice 50Hz hum on the cable.
Retries in TCP made that work, but ARP doesn't have retries, so if that managed to get faded out, you'd hope to get lucky next time...
When I worked for BBN in '97-'98, someone from outside the company as I recall came to talk to a room of engineers about the wide variety of calendar-related behaviors in various UNIX systems that were expected to cause problems for Y2K.
It was a very, very long list, often subtle issues, and I recall the concern in the room about the number of old systems in use by the DoD and others.
Anyway, no real point to this other than date handling is one of the hardest things to get right in computing, ranking right behind testing for the correct behavior.
The date bug I committed with the longest tail was daylight time. It was all good until we got to a day with 25 hours when we "fell back."
spoiler: Helium messes with MEMS oscillator, causing iPhones to stop working (the clock signal is basically flatlined)
There was a big problem where we needed to upgrade the fans to deal with the heat dissipation, but it was destroying the performance of the spinning disk HDDs due to the vibration of the fans.
(these were 2U devices with 5 boards: 2 control-plane boards (1 active, 1 on stand-by for redundancy) & 3 data-plane boards (2 active, 1 stand-by))
If it was a big problem, that must not have been viable? Too cramped?
There was a program I heard about back in the 90s which would literally crash depending on the phase of the moon!
The story is that it wanted to print a date. The programmer happened to have an astronomy library available that gave a string containing the date. So the programmer called that, and then parsed out the date.
Unfortunately the astronomy library wrote its result as a string to a pointer. The result included the phase of the Moon. The buffer behind that pointer was not declared to be long enough, so the program would crash if the name of the phase of the moon was too long!
A little bit disappointing to discover that the code from the article does not actually depend on the phase of the moon. I'm really interested to see the other stories here where it actually is the case that the phase of the moon is affecting people's code.
God bless gusts of random math people leave about.
One day, I started receiving calls (through my pager!) from rather many people about intermittent networking problems. The state of the art 10 Mbit wired UTP network would have frequent bursts of 90% packet loss.
What was weird: only people on the fifth floor would have this issue..!? Our first thought was that they were on a single hub/switch that might have broken. But no, they were connected to the same uplinks as the computers on the problem-free surrounding floors. Furthermore, laptop users (who were of course also wired at the time) were reporting no problems whatsoever.
We were pretty much out of ideas by that point, but did an experiment just to test our assumptions: we took a PC and hooked it up with a long network cable and a power extension cable on the fourth floor and started pinging it. Flawless. Then we started walking up the stairs, and, yes indeed, somewhere around halfway up the stairs packets started to drop. (But not at all times, sometimes it would be fine, like all PCs on the fifth.)
If you want to guess at the cause, this is your chance. :-)
We brought in a company specialized in EM interference. It turns out that a GSM antenna, placed about half a year earlier on the roof of the four-story building opposite ours, had just been turned on. Its height aligned with our fifth floor. Whenever someone was using this mast to make a call (which certainly wasn't all of the time back then), it would cause interference on a specific model of network card that we were using in all of our PCs. It had a relatively large metal component that was apparently a pretty good 900 MHz antenna.
When confronted, the mobile operator quickly adjusted the antenna to not be directed at us. I believe all network cards were replaced soon after. Fun times!
We spent about a week trying to debug the system and the software and at a certain point while I was just sitting and thinking about what to do next, Flying Toasters popped up in the data logging PC (the lid was normally closed because of the space on the bench).
The Windows screensaver was hogging so much CPU that the datalogger couldn't keep up.
My personal anecdote. I like playing online games, and as you know latency is the killer. I enjoyed playing in the evenings after work, and inexplicably I started noticing my latency spike from around 50ms to > 1s. Extremely frustrating.
I had no idea what caused this so I set up a simple ping command and had it save it to a graph.
Well, the next day I noticed the pings were steady throughout the whole day, then in the evenings I'd get these chunks of bad time. It turns out when my wife would watch Netflix in the other room (and it was only Netflix), it'd cause something to go awry with the router and latency would spike for me. (The really weird thing was that it was a combination of a Roku, Netflix, and a wired switch - change any of those and the problem went away).
Later during the pandemic, I also diagnosed drop-outs on my network due to kids in my neighborhood being online during school hours. Like clockwork I'd get a bad network from around 10pm and it'd be fine ending around 3 or 4. On school holidays and weekends my network was fine.
The problem was that they lived on the coast, and a subsurface junction box would get wet during king tides, causing the telephone line to fail.
Yep! Because foone did a whole month-long Twitter thread on it, even had a livestream showing the crash.
It used Unix timestamps (seconds since 1970) and assumed they could only be 9 decimal digits. When the time reached 10 digits, the last digit was quietly dropped.
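A small sketch of that failure mode, for illustration only. The truncating function is hypothetical and just mimics "keep nine digits, silently drop the rest"; it is not the actual code from the product in question.

```python
from datetime import datetime, timezone


def store_timestamp_9_digits(ts: int) -> int:
    """Hypothetical buggy storage: keeps only the first 9 decimal digits."""
    return int(str(ts)[:9])


last_ok = 999_999_999        # last 9-digit Unix timestamp
first_bad = 1_000_000_000    # first 10-digit Unix timestamp

# The rollover to 10 digits happened on 2001-09-09 at 01:46:40 UTC.
rollover = datetime.fromtimestamp(first_bad, tz=timezone.utc)

# Before the rollover the value survives; after it, the last digit is lost
# and the stored time jumps decades into the past.
kept = store_timestamp_9_digits(last_ok)
mangled = store_timestamp_9_digits(first_bad)
mangled_date = datetime.fromtimestamp(mangled, tz=timezone.utc)
```

Nothing crashes at the point of truncation, which is what makes this kind of bug quiet: the stored value is still a perfectly valid timestamp, just for the wrong decade.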
(It was fixed within a few days.)
Months of debugging, dozen people involved, tens of thousands of devices bricked, tens of millions lost.
All due to a single line of code that configured the flash to not require a special magic sequence before each command. This feature, meant to improve resistance to interference, also hindered performance. Somebody thought it a good idea to disable it to gain some performance points.
What I love about these problem-solving anecdotes is how a seemingly totally different domain is the key to the solution. It always makes me marvel at how interconnected everything in our World is. Strengthens my belief that "Cross-Disciplinary" knowledge is where "Wisdom" lies and is the key to our Future.
“From a drop of water, a logician could infer the possibility of an Atlantic or a Niagara without having seen or heard of one or the other. So all life is a great chain, the nature of which is known whenever we are shown a single link to it.” --- Sherlock Holmes in A Study in Scarlet
Sounds like a case of werecode
The most notable exception would be languages which allow negative indexing, but IMHO if that were syntactic instead of relying on actual signed integers, it would be safer (I.e., [- $int] would be a different syntax from [(-$int)] and the latter would not be correctly typed.)