- February 11th: Vendor informed of the issue
- February 25th: 28 people die because of the issue
- February 26th: The vendor ships a fix
I'd have loved to be a fly on the wall for that phone call on the 25th (or early on the 26th).
Feb 21: notice goes out to users to avoid "very long run times". Users do not know what that means, and ignore the warning.
https://www.gao.gov/assets/220/215614.pdf (page 9)
"On February 21, 1991, the Patriot Project Office sent a message to Patriot users stating that very long run times could cause a shift in the range gate, resulting in the target being offset. The message also said a software change was being sent that would improve the system’s targeting. However, the message did not specify what constitutes very long run times. According to Army officials, they presumed that the users would not continuously run the batteries for such extended periods of time that the Patriot would fail to track targets. Therefore, they did not think that more detailed guidance was required."
> According to Army officials, the delay in distributing the software from the United States to all Patriot locations was due to the time it took to arrange for air and ground transportation in a wartime environment.
I know nothing about how software for missile batteries was distributed from the US to the Persian Gulf in 1991, but 11 days doesn't seem unreasonable to me.
It makes me kind of sick imagining that call.
Aren't defense contractors required to be on their toes all the time?
EDIT → Found it. The PATRIOT Project Office.
A more truthful "computer bugs that killed people" example would be the Therac-25, a machine intended to treat cancer with tightly focused radiation therapy. In six known accidents, patients received massive overdoses of radiation, on the order of 20,000 rads, and several of them died. It was possible for the machine to end up in a state where it delivered full-power radiation without a hardware shield in place to protect the rest of the patient's body. No hardware interlocks were used to ensure that the full-power mode was only usable with the shield in place; all safety features relied on software. In addition, the bug was only possible when an operator made a mistake in mode selection and then rapidly (proficiently) corrected it, and the rapidity required prevented the bug from being discovered during slow, methodical, careful testing.
See Hackaday's article "Killed by a Machine" (and the associated HN discussion) or, for the especially curious, a 49-page post-mortem for more detail:
https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...
At the time, this incident really stuck out because it broke the illusion of our fabled Patriot missile shield protecting us. Civilian expats really believed the inflated Patriot interception rates parroted to us by mainstream media and our American military expat buddies.
A large number of remaining expats who had stuck out the Gulf War to that point decided to pack it in and leave when word got out that the Dhahran barracks were hit. Although history shows that Iraq surrendered days after this incident, at the time there was heightened fear and confusion amongst the remaining expats, especially the non-Americans.
We left on the last Lufthansa flight (crewed by military personnel) after hearing about this.
Nostalgic edit:
During the Gulf War embassies issued equipment and rations to expat citizens who chose to stay behind. Americans were issued full body suits (for adults and youths) due to the biological and chemical weapon payloads that Saddam boasted his SCUDs were carrying, along with MREs that tasted fabulous! In stark contrast, Commonwealth citizens were issued a bare gas mask (adult size only) and mono-flavour MREs that tasted like cardboard.
The British embassy sticks out in my mind: with stern stone-faced expressions they admonished us all for not evacuating and thus endangering children in a war zone. In addition to the terrible rations and gas masks, they wordlessly gave us a stack of translucent stickers. When asked what they were for, embassy staff explained that in the event of the air siren going off, we should get under our sturdiest tables and don our gas masks (standard procedure), and then slap the stickers on. If the stickers changed colour, it meant we were in the presence of a biochemical agent and would have approximately 10 seconds before we died a horrific death.
You kind of had to be there to appreciate the grim humour.
While most of us were cowering under our desks and tables during SCUD attacks, some of our American civilian friends were out with their families in the desert trying to film the Patriots "intercepting" the SCUDs and driving out to try and pick up pieces of debris.
I look back upon those days with fondness and gratitude, especially for the American forces that served.
1. I remember hearing the system was only designed for XX operational hours but was being run over the operational spec.
2. The time was stored in base 10, so the calculation errors added up over time or something like that. If they had used some base 2 timing scheme, it wouldn't have had issues with rounding errors.
My class was in the mid nineties, so the details in my 25-year-old memory are pretty hazy at best.
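For what it's worth, the usual account of the arithmetic can be sketched in a few lines. This is a rough illustration, not the actual Patriot code: it assumes the widely cited analysis in which the clock counted tenths of a second as an integer and multiplied that count by a fixed-point approximation of 1/10 that kept 23 bits after the binary point. The register layout and the Scud speed of roughly 1,676 m/s are assumptions taken from that account, and the function names are made up for the example.

```python
from fractions import Fraction

FRACTION_BITS = 23      # assumed: bits kept after the binary point
TICKS_PER_SECOND = 10   # the clock counts tenths of a second
SCUD_SPEED_M_S = 1676   # assumed target speed, for illustration only


def stored_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 chopped (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(uptime_hours: float, bits: int = FRACTION_BITS) -> float:
    """Seconds by which the computed time lags real time after `uptime_hours`."""
    ticks = Fraction(uptime_hours) * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - stored_tenth(bits)
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    for hours in (8, 20, 100):
        drift = clock_drift_seconds(hours)
        print(f"{hours:>3} h uptime: drift {drift:.4f} s, "
              f"about {drift * SCUD_SPEED_M_S:.0f} m of target travel")
```

Under those assumptions, 100 hours of uptime gives a drift of about a third of a second. 1/10 has no finite binary expansion, so any fixed number of bits leaves a residue; more bits shrink the per-tick error, but the drift still grows linearly with uptime.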
http://catless.ncl.ac.uk/Risks/13/35#subj1.1
http://catless.ncl.ac.uk/Risks/13/76#subj8.1
And in 1997:
If an EMT fails to save a victim of a car crash, did he/she kill the victim? If the dispatcher misspoke and gave the wrong cross street, delaying aid, did the dispatcher kill them?
You can use a gravitational map that only accounts for latitude, but it isn't as precise.
So using an accurate clock is really important if your intent is to hit a missile with a missile.
[0] http://www.slate.com/articles/news_and_politics/war_stories/...
It now makes much more sense to me that a (terrible) mishap had occurred and possible prevention was only a reboot away. I can see how being exposed to that context at upper levels could easily cause one to latch onto any perceived preventative measures.
I also once saw a short NTP time step across multiple clusters (yes, simultaneously) shut down half of a wafer factory.
Time is important, but rebooting all your systems at midnight probably will not help you control it, especially if there are large, hot, fast objects flying around in the night sky. And definitely, really, don't reboot ALL of them at the same time every day, especially during, you know, battle. /pro-tip
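To make the pro-tip concrete, here is a minimal sketch of staggering reboots so that no two units are down at once. Everything in it is hypothetical: the unit names, the 24-hour period, and the 90-second reboot time are illustrative assumptions, and it only handles the simple case where a reboot is shorter than the gap between consecutive slots.

```python
def staggered_offsets(unit_names, period_s):
    """Spread one reboot per unit evenly across a period of `period_s` seconds."""
    step = period_s / len(unit_names)
    return {name: i * step for i, name in enumerate(unit_names)}


def max_units_down(offsets, reboot_s):
    """Largest number of units rebooting at the same instant.

    Assumes each reboot finishes before the period ends (no wrap-around).
    """
    events = []
    for start in offsets.values():
        events.append((start, 1))              # unit goes down
        events.append((start + reboot_s, -1))  # unit comes back
    down = peak = 0
    # Tuples sort by time first; at equal times a -1 sorts before a +1,
    # so a unit coming back is counted before the next one goes down.
    for _, delta in sorted(events):
        down += delta
        peak = max(peak, down)
    return peak


if __name__ == "__main__":
    units = ["alpha", "bravo", "charlie", "delta"]  # hypothetical unit names
    everyone_at_midnight = {name: 0 for name in units}
    staggered = staggered_offsets(units, period_s=24 * 3600)
    print("all at midnight:", max_units_down(everyone_at_midnight, reboot_s=90))
    print("staggered:      ", max_units_down(staggered, reboot_s=90))
```

With four units, the midnight schedule takes all four down together, while the staggered one never has more than one unit down.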
Mods should change this. The "software fix" was a software patch that corrected the clock bug.
The "software workaround" to use before the fix arrived was a reboot.
I hate editorialized, lying titles :(
I don't mean to second-guess them in an area I know so little about, but if that was enough to cause a serious issue in the span of only a few days, shouldn't the devices be designed with a separate synchronization system, at least as a backup? Maybe GPS?
Which brings up a sort of interesting question...would a Patriot missile system even have receivers for a weak public signal like GPS, or is it all self-contained?
But the rest of your point boils down to "if you know your system has a flaw, why not mitigate it?" But of course at design time they didn't know it had this flaw.
I'd assume it could, since GPS is military, and a mobile missile system is exactly the sort of thing that wants to know where it is, so it would have the keys to the (higher resolution) encrypted GPS signals as well.
There is no single answer. This argument is the result of black-and-white, yes/no, us-vs-them, single-point-of-blame thinking, and it's terrible.
The bug contributed to the loss of life.
This is a strictly technical examination of the proximate cause of their deaths; it makes no claims about their ultimate cause. Whether or not a missile system with an accurate clock might have hit the target, it is unambiguous that this one missed specifically because of clock drift.
I feel "preventable deaths" is a preferable focus over "cause of death".
This just isn't so. Also, the degree of reliability that is reasonable to expect is different in a missile defense system vs the toy your grandma uses to browse Facebook.
It had to be rebooted because a bug caused it to be increasingly inaccurate the longer it was booted up. This was always broken. Rebooting wasn't an acceptable fix, because you manifestly can't trust users to do it, as shown by the 28 corpses. It was, however, probably the best that could be done on short notice.
Taking 60-90s completely out of protection to reboot a critical defensive system when someone might, at any moment, toss a Mach 5 projectile at you from a couple hundred miles away is a far-from-ideal fix, even if it had been communicated properly to the end users (which it wasn't).
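The trade-off can be put into rough numbers. The sketch below is only a back-of-the-envelope model: the per-tick error assumes the commonly cited analysis (1/10 chopped to 23 binary fractional digits, ten ticks per second), the 0.05-second drift tolerance is a made-up figure for illustration, and the 90-second reboot is the upper end of the range mentioned above.

```python
ERROR_PER_TICK_S = 0.8 / 2 ** 23  # assumed: 1/10 chopped to 23 fractional bits
TICKS_PER_HOUR = 10 * 3600        # clock counts tenths of a second


def max_uptime_hours(tolerance_s: float) -> float:
    """Longest uptime before accumulated clock drift exceeds `tolerance_s`."""
    drift_per_hour = ERROR_PER_TICK_S * TICKS_PER_HOUR
    return tolerance_s / drift_per_hour


def downtime_fraction(uptime_hours: float, reboot_s: float) -> float:
    """Share of each run-then-reboot cycle spent offline."""
    cycle_s = uptime_hours * 3600 + reboot_s
    return reboot_s / cycle_s


if __name__ == "__main__":
    tolerance_s = 0.05  # hypothetical acceptable drift, for illustration only
    uptime = max_uptime_hours(tolerance_s)
    offline = downtime_fraction(uptime, reboot_s=90)
    print(f"reboot at least every {uptime:.1f} h; "
          f"offline {offline:.3%} of the time")
```

Even a small offline fraction is a window with no protection at all, which is the problem with the workaround: the tighter the drift tolerance, the more often that window opens.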
https://embeddedgurus.com/barr-code/2014/03/lethal-software-...