If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that were feasible, it seems horrible to maintain long-term.
Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.
Note that it wouldn't help in this instance, as the bug is caused by the amount of time a drive was running. Different manufacturers would work, yes.
Which were probably installed and started up at nearly the same time. Oops.
This bug has the potential of simultaneously damaging whole sets of servers, if they were bought and installed in bulk. Dark day indeed.
This seems incredibly rich. If you have a bunch of this kit, and you don't immediately shut it down to apply firmware updates, then HPE wash their hands of the consequences.
If you use disks from the same batch in a RAID, they would all begin to fail around the same time, because all of them have the same lifetime more or less.
That's a curious bit of context. It seems to imply they're shifting some of the blame onto their manufacturer? It makes me wonder if this firmware is 100% HPE specific, or if there is a 2^15 hours bug about to bite a bunch of other pipelines.
Some advice I've read is to only use unsigned integers if you want to explicitly opt-in to having overflow be defined behavior.
Our industry really sucks. We need languages where this can't happen, and we need testing procedures where these things are caught. I wonder if software is the industry with the lowest quality:importance ratio.
I didn't choose YOUR sub-vendor. You did. It's your responsibility to ensure that sub-vendor is operating at your standards. Passing blame to a sub-vendor indicates an unwillingness to take accountability.
I mean, that's no shock coming from HP.
Do they still not test these things with artificially incremented counters?
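For what it's worth, a test like that is cheap to write. Here is a minimal sketch, assuming a hypothetical firmware model that keeps power-on hours in a signed 16-bit field (`PowerOnCounter` and `first_regression` are invented for illustration and are not HPE's code). It fast-forwards simulated hours instead of waiting years, and checks that the reported value never goes backwards:

```python
from typing import Optional


class PowerOnCounter:
    """Hypothetical model of an hour counter stored in a signed 16-bit field."""

    def __init__(self) -> None:
        self.raw = 0  # the 16 bits as they would sit in the field

    def tick(self) -> None:
        # Advance by one hour, keeping only the low 16 bits.
        self.raw = (self.raw + 1) & 0xFFFF

    def hours(self) -> int:
        # Interpret the 16 bits as a two's-complement signed value.
        return self.raw - 0x10000 if self.raw & 0x8000 else self.raw


def first_regression(simulated_hours: int) -> Optional[int]:
    """Return the first simulated hour at which the reading goes backwards."""
    counter = PowerOnCounter()
    previous = counter.hours()
    for elapsed in range(1, simulated_hours + 1):
        counter.tick()
        current = counter.hours()
        if current < previous:
            return elapsed
        previous = current
    return None


print(first_regression(40_000))
```

The point is that the counter is driven by the test rather than by a real clock, so about four and a half years of simulated uptime run in a fraction of a second.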
https://www.engadget.com/2015/05/01/boeing-787-dreamliner-so...
Not that throwing an exception on integer overflow is any better, unless you catch the exception. The classic example here is the Ariane 5 failure:
http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html
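To make the "unless you catch the exception" part concrete: the Ariane 5 report describes an unhandled error raised while converting a 64-bit floating-point value to a 16-bit signed integer. Below is a rough Python sketch of the same shape of problem (the flight software was Ada; every function name here is hypothetical, and saturating is just one possible fallback, not what the report prescribes):

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed range, raising if the value does not fit."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return n


def convert_unprotected(value: float) -> int:
    # No handler: an out-of-range input propagates up and kills the caller.
    return to_int16_checked(value)


def convert_protected(value: float) -> int:
    # Handler present: fall back to the nearest representable value.
    try:
        return to_int16_checked(value)
    except OverflowError:
        return INT16_MAX if value > 0 else INT16_MIN


print(convert_protected(40_000.0))
```

Trapping the overflow only helps if something sensible happens in the handler; `convert_unprotected(40_000.0)` detects the problem just as reliably and still takes the system down.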
"On September 20, 2013, NASA abandoned further attempts to contact the craft. According to chief scientist A'Hearn, the reason for the software malfunction was a Y2K-like problem. August 11, 2013, 00:38:49, was 2^32 tenth-seconds from January 1, 2000, leading to speculation that a system on the craft tracked time in one-tenth second increments since January 1, 2000, and stored it in an unsigned 32-bit integer, which then overflowed at this time, similar to the Year 2038 problem."
https://en.wikipedia.org/wiki/Deep_Impact_(spacecraft)#Conta...
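The arithmetic in that quote is easy to check. A quick sketch, assuming the epoch is midnight on January 1, 2000 and ignoring time zones and leap seconds (the variable names are mine):

```python
from datetime import datetime, timedelta

EPOCH = datetime(2000, 1, 1)  # assumption: counter starts at midnight, no time zone
TICKS = 2 ** 32               # first value that no longer fits in an unsigned 32-bit field

# One tick is a tenth of a second, i.e. 100 milliseconds.
rollover = EPOCH + timedelta(milliseconds=TICKS * 100)
print(rollover)
```

That lands on August 11, 2013 at 00:38:49 (plus 0.6 s), matching the timestamp in the article.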
>Most importantly, the company's already working on an update that will patch the software vulnerability -- though there's no word on when its jets will receive it.
My search of DDG turned up nothing about a resolution. Anyone know?
I know what I would recommend, but marketing would not like it ;-)
Finally, I found an obscure forum post telling me about a firmware bug happening at ~5K hours of disk usage. I updated the firmware and haven't had an issue since.
That's just 71 days of uptime and they hang. There are tens of thousands of these drives deployed as well.
It looks like such a bug isn't necessarily SSD specific if it completely bricks the drive.
And while 32,768 hours may seem like a long time for a drive, it's under 4 years of continuous operation. Not unheard of if used in a NAS.
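For anyone who wants the exact figure, here is a small sketch of the conversion, assuming 365-day years and ignoring leap days (`breakdown` is just an illustrative helper):

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days ignored


def breakdown(total_hours: int) -> tuple:
    """Split a power-on-hours figure into (years, days, hours)."""
    days, hours = divmod(total_hours, HOURS_PER_DAY)
    years, days = divmod(days, DAYS_PER_YEAR)
    return years, days, hours


print(breakdown(32_768))  # (3, 270, 8)
```

So roughly 3 years and 9 months of powered-on time.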
Maybe. OTOH, plenty of people have been running spinning rust drives with way more than 4 years of power-on operation - if this bricking bug was common there, I'm pretty sure we would've noticed. SSDs are a newer tech and it's more common to replace them anyway as specs improve.
It doesn't need to be continuous. Total operation is the metric.
http://www.stbsuite.com/support/virtual-training-center/powe...
If you look elsewhere on the Internet, you'll find people with very old and working HDDs that have rolled over, so I suspect this bug is limited to a small number of drives.
(What that page says about not being able to reset it is... not true.)
Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.
Tangentially related, I wonder how many modern cars will stop working once the odometer rolls over.
If the firmware crashes during boot with a negative hour counter, it could probably only be fixed by manually flashing new firmware over JTAG.
Since I run a SMART test every month it is easy to track the hourly progression (and thus rollover) in the event log as the events are reported in POH timing.
Regarding recovery: The FTL is likely toast, in which case while the data is probably unharmed and still there, it's basically a giant block-sized jigsaw puzzle. With enough effort, and if all the stars align - sure, you might be able to recover some/all of it.
Regarding un-bricking/reset: Potentially there is no longer any access to the wear-level data from that time, so the future integrity/reliability is kind of dubious.
Looks like some sort of run time stored in a signed 2 byte integer. Oops.
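To illustrate what that looks like at the boundary, here is a short sketch that reinterprets an hour count as a signed two-byte value (`as_int16` is a made-up helper; whether the real firmware stores the field exactly this way is an assumption):

```python
def as_int16(value: int) -> int:
    """Reinterpret the low 16 bits of value as a signed two-byte integer."""
    low_bits = (value & 0xFFFF).to_bytes(2, "little")
    return int.from_bytes(low_bits, "little", signed=True)


for hours in (32_766, 32_767, 32_768, 32_769):
    print(hours, "->", as_int16(hours))
```

The last value that fits is 32,767; one hour later the same bits read as -32,768.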
Yes, this means that a field meant for diagnosing failures was responsible for a failure. Oops.
If only SSD vendors would do the usual cost-cutting measure of loading firmware from the host computer, this could be trivially fixed.
Our Sun workstations were very stable though.
https://fwupd.org/lvfs/vendors/ https://fwupd.org/lvfs/devices/
The system is quite a lot older than fwupd and usually less flaky. Google for hpsum or HP SPP.
/s
Maybe HP and HPE were more tightly connected 3 years, 270 days, 8 hours ago.
How does this work legally? For one, how would HPE prove that the customer read the bulletin? I don't imagine they're sending these out via certified mail.