HPE Drive fail at 32,768 hours without firmware update

(support.hpe.com)

229 points · abarringer · 6y ago · 114 comments

abarringerOP6y ago· 10 in thread

Since most drives are started and used concurrently this bug would blow any RAID set up. There's a dark day coming for some sysadmins.

vidarh6y ago

This is why I never use drives from the same batch, ideally never the same model, and usually not the same manufacturer. It happens way too regularly that drives start failing around the same time.

AdamJacobMuller6y ago

I don't see the value in that in most cases, honestly.

If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that was feasible that seems horrible to maintain long-term.

Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.

ghaff6y ago

Good luck doing that at scale though. You can mix things up to some degree (and probably should) but if you need thousands of drives you're going to end up with lots from the same batch.

GrayShade6y ago

> I never use drives from the same batch

Note that it wouldn't help in this instance, as the bug is caused by the amount of time a drive was running. Different manufacturers would work, yes.

Piskvorrr6y ago

Amen. Bought a pair of brand new disks some years ago, which failed days into the deployment...apparently from a submarine batch. Luckily the array had another, older disk, which kept it up until a replacement arrived.

HorstG6y ago

Different SSD vendors is impossible with HP servers and controllers, they only talk to their own expensive gear. So the disk diversity option is off the table for HP customers.

cesarb6y ago

That's only if the sysadmin was trusting a single server with the data, instead of a pair of redundant servers.

Which were probably installed and started up at nearly the same time. Oops.

This bug has the potential of simultaneously damaging whole sets of servers, if they were bought and installed in bulk. Dark day indeed.

abarringerOP6y ago

We have a cluster of four nodes that were all set up and brought online within hours of each other. The entire cluster could blow up within a couple of hours if not patched.

angrygoat6y ago

> By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

This seems incredibly rich. If you have a bunch of this kit, and you don't immediately shut it down to apply firmware updates, then HPE wash their hands of the consequences.

manls6y ago

One of the general best practices is to have diversity in the array of drives. It's not meant for bugs like this, although it helps with them. It's to ensure that not all disks fail at the same time.

If you use disks from the same batch in a RAID, they would all begin to fail around the same time, because all of them have the same lifetime more or less.

verytrivial6y ago· 10 in thread

> HPE was notified by a Solid State Drive (SSD) manufacturer [...]

That's a curious bit of context. It seems to imply they're shifting some of the blame onto their manufacturer? It makes me wonder if this firmware is 100% HPE specific, or if there is a 2^16 hours bug about to bite a bunch of other pipelines.

q3k6y ago

Of course HPE doesn't write their own firmware from scratch. It's likely just whitelabeled by the drive manufacturer. HPE is just a reseller of existing OEM drives, like all other enterprise server manufacturers are.

tyingq6y ago

It is possible, though, that this firmware was written specifically for HPE by the OEM.

wtallis6y ago

This is a 2^15 hours bug, not a 2^16 hours bug. Odds are, somewhere in the SSD firmware source code, there's a missing "unsigned". And in the Makefile, probably a missing "-Wextra".
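
To make the 2^15 versus 2^16 distinction concrete, here is a minimal Python sketch that uses `ctypes` fixed-width integers to emulate a 16-bit hour counter. That the firmware keeps power-on hours in a 16-bit field is this thread's guess, not something the HPE bulletin states, and the function names are made up for illustration.

```python
import ctypes

def hours_as_signed16(power_on_hours: int) -> int:
    """Value seen if the count lives in a signed 16-bit field (assumption)."""
    return ctypes.c_int16(power_on_hours).value

def hours_as_unsigned16(power_on_hours: int) -> int:
    """Value seen if the same count lives in an unsigned 16-bit field."""
    return ctypes.c_uint16(power_on_hours).value

# Columns: true hours, signed reading, unsigned reading.
for hours in (32767, 32768, 65535, 65536):
    print(hours, hours_as_signed16(hours), hours_as_unsigned16(hours))
```

The signed reading goes negative at 32768 hours, while the unsigned reading keeps counting until it wraps to 0 at 65536.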

rocqua6y ago

Using signed integers for values that are always positive isn't necessarily a mistake. Most notably because for signed integers (in C and C++) overflow is undefined behavior. This allows for more aggressive optimizations by the compiler.

Some advice I've read is to only use unsigned integers if you want to explicitly opt-in to having overflow be defined behavior.

coldpie6y ago

> Odds are, somewhere in the SSD firmware source code, there's a missing "unsigned". And in the Makefile, probably a missing "-Wextra".

Our industry really sucks. We need languages where this can't happen, and we need testing procedures where these things are caught. I wonder if software is the industry with the lowest quality:importance ratio.

richthegeek6y ago

Presumably the bug would still be considered a bug if it occurred at 65536 hours? The incorrectly-signed bit only makes it appear sooner, but it's not the bug.

garaetjjte6y ago

2^16 would still be rather too small; 65536 hours is not an impossible duration.

SkyPuncher6y ago

I'm always baffled when companies try to pass these issues off onto one of their sub-vendors - especially for critical issues like this.

I didn't choose YOUR sub-vendor. You did. It's your responsibility to ensure that sub-vendor is operating at your standards. Passing blame to a sub-vendor indicates an unwillingness to take accountability.

olyjohn6y ago

> an unwillingness to take accountability.

I mean, that's no shock coming from HP.

icegreentea26y ago

I probably wouldn't even call it blame. Certainly if HPE isn't doing a full firmware audit (which I don't expect them to do), there's no way to run into this issue until it shows up in life testing. The manufacturer/supplier would be in the best position to encounter these types of issues first.

jzwinck6y ago· 7 in thread

Those who forget history are doomed to repeat it. Just seven years ago Crucial sold tens of thousands of their "M4" SSDs with a firmware bug that made them fail after 5184 hours: https://www.anandtech.com/show/5424/crucial-provides-a-firmw...

Do they still not test these things with artificially incremented counters?
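
A rough sketch of what such a test could look like: instead of waiting years, feed the code under test counter values on either side of every power-of-two boundary. `age_score` is a hypothetical stand-in for a firmware routine, with a signed 16-bit misread planted on purpose. Nothing here is taken from real SSD firmware.

```python
import ctypes
import math

def age_score(raw_hours: int) -> float:
    """Hypothetical routine under test: misreads the counter as signed 16-bit."""
    hours = ctypes.c_int16(raw_hours).value  # deliberate bug for the demo
    return math.sqrt(hours)  # raises ValueError once hours goes negative

def first_failure(routine, max_bits: int = 32, window: int = 2):
    """Probe counter values near each 2**n boundary; return the first that fails."""
    for bits in range(1, max_bits + 1):
        boundary = 2 ** bits
        for raw in range(boundary - window, boundary + window + 1):
            try:
                routine(raw)
            except (ValueError, ArithmeticError):
                return raw
    return None

print(first_failure(age_score))  # 32768
```

The sweep takes well under a second and lands on exactly the hour count from the bulletin.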

mhandley6y ago

Boeing didn't even test their 787 aircraft for integer overflows, and that's in a safety-critical environment, so I'm not sure I'd expect SSD vendors to be any better.

https://www.engadget.com/2015/05/01/boeing-787-dreamliner-so...

Not that throwing an exception on integer overflow is any better, unless you catch the exception. The classic example here is the Ariane 5 failure:

http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html

planteen6y ago

This same problem also led to the loss of the Deep Impact spacecraft on its extended mission:

"On September 20, 2013, NASA abandoned further attempts to contact the craft. According to chief scientist A'Hearn, the reason for the software malfunction was a Y2K-like problem. August 11, 2013, 00:38:49, was 2^32 tenth-seconds from January 1, 2000, leading to speculation that a system on the craft tracked time in one-tenth second increments since January 1, 2000, and stored it in an unsigned 32-bit integer, which then overflowed at this time, similar to the Year 2038 problem"

https://en.wikipedia.org/wiki/Deep_Impact_(spacecraft)#Conta...
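
The date arithmetic in the quoted passage is easy to check. The sketch below assumes only what the quote states: an unsigned 32-bit count of tenth-seconds starting at January 1, 2000 (taken here as midnight, which is an assumption).

```python
from datetime import datetime, timedelta

EPOCH = datetime(2000, 1, 1)   # assumed epoch, from the quoted passage
TICKS_PER_SECOND = 10          # one tick is a tenth of a second

def rollover_time(counter_bits: int) -> datetime:
    """Moment at which an unsigned counter of the given width wraps to zero."""
    ticks = 2 ** counter_bits
    # Integer milliseconds keep the arithmetic exact.
    return EPOCH + timedelta(milliseconds=ticks * 1000 // TICKS_PER_SECOND)

print(rollover_time(32))  # 2013-08-11 00:38:49.600000
```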

everybodyknows6y ago

The 787 case is most fascinating in that while the bug is dead simple, a fix is not.

>Most importantly, the company's already working on an update that will patch the software vulnerability -- though there's no word on when its jets will receive it.

My search of DDG turned up nothing about a resolution. Anyone know?

I know what I would recommend, but marketing would not like it ;-)

andonisus6y ago

This exact thing happened to me. I went crazy for a week straight testing every other component of my PC. I was convinced it was the graphics cards drawing too much power. Then it was clear that my OS was corrupted and needed to be reinstalled.

Finally, I found an obscure forum post telling me about a firmware bug happening at ~5K hours of disk usage. I updated the firmware and haven't had an issue since.

touisteur6y ago

I thought the coding/design pattern was to set the initial value of any counter 1 minute (or whatever eon makes sense in your application) from the roll-over so you'd see it 'right away' if it was badly handled. It's like you should use specific types with default values...
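
A sketch of that pattern, with hypothetical names: start a wrapping tick counter shortly before its rollover point, so that any code mishandling the wrap fails within a minute of startup rather than years later. The 16-bit width and the 60-tick margin are assumptions chosen for the example. The Linux kernel does something similar by starting `jiffies` a few minutes short of its wrap.

```python
class WrappingCounter:
    """Fixed-width tick counter that starts just short of its rollover."""

    def __init__(self, bits: int = 16, ticks_before_wrap: int = 60):
        self.modulus = 2 ** bits
        self.value = self.modulus - ticks_before_wrap

    def tick(self, n: int = 1) -> int:
        """Advance by n ticks, wrapping at the modulus."""
        self.value = (self.value + n) % self.modulus
        return self.value

    def elapsed_since(self, earlier: int) -> int:
        """Wrap-safe difference; valid while fewer than `modulus` ticks passed."""
        return (self.value - earlier) % self.modulus

counter = WrappingCounter()
start = counter.value   # 65476
counter.tick(90)        # crosses the rollover after 60 ticks
print(counter.value, counter.elapsed_since(start))  # 30 90
```

Code that compares raw values (`counter.value > start`) breaks here almost immediately, which is the point of the exercise.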

djsmiley2k6y ago

Some early Intel SSD's did the same thing, prior to M4's... haha

jzwinck6y ago

Do you have a link for the old Intel bug? Here's one for a new Intel bug after just 1700 power-on hours on some enterprise-class SSDs that are still being sold today: https://www.intel.com/content/www/us/en/support/articles/000...

That's just 71 days of uptime and they hang. There are tens of thousands of these drives deployed as well.

zozbot2346y ago· 7 in thread

Ouch. I wonder how many non-enterprise SSD's come with similar bugs, and zero support by the firmware vendor.

close046y ago

> neither the SSD nor the data can be recovered

It looks like such a bug isn't necessarily SSD specific if it completely bricks the drive.

And while 32,768 hours may seem like a long time for a drive, it's under 4 years of continuous operation. Not unheard of if used in a NAS.

zozbot2346y ago

> It looks like such a bug isn't necessarily SSD specific if it completely bricks the drive.

Maybe. OTOH, plenty of people have been running spinning rust drives with way more than 4 years of power-on operation - if this bricking bug was common there, I'm pretty sure we would've noticed. SSD's are a newer tech and it's more common to replace them anyway as specs improve.

lonelappde6y ago

It's the drive firmware. Drive firmware bricks the disk because the disk is soldered into the drive.

It doesn't need to be continuous. Total operation is the metric.

AnIdiotOnTheNet6y ago

Outside of the deep-pocket money-doesn't-matter sized enterprises, 4 years can be less than half the expected lifetime for IT kit.

junglecat6y ago

I'm guessing it's not a particularly productive way to store timestamps.

lonelappde6y ago

That's a very coincidental reason to go for a 3 year warranty.

rob746y ago

Unfortunately, even if you operate these drives continuously from the day you buy them, they will take 3 years, 270 days and 8 hours to fail (as someone else kindly calculated), so a 3-year warranty won't help you in this case...

userbinator6y ago· 5 in thread

According to this page, the SMART hour counter is only 16 bits, and rollover should be harmless:

http://www.stbsuite.com/support/virtual-training-center/powe...

If you look elsewhere on the Internet, you'll find people with very old and working HDDs that have rolled over, so I suspect this bug is limited to a small number of drives.

(What that page says about not being able to reset it is... not true.)

Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.

Tangentially related, I wonder how many modern cars will stop working once the odometer rolls over.

garaetjjte6y ago

>Likewise, I'm skeptical of "neither the SSD nor the data can be recovered"

If the firmware crashes during boot with negative hour counter, it probably could be only fixed by manually flashing new firmware over JTAG.

userbinator6y ago

...and likely some of the data recovery companies already know about and are prepared for this.

consp6y ago

Anecdotal so you don't have to look elsewhere: I can confirm that at least two of my NAS HD drives have rolled over once. Drives usually do nothing and spin up once every two weeks or so. No problems. Though SMART supposedly only has 16 bits, I also have one drive that reports more than 2^16 hours of operation and is still counting, so 16 bits is not universal.

Since I run a SMART test every month it is easy to track the hourly progression (and thus rollover) in the event log as the events are reported in POH timing.

kjetijor6y ago

> Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.

Regarding recovery: The FTL is likely toast, in which case, while the data is probably unharmed and still there, it's basically a giant block-sized jigsaw puzzle. With enough effort, and if all the stars align, sure, you might be able to recover some or all of it.

Regarding un-bricking/reset: Potentially, no longer any access to wear-levels at the time. So the future integrity/reliability is kind-of dubious.

pkaye6y ago

I used to develop SSD firmware. Remember that these things need to handle power loss at any point in time. We store lots of redundant copies of information on the NAND so it's just a matter of running the code that rebuilds everything.

voiper16y ago· 5 in thread

>The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.

Looks like some sort of run time stored in a signed 2 byte integer. Oops.
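
The bulletin's conversion checks out if a year is taken as 365 days. A quick sketch (the helper name is only illustrative):

```python
def hours_to_ydh(total_hours: int) -> tuple:
    """Split an hour count into (years, days, hours), using 365-day years."""
    days, hours = divmod(total_hours, 24)
    years, days = divmod(days, 365)
    return years, days, hours

print(hours_to_ydh(2 ** 15))  # (3, 270, 8)
print(hours_to_ydh(2 ** 16))  # (7, 175, 16)
```

The second line shows what an unsigned 16-bit field would have bought: roughly seven and a half years instead of a bit under four.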

cesarb6y ago

It's probably the SMART "hours of operation" field. I see no reason for anything else to be stored in units of hours instead of seconds.

Yes, this means that a field meant for diagnosing failures was responsible for a failure. Oops.

stefan_6y ago

But how does that brick the device? I guess the hour counter overflows, goes negative and that screws up a calculation later on, causing the firmware to crash (over and over again..)

If only SSD vendors would do the usual cost-cutting measure of loading firmware from the host computer, this could be trivially fixed.

mnw21cam6y ago

Please no. I may actually want to boot from one of those devices.

lonelappde6y ago

What do you mean? The fix is to load a firmware update from the host computer.

imtringued6y ago

Usually the SMART counter just wraps around back to 0. In this case it becomes negative because it was read as a signed short.

pjc506y ago· 4 in thread

Amazing. A repeat of the "Windows 95 crashes after 49.7 days uptime" and other timer rollover bugs.
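
The 49.7-day figure usually quoted for that Windows bug is what a 32-bit millisecond counter gives. A quick check, assuming that counter width and tick size:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def rollover_days(counter_bits: int, ticks_per_day: int = MS_PER_DAY) -> float:
    """Days until an unsigned counter of the given width wraps."""
    return 2 ** counter_bits / ticks_per_day

print(round(rollover_days(32), 2))  # 49.71
```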

macintux6y ago

I’ve always appreciated the humor of the fact that Win95 was so unstable that no one noticed this bug until years later.

kube-system6y ago

It also used to be very common for people to turn off their computers when they were done using them.

tyingq6y ago

Linux wasn't so great in 1995 either. We regularly rebooted for various kernel, ip stack, etc, bugs that would crop up after a fairly short amount of uptime.

Our Sun workstations were very stable though.

winrid6y ago

I must have had great luck. My sister and I had an HP Win95 machine for gaming growing up. It never crashed. But it also never got weird software etc.

pabs36y ago· 2 in thread

Would be nice if the standard firmware update mechanism on Linux (fwupd/LVFS) could be used for HPE products.

https://fwupd.org/lvfs/vendors/ https://fwupd.org/lvfs/devices/

jabl6y ago

This so much. Even if you hate uefi with the fire of a thousand suns, there are some good things there. Like GPT, and the UEFI capsule thing that fwupd uses.

HorstG6y ago

HP actually has a working firmware update mechanism for all their gear. It's a bootable Linux liveDVD that starts into a browser talking to a local Tomcat instance which applies necessary patches. For many cases it's also possible to invoke patching from your normal Linux installation. However, a reboot is mostly still necessary, e.g. for disk firmware which the controller applies after its own new firmware has been loaded (sometimes takes more than one reboot).

The system is quite a lot older than fwupd and less flakey usually. Google for hpsum or HP SPP

_bxg16y ago· 2 in thread

Whatever the counter is, the fact that it's 32,768 instead of 65,536 suggests they used a signed int for something that presumably starts at zero and increases monotonically... Avoiding just that mistake would've given them twice as much time - nearly 7.5 years - which seems like it'd be longer than these drives would typically last anyway.

imtringued6y ago

It would have avoided the problem in the first place because the SMART counter is allowed to roll over back to 0.

vortico6y ago

Maybe they're running on a 15-bit architecture where a signed int would be 16384?

bobowzki6y ago· 2 in thread

At the hospital where I work, almost all HP desktops crashed within a few months...

luma6y ago

This is for HPE (not HP, which is now a separate company). I haven't heard anything about HP (who makes desktops not storage arrays) experiencing this problem.

cotillion6y ago

Considering 900 drives out of 1800 crashed in HP computers at a Swedish hospital in the last few months I suspect there is a connection.

Maybe HP and HPE were more tightly connected 3 years, 270 days 8 hours ago.

S_A_P6y ago· 1 in thread

I just want to know how many of these failed at 32768 hours before they had their oh sh*t moment.

retrovm6y ago

Beats me but I happen to have a fleet of HP SATA (not SAS) drives and they just crossed this boundary, 32813 power on hours typical. I guess if their SATA firmware had this bug I'd be having a bad week.

gruez6y ago

>By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

How does this work legally? For one, how would HPE prove that the customer read the bulletin? I don't imagine they're sending these out via certified mail.

iveqy6y ago

Probably related to https://news.ycombinator.com/item?id=21471997

EvanAnderson6y ago

I did some recon on eBay looking for used units w/ the affected SKUs for sale and they appear to be Samsung units.

annoyingnoob6y ago

Whew, dodged that bullet, looks like I'm not using any of the affected drives. Lucky me, for now.

paggle6y ago

Yikes! This is why when I built my home NAS I used five different drives and manufacturers.
