> we run a fairly comprehensive set of block-level tests using fio, consisting of both sequential and random asynchronous reads and writes straight to the disk. Then we throw a few timed runs of the venerable dd program at it.
Running dd as a benchmark is a major red flag. It shows that they didn't know what they were doing with fio and didn't trust its results. They later started using IOzone and a custom-written tool to accomplish things they should have done with fio in their initial testing.
They also did not mention pre-conditioning the drives or ensuring that their tests ran long enough to reach a steady state. That is one of the most important aspects of enterprise SSD testing, and they would have known it if they'd consulted any outside resources on the subject instead of making up their own testing guidelines from a position of extreme ignorance about the fundamentals of the hardware they were using and the details of their own workload.
They really should stop calling any of their tests "comprehensive".
This is not a comprehensive guide to testing SSDs, it’s the story of what the author went through when trying to test SSDs. It’s well written, and the author seemed to really engage with the topic, describing all the setbacks he hit and the research he did. I did not think he presented himself as an expert, just a software engineer tasked with upgrading their SSDs. And who knows, maybe this was only a 20% time project.
There are a lot of blog posts on HN that have much less actual content and where the authors have much less of a clue, yet often the response is overwhelmingly positive because someone took the time to write it up. You should really be more charitable here.
And to address the calls for “outside experts”: If everyone called in outside experts for everything hardware (or software) related, we software engineers would never get to do anything cool or learn some new framework. We’d just be watching an outside expert do their thing. And outside “experts” are not necessarily better; often they might just sell themselves better. And who is going to check their work if the knowledge is all outsourced?
I think it’s great that the BBC lets their engineers do this and learn along the way, and a place where that is possible sounds like a nice place to work. It’s not like they had any downtime or anything because of this.
You completely misinterpreted (and misquoted) me on that one. I wasn't implying that they should hire a consultant for this kind of thing, but they should at least have bothered to read anything about the methodology used by SSD reviewers or the industry standard storage testing methodology freely published by organizations like SNIA. It's clear the BBC guys didn't even spend an afternoon trying to read up on how to evaluate SSD performance; they just jumped in and started re-inventing the wheel, hitting all the foreseeable problems along the way. It looks like they now have a clue and have learned a lot from the process, but this is not how you should handle this kind of upgrade.
We're acutely aware that we've still got much to learn in this space, so if there are thoughts you have on how we could do better we're all ears.
Finally, while I assured you it wasn't a PR piece we're always looking for engineers in this area (and across the whole BBC) so if you'd be interested in helping us improve, get in touch.
> We also looked up whether our HBA used TRIM in its current configuration. It turns out, in RAID mode, the HBA did not support TRIM. We did do some trim-enabled testing with a different machine, but these results are hard to compare fairly. In any case, we can't currently enable TRIM on our production systems.
In our experience SSD write performance goes to sh*t if you don't regularly TRIM them.
Running fstrim once a day is enough to keep them healthy.
RAID cards not passing TRIM is a big problem for us too...
(Experience from day job at Hosting Provider)
Interesting, is that because of the load? It seems "modern" SSDs have GC good enough that TRIM isn't quite necessary anymore to ensure good performance under consumer loads.
> RAID cards not passing TRIM is a big problem for us too...
Are there NVMe RAID cards? I assume they'd necessarily pass the command along considering deallocate* is just one parameter/option of the DATA SET MANAGEMENT command, or do RAID cards just drop the entire command?
A drive has no way to tell whether the filesystem is using a given block or not; TRIM is the way for the filesystem to tell it. So I would imagine the GC you're referring to is working on the blocks marked with TRIM.
BTW, besides running fstrim from cron on Linux, you can also mount the drive with the discard flag, so the filesystem sends a TRIM command when files are deleted.
There are no hardware RAID solutions for NVMe, though there are now several hardware platforms supporting software RAID for NVMe devices in their motherboard firmware so you can boot from a NVMe RAID array. As with any other RAID solution, translating trim/unmap/deallocate commands takes a bit of effort, and less mature NVMe RAID solutions don't necessarily bother.
Around 15 years ago my company did a Linux distribution on CDs: KRUD. It was updated monthly, and we had something like 400 subscribers. For various reasons we burned these CDs in house on a cluster I built.
We would burn, eject, read and checksum, and if the read test succeeded we would ship it out. We found some users with some discs had problems reading them. We contacted these users and paid them to return the CDs and did further testing on them.
Our initial test used dd, and we found that the discs that were not obviously damaged in shipping would tend to pass tests on some of our CD-ROM drives but fail on others. Even when they did succeed, they would tend to take longer than normal.
I wrote a new test program that instead of using dd directly used SCSI read commands, and timed every one. It would then count the number of reads that were "slow" (like 2x normal) and those that were "really slow" (like 5x), and if these got over a certain threshold we would throw away the disc.
Being able to time the raw operations was incredibly useful, and seems like it could have shown the authors of this paper problems before being deployed to production.
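The pass/fail logic here is essentially threshold counting on per-read latencies. A minimal sketch of that idea in Python, with made-up thresholds (2x and 5x the median read time, and the rejection cutoffs) standing in for whatever the original tool actually used:

```python
# Classify a disc from its per-read timings. All thresholds are
# illustrative assumptions, not the original tool's values.
from statistics import median

def classify_disc(read_times, slow_factor=2.0, very_slow_factor=5.0,
                  max_slow=50, max_very_slow=5):
    """Return True if the disc passes, False if it should be discarded."""
    baseline = median(read_times)  # "normal" read time for this drive
    slow = sum(1 for t in read_times if t > slow_factor * baseline)
    very_slow = sum(1 for t in read_times if t > very_slow_factor * baseline)
    return slow <= max_slow and very_slow <= max_very_slow

# A disc whose reads are uniformly fast passes...
assert classify_disc([0.01] * 1000)
# ...while one with a burst of 6x-slow reads is rejected.
assert not classify_disc([0.01] * 990 + [0.06] * 10)
```

Using the median as the baseline keeps a handful of outliers from dragging "normal" upward, which matters when the whole point is to spot those outliers.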
Except, they didn't really seem to do very thorough testing of the drives. Running stress testing on a 1TB drive for an hour seems pretty short.
Also in my above job we did hosting. We found that if we burned in disks by reading/writing to them 10 times ("badblocks -svw -p 10"), we would almost never experience drive failures on the Hitachi drives we were using. If we didn't do this, the drives would have a fairly high chance of falling out of the RAID array in production.
As drive sizes increased from 20GB to 200GB to 1TB, these tests started taking weeks to complete. But, they were totally worth it.
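Back-of-the-envelope arithmetic shows why those burn-ins stretched into weeks. Assuming a sustained ~100 MB/s (a guess for drives of that era) and that badblocks -w sweeps the whole device once to write and once to read back each of its four test patterns, per pass:

```python
# Rough burn-in duration estimate. Throughput and the pass structure
# (4 patterns, write + read-back each) are assumptions for illustration.
def burn_in_hours(size_bytes, mb_per_s=100, passes=10, patterns=4):
    sweeps = passes * patterns * 2           # full-device write + read sweeps
    seconds = sweeps * size_bytes / (mb_per_s * 1e6)
    return seconds / 3600

print(burn_in_hours(20e9))   # 20 GB drive: a few hours
print(burn_in_hours(1e12))   # 1 TB drive: hundreds of hours, i.e. weeks
```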
Why don't operating systems treat SSDs more like flash memory, and why doesn't the file system cooperate with the underlying hardware instead of pretending it's a disk? For home use that may even work, but in a demanding environment the extra complexity will invariably fail.
This is a genuine question, I'm an amateur here.
1. Each OS that wants to use the drive needs a compatible implementation of the FTL. Consumer systems always have at least two operating systems in play (UEFI counts for these purposes). Enterprise systems are where you will actually find non-boot data-only drives.
2. Flash memory changes. The FTL needs very different parameters depending on whether you're using Toshiba flash or Samsung flash, and even depending on whether you're using last year's Toshiba flash or the stuff they're manufacturing today.
These aren't insurmountable problems, but they're enough to keep such products confined to a small niche. Instead, we're seeing a trend of SSDs accepting optional hints that allow them to perform the kinds of optimizations you'd expect from a fully host-managed SSD. The ATA TRIM command was just the tip of this iceberg.
The simple reason is that SSDs themselves expose a regular HD interface and handle a lot of the flash-memory-related work internally. For example, without TRIM support (which early SSDs did not have) there is no 'erase' command the OS can send to an SSD.
With that in mind, SSDs also have controllers on them that map the blocks the OS sees to actual flash blocks (scattered across the memory chips). So when the OS writes to block 1 it may actually write to block 15 internally on the SSD, and block 2 might land on block 4002. Combine this with caching and other various details on the SSD side, and it leaves little predictable behavior for the OS to exploit.
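That logical-to-physical indirection can be sketched as a toy flash translation layer. Real FTLs are vastly more complex (wear levelling, caching, parallel channels, actual garbage collection), so this is purely illustrative:

```python
# Toy FTL: logical block addresses map to arbitrary physical blocks,
# and every overwrite lands on a fresh physical block (copy-on-write),
# leaving the old copy stale until it is garbage-collected and erased.
class ToyFTL:
    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # erased physical blocks
        self.mapping = {}                      # logical -> physical
        self.blocks = {}                       # physical -> data
        self.stale = []                        # old copies awaiting erase

    def write(self, lba, data):
        if lba in self.mapping:
            self.stale.append(self.mapping[lba])  # old copy is now garbage
        phys = self.free.pop(0)
        self.mapping[lba] = phys
        self.blocks[phys] = data

    def read(self, lba):
        return self.blocks[self.mapping[lba]]

ftl = ToyFTL(num_physical=16)
ftl.write(1, b"a")   # logical block 1 lands on physical block 0
ftl.write(2, b"b")
ftl.write(1, b"c")   # rewriting block 1 uses a *new* physical block
print(ftl.mapping)   # logical 1 no longer points at physical 0
print(ftl.stale)     # the first copy of block 1 is stale
```

The upshot for the OS: even if it knew the physical layout yesterday, today's writes have silently rearranged it.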
If they'd been starting from scratch, while thinking about modern SSDs, it's quite likely they wouldn't have built an application load tester using files containing only dots.
But as it was an existing system, it didn't get the same amount of attention.
Just to confirm, I have seen the behaviour described herein, with write-back caching making an enormous difference with the Samsung EVO product in particular.
https://youtu.be/RqaZTwW_X2o?t=511
For sustained write performance you need the 960 PRO.
For example on a 4kb sync write with 16 threads test, the 960 Evo cannot do more than 1000 iops. In comparison the Intel P4800X (Optane) does friggin 500 000 iops on the same test. That is a 500X difference.
https://forums.servethehome.com/index.php?threads/did-some-w...
I noticed they didn't mention any brands by name though, why is that?
[0] http://www.bbc.co.uk/editorialguidelines/guidelines/editoria...
EDIT: fixed policy name and added link
See, for example:
- The constant mention of speaking to people 'over Skype' on the News
- Publicization of Twitter hashtags on Questiontime and other programs
- Hours worth of Top Gear footage (and the entire Arctic Special) that were effectively Toyota Hilux advertisements
As a public service broadcaster in the UK, the BBC must be very careful about naming specific brands and products due to rules laid out in their legal remit (and fear of legal action if from a party that feels unfairly disadvantaged by a competitor getting a good mention or them getting a bad one).
This is why Blue Peter always uses "sticky backed plastic" instead of "sellotape" or "scotch tape", and why people on BBC shows "vacuum" where the common parlance is "to hoover" (Hoover being a brand name that got verbed, like Google -> to google).
Hang on! Surely, "to google" refers to performing a search specifically with Google, right?
We may be writing software, but without a working knowledge of hardware it's not worth much!