If you want to go fast & save NAND lifetime, use append-only log structures.
If you want to go even faster & save even more NAND lifetime, batch your writes in software (i.e. a ring buffer with a natural back-pressure mechanism) and then serialize them with a single writer into an append-only log structure. Many newer devices have something like this at the hardware level, but your block size is still a constraint when batching in hardware. If you batch in software, you can hypothetically write multiple logical business transactions per block I/O. When your physical block size is 4 KB and your logical transactions average 512 bytes of data, you would otherwise be leaving a lot of throughput on the table.
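To make the waste concrete, here is a back-of-the-envelope sketch in Go (the 4 KB / 512 B figures are the comment's example; `utilization` is a made-up helper for illustration):

```go
package main

import "fmt"

// utilization returns the fraction of each physical block carrying payload
// when writing one transaction per block I/O vs. batching whole transactions
// into each block.
func utilization(blockSize, txSize int) (single, batched float64) {
	single = float64(txSize) / float64(blockSize)
	batched = float64((blockSize/txSize)*txSize) / float64(blockSize)
	return
}

func main() {
	// the comment's numbers: 4 KB physical blocks, 512 B transactions
	single, batched := utilization(4096, 512)
	fmt.Printf("one tx per block: %.1f%% of write bandwidth used\n", single*100)
	fmt.Printf("batched:          %.1f%% of write bandwidth used\n", batched*100)
}
```

Unbatched, seven eighths of every block write is padding; batching eight 512 B transactions per 4 KB block recovers all of it.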
Going down 1 level of abstraction seems important if you want to extract the most performance from an SSD. Unsurprisingly, the above ideas also make ordinary magnetic disk drives more performant & potentially last longer.
In particular, the filesystem tends to undo a lot of the benefits you get from log-structuring unless you are using a filesystem designed to keep your files log-structured. Using huge writes definitely still helps, though.
A paper that I really like goes deeper into this: http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
Edit: I had originally said "designed for flash" instead of "designed to keep your files log-structured." F2FS is designed for flash, but in my testing does relatively poorly with log-structured files because of how it works internally.
Edit 2: de-googled the link. Thank you for pointing that out.
In either case, the advice given in the article and by the OP is filesystem agnostic.
In my testing of these ideas, I've been able to push over 2 million transactions per second (~1 KB per transaction) to a Samsung 960 Pro. For reference, it's rated for 2.1 GB/s sequential writes, so I've got it pretty much 100% saturated.
The implementation for something like this is actually really underwhelming once you figure out how to put all the pieces together. I assembled this prototype (also a key-value store) using .NET 5, LMAX Disruptor, and a splay tree implementation I copied from somewhere via Google. The hardest part was figuring out how to wait for write completion on the caller side (multiple calling threads are ultimately serialized into a single worker thread via the Disruptor). It turns out that busy-waiting for a few thousand cycles followed by a yield to the OS is a pretty good trick: you just do a while(true) over a completion flag on the transaction object, which is set en masse by the handling thread after the write goes to disk. Batch sizes are determined dynamically based on how long the previous batch took to write. In practice, I never observed a batch that took longer than 2-3 milliseconds on my 960 Pro. Max batch size is 4096, and it is permanently full when 100% loaded. A full batch = a nice big I/O to disk.
I've always been told, "just treat SSDs like slow, permanent memory".
it's really about linking to the tutorial and papers it links at the end, which is something from 2014
And that was discussed here 6 years ago: https://news.ycombinator.com/item?id=9049630
I'd be more interested in what the trends in SSD behaviour are. It seems SSDs have bigger and bigger DRAM caches, and wear ceased to be an issue many years ago, so there's not much payoff in the write-side advice of the article.
On a Windows server we were having SSD performance issues where sequential reads were often down to 100 MB/s. It was kind of confusing, but we tried all sorts of ways of copying with the same result. I eventually tested the drive with a fragmentation tool: fragmentation was really high at 80%, but most importantly the problem files had so many fragments that they were tending towards 4K I/O reads.
What I did was move all the files to another drive, force-trimmed the drive and gave it several hours to sort itself out, then copied them back, and performance was restored to 550 MB/s as would be expected.
I wrote a quick Go program to test the sequential read speed of all files across all the drives, and I found plenty of files where performance was degraded. This was across a range of SSDs I had, SATA and NVMe, from differing vendors. I suspect this is a bigger problem than most people realise: normal use absolutely can get the drive into a badly performing state, and TRIM won't fix it. Very few people expect that the drive will degrade down to its 4K I/O speed on a sequential copy, but it apparently can.
https://www.usenix.org/system/files/conference/inflow14/infl...
> Log-structured applications and file systems have been used to achieve high write throughput by sequentializing writes. Flash-based storage systems, due to flash memory’s out-of-place update characteristic, have also relied on log-structured approaches. Our work investigates the impacts to performance and endurance in flash when multiple layers of log-structured applications and file systems are layered on top of a log-structured flash device. We show that multiple log layers affect sequentiality and increase write pressure to flash devices through randomization of workloads, unaligned segment sizes, and uncoordinated multi-log garbage collection. All of these effects can combine to negate the intended positive effects of using a log. In this paper we characterize the interactions between multiple levels of independent logs, identify issues that must be considered, and describe design choices to mitigate negative behaviors in multi-log configurations.
This is pure speculation, but there had to be a period during the mass transition to SSDs when engineers asked: OK, how do we make the hardware compatible with software that is, for the most part, expecting hard disk drives, and just have it behave like a really fast HDD?
So there's almost certainly some non-zero amount of code out there in the wild that is, or was, doing some very specific write-optimized routine that one day just started performing 10 to 100 times faster, and, because of the nature of software, may still be out there today doing that same routine.
I don't know what that would look like, but my guess would be that it would have something to do with average sized write caches, and those caches look entirely different today or something.
And today, there's probably some SSD specific code doing something out there now, too.
The canonical case is minimize time to load a level. Keep that level’s assets contiguous. And maybe duplicate data that is shared across levels. It’s a trade off between disc space and load time.
I’m not familiar with major tricks for improving after a disc is installed to drive. (PS4 games always streamed data from HDD, not disc.)
Even consoles use different HDD manufacturers. So it’d be pretty difficult to safely optimize for that. I’m sure a few games do. But it’s rare enough I’ve never heard of it.
But, fun to read and think about.
Unless you're writing desktop software or your application behaves in a way where you have actually selected the particular hardware components (most of us in cloud hosting don't do this), you probably don't [need to] care.
What every programmer should know about solid-state drives - https://news.ycombinator.com/item?id=9049630 - Feb 2015 (31 comments)
So I think that unless this "every programmer" is a database storage engine developer (not too many of them, I guess), their main concern would mostly be: how close is my SSD to that magical point where it has to be cloned and replaced before shit hits the fan?
Note though that memory-use metrics on macOS can be misleading. Make sure that you're seeing what's actually there.
Apple said the heavy-write bug was due to misreporting, and it was fixed (so they say).
I do think you should pay attention to it from time to time. iCloud sync, Spotlight, and Safari-heavy tabs are all known to cause heavy paging in some corner cases. You might end up having a TB of data written for no apparent reason. Apple used to ship their MacBooks with MLC; on a 512GB MLC drive you could do 500 TBW without problems, which is ~13 years of usage if you write 100GB per day. Not sure about the M1 machines.
If you are doing a lot of dev staging or video and photo editing, these drives will fail quite quickly, in the space of 2-3 years. Although some would argue MacBook Airs are not made for those tasks, and that is especially true if you have 8GB of RAM and 256GB of NAND.
These are all reasons SSDs are much more pleasant to work with than old platter disks.
I've turned on plenty of cell phones that hadn't been charged or powered on for a couple of years and everything worked normally. Same with thumb drives I've picked up after years.
I mean, anything can fail after three months. Your statement doesn't really add anything without stating the failure rates. For all I know the failure rate could be less than that of physical hard drives.
Outside of that narrow scenario, the three months figure is wildly wrong and should not be repeated. Lower temperatures, a consumer drive, and not having used up 100% of the write endurance will all drastically lengthen data retention.
(However, under no circumstances should you trust a cheap USB thumb drive to retain your data. Those tend to use lower-grade flash memory and lower-quality controllers. If you need an external device to reliably cart around data, shop for a "portable SSD", not a "USB flash drive".)
> A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller.
Does the controller read the partition table to decide that the space beyond the logical partition is safe to use as scratch space?
So if you partition the entire thing, but just never write to the full disk (you never use all the space), that also works as overprovisioning.
Partitioning just forces that to happen.
Near the beginning they talk about how targeting the PlayStation 5, which has an SSD, drastically changed how they went about making the game.
In short, the quick data transfer meant they were CPU bound rather than disk bound and could afford to have a lot of uncompressed data streamed directly into memory with no extra processing before use.
And where did the word "drive" come from? I thought it referred to motors that spin the media, which SSDs also do not have.
Then there's the matter of how much data is in the queue, rather than how many commands are queued. Imagine a 4 TB SSD using 512Gbit TLC dies, and an 8-channel controller. That's 64 dies with 2 or 4 planes per die. A single page is 16kB for current NAND, so we need 2 or 4 MB of data to write if we want to light up the whole drive at once, and that much again waiting in the queue to ensure the drive can begin the next write as soon as the first batch completes. But you can often hit a bottleneck elsewhere (either the PCIe link, or the channels between the controller and NAND) before you have every plane of every die 100% busy.
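The sizing arithmetic from that paragraph, as a tiny Go sketch (the die/plane/page numbers are the comment's hypothetical drive, not a specific product):

```go
package main

import "fmt"

// inflightKB returns how much write data is needed to keep every plane of
// every die busy at once: one full NAND page per plane.
func inflightKB(dies, planesPerDie, pageKB int) int {
	return dies * planesPerDie * pageKB
}

func main() {
	// the comment's example: 4 TB drive, 64 x 512 Gbit TLC dies, 16 kB pages
	for _, planes := range []int{2, 4} {
		kb := inflightKB(64, planes, 16)
		fmt.Printf("%d planes/die: %d MB per batch, plus as much again queued\n",
			planes, kb/1024)
	}
}
```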
If you're working with small files, then your filesystem will be producing several small IOs for each chunk of file contents you read or write from the application layer, and many of those small metadata/fs IOs will be in the critical path, blocking your data IOs. So even though you can absolutely hit speeds in excess of 3 GB/s by issuing 2MB write commands one at a time to a suitably high-end SSD, you may have more difficulty hitting 3 GB/s by writing 2MB files one at a time.
Typically the SSDs with DRAM have a ratio of 1GB DRAM per TB of flash.
SLC caching is using a portion of the flash in SLC mode, where it stores 1 bit per cell rather than the typical 2-4 (2 for MLC, 3 for TLC, 4 for QLC) in exchange for higher performance. SLC cache size varies wildly. Some SSDs allocate a fixed size cache, some allocate it dynamically based on how much free space is available. It can potentially be 10s of GBs on larger SSDs.
Although they started removing it entirely for NVMe SSDs, I guess the direct transfer speed is enough to not need a cache at all.
Drives that include less than this amount of DRAM show reduced performance, usually in the form of lower random read performance because the physical address of the requested data cannot be quickly found by consulting a table in DRAM and must be located by first performing at least one slow NAND read.
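The 1 GB per TB rule of thumb falls out of the arithmetic: one mapping entry per logical page across the whole drive. A toy back-of-the-envelope in Go (the 4 KiB page and 4-byte entry are typical values, not from any specific controller):

```go
package main

import "fmt"

// mappingTableBytes estimates the size of a flat logical-to-physical
// mapping table: one entry per logical page across the whole drive.
func mappingTableBytes(flashBytes, pageSize, entrySize int64) int64 {
	return flashBytes / pageSize * entrySize
}

func main() {
	const flash = int64(1) << 40 // a 1 TiB drive
	table := mappingTableBytes(flash, 4096, 4)
	fmt.Printf("1 TiB of flash -> %d GiB of DRAM for the mapping table\n", table>>30)
}
```

Drop the DRAM and every lookup that misses whatever small cache the controller keeps costs an extra NAND read, which is why DRAM-less drives lose random read performance first.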
The actual physical address on the storage chip and the physical address from the operating system's perspective don't have much to do with one another. For hard drives, "un-partitioned space" means that there is a physical "chunk of metal" that is unused.
However, that's not the case for SSDs. SSDs dynamically remap "OS-physical" block numbers to whatever they want. (Preferably addresses that have never been used before or that have been discarded/trimmed. If there aren't any available, perhaps to the address that was previously used for the same block number.)
I'm replying to the whole of the comments on this article. The write amplification problem gets worse as the number of "free" sectors/blocks goes down. Many solutions have been presented that don't allocate X% of the drive, but I'm not sure that any of them let the SSD's controller know those blocks aren't allocated.
For that to happen, the OS has to have TRIM support, AND the block in question has to be on a volume that the OS is managing.
My worry is that if you have a blank partition, it's not being actively managed by anything, and thus isn't going to be TRIMed, and thus the SSD doesn't know the blocks are free for use.
Thus, leaving an unpartitioned area isn't going to help.
However, random read performance is only somewhere between a third and half as fast as sequential, compared to a magnetic disk where it's often a tenth as fast.
I suspect it has something to do with prediction on the controller, but I'm also not going to confidently spew a bunch of bullshit about drive architecture, unlike this article.
The numbers involved were insane, and I played with various scenarios: with/without compression (a MessagePack feature), with/without the typeless serializer (a MessagePack feature), with/without async, and the difference between sync vs async and forcing disk flushes. I also weighed the difference between writing one fat file (append only) versus millions of small files, and the difference between using .NET streams and File.WriteAllBytes (a C# feature, an all-in-memory operation, good for small writes, bad for bigger files or async serialization + writing). I also played with the number of objects involved (100K, 1M, 10M, 50M).
I cannot remember all the numbers involved, but I still have the code for all of it somewhere, so maybe I can write a blogpost about it. But I do remember being utterly stunned by how fast it actually was to freeze my application state to disk and to thaw it again (the class name was Freezer :p).
The whole reason was, I started using ZFS and read up a bit about how it works. I also have some idea of how SSDs work, some idea of how serialization and writing to disk work (streams etc.), and a rough idea of how MySQL, Postgres, and SQL Server save their data files to disk and what kinds of compromises they make. So one day, frustrated with my data access layers, it dawned on me to try building my own storage engine for fun. I started by generating millions of objects in memory, which I then serialized with MessagePack using a Parallel.ForEach (a C# feature) to a Samsung 970 Evo Plus to see how fast it would be. It blew my mind, and I still don't trust that code enough to use it in production, but it does work. Another reason I tried it was that at work we have some Postgres tables with 60M+ rows that are getting slow, and I'm convinced we have a bad data model plus too many indexes, and that 60M rows is not too much. (Since then we've partitioned the hell out of it in multiple ways, but that is a nightmare on its own, since I still think we sliced the data the wrong way, against where the data has natural boundaries; time will tell who was right.)
So I do believe there is a space in the industry where SSDs, paired with certain file systems, using certain file sizes and chunking, will completely leave SQL databases in the dust, purely through the mechanics of how those things work together. I haven't put my code out in public yet and have only told one other dev about it, mostly because it is basically sacrilege to go against the grain in our community, and saying "I'm going to write my own database engine" sounds nuts even to me.
I encourage anyone to go write their own little storage engine for fun. It will force you to think about I/O, parallelization, serialization, streams, and backwards compatibility.
It is really fun (and not even that hard), and even if it works I still recommend against using it in production, but it will take away some of the magic of how databases work and reveal the real challenge. The really difficult part for me is building a query language, parser, and optimizer (like SQL), and handling concurrent writes properly. It is still difficult for me to comprehend how something like a SQL query string gets converted into instructions that pull data out of a single file (say, an SQLite file), where that file's structure on disk can be messy and unknowable upfront when SQLite gets compiled. You essentially have a dynamic data structure, and you are able to slice and order the data however you want; it is not known at compile time with hard-coded rules. So I think in that regard SQL adds a ton of value, and SQLite is still my go-to for most flat-file scenarios.