IO Devices and Latency (opens in new tab)

(planetscale.com)

443 pointsmilar1y ago153 comments

153 comments

96 comments · 28 top-level

bddicken1y ago· 28 in thread

Author of the blog here. I had a great time writing this. By far the most complex article I've ever put together, with literally thousands of lines of js to build out these interactive visuals. I hope everyone enjoys.

tealpod1y ago

Your style of explaination and animation are exceptional.

jasonthorsness1y ago

The visuals are awesome; the bouncing-box is probably the best illustration of relative latency I've seen.

Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right? I would think if your recovery is 10 minutes for example, even if each of three servers is guaranteed to fail once in the month, I think it's already like 1 in two million? and if it's a 1% chance of failure in the month failure of all three overlapping becomes extremely unlikely.

Thought I would note this because one-in-a-million is not great if you have a million customers ;)

bddicken1y ago

> Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right?

Absolutely. Our actual durability is far, far, far higher than this. We believe that nobody should ever worry about losing their data, and thats the peace of mind we provide.

1 more reply

the_arun1y ago

Kudos to whoever patiently & passionately built these. On an off topic - This is a great perspective for building realistic course work for middle & high school students. I'm sure they learn faster & better with visuals like these.

1 more reply

mixermachine1y ago

1 in a million is the probability that all three servers die in one months, without swapping out the broken ones. So at some point in the month all the data is gone.

If you replace the failed(or failing) node right away, the failure percentage goes down greatly. You would likely need the probability of a node going done in 30 minutes time space. Assuming the migration can be done in 30 min.

(i hope this calculation is correct)

If 1% probability per month then 1%/(43800/30) = (1/1460)% probability per 30 min.

For three instances: (1/1460)% * (1/1460)% * (1/1460)% = (1/3112136000)% probability per 30 min that all go down.

Calculated for one month (1/3112136000)% * (43800/30) = (1/2131600)%

So one in 213 160 000 that all three servers go down in a 30 minute time span somewhere in one month. After the 30 minutes another replica will already be available, making the data safe.

I'm happy to be corrected. The probability course was some years back :)

1 more reply

b0rbb1y ago

The animations are fantastic and awesome job with the interactivity. I find myself having to explain latency to folks often in my work and being able to see the extreme difference in latencies for something like a HDD vs SSD makes it much easier to understand for some people.

Edit: And for real, fantastic work, this is awesome.

bddicken1y ago

Thank you! The visuals definitely add something special to this post specifically since time is a big element in explaining latencies.

1 more reply

zalebz1y ago

The level of your effort really shows through. If you had to ballpark guess, how much time do you think you put in? and I realize keyboard time vs kicking around in your head time are quite different

bddicken1y ago

Thank you! I started this back in October, but of course have worked on plenty of other things in the meantime. But this was easily 200+ hours of work spread out over that time.

If this helps as context, the git diff for merging this into our website was: +5,820 −1

dormando1y ago

Half on topic: what libs/etc did you use for the animations? Not immediately obvious from the source page.

(it's a topic I'm deeply familiar with so I don't have a comment on the content, it looks great on a skim!) - but I've been sketching animations for my own blog and not liked the last few libs I tried.

Thanks!

bddicken1y ago

I heavily, heavily abused d3.js to build these.

1 more reply

anymouse1234561y ago

I love this kind of datavis.

We are generally bad at internalizing comparisons at these scales. The visualizations make a huge difference in building more detailed intuitions.

Really nice work, thank you!

bddicken1y ago

Yeah I think the visuals really add to this one, especially given the time element of explaining latencies.

hakaneskici1y ago

Great work! Thank you for making this.

This is beautiful and brilliant, and also is a great visual tool to explain how some of the fundamental algorithms and data structures originate from the physical characteristics of storage mediums.

I wonder if anyone remembers the old days where you programmed your own custom defrag util to place your boot libs and frequently used apps to the outer tracks of the hard drive, so they are loaded faster due to the higher linear velocity of the outermost track :)

AlphaWeaver1y ago

Were you at all inspired by the work of Bartosz Ciechanowski? My first thought was that you all might have hired him to do the visuals for this post :)

bddicken1y ago

Bartosz Ciechanowski is incredible at this type of stuff. Sam Rose has some great interactive blogs too. Both have had big hits here on HN.

hodgesrm1y ago

I was delighted to see your models of tape operations as I used it a lot in the COBOL days.

For reasons discussed in your article we would arrange tape processing as much as possible in sequential scans, something at which COBOL was quite excellent. One of the common performance problems was when there was a mismatch between a slower COBOL processing speed that could not keep up with the flow of blocks coming off the drive head.

In this case you would see the drive start to overshoot as it read more blocks than the COBOL program could handle. The drive would begin a painful jump forward/spool backward motion which made the performance issue quite visible. You would then eyeball the code to understand way the program was not keeping up, correct, and resubmit until the motion disappeared.

logsr1y ago

Amazing presentation. It really helps to understand the concepts.

The only add is that it understates the impact of SSD parallelism. 8 Channel controllers are typical for high end devices and 4K random IOPS continue to scale with queue depth, but for an introduction the example is probably complex enough.

It is great to see PlanetScale moving in this direction and sharing the knowledge.

bddicken1y ago

Thank you for the info! Do you have any good references on this for those who want to learn more?

1 more reply

alexellisuk1y ago

Hi, what actually are _metal_ instances that are being used when you're on EC2 that have local NVME attached? Last time I looked, apart from the smallest/slowest Graviton, you have to spend circa 2.3k USD/mo to get a bare-metal instance from AWS - https://blog.alexellis.io/how-to-run-firecracker-without-kvm...

lizztheblizz1y ago

Hi there, PS employee here. In AWS, the instance types backing our Metal class are currently in the following families: r6id, i4i, i3en and i7ie. We're deploying across multiple clouds, and our "Metal" product designation has no direct link to Amazon's bare-metal offerings.

tombert1y ago

The visualizations are excellent, very fun to look at and play with, and they go along with the article extremely well. You should be proud of this, I really enjoyed it.

bddicken1y ago

Thank you!

layer81y ago

I don’t see any animations on Safari. Also, I’d much prefer a variable-width font, monospace prose is hard to read. While I can use Reader Mode, that removes the text coloring, and would likely also hide the visuals (if they were visible in the first place).

bddicken1y ago

Interesting! Any errors you can report? Should work in safari but maybe you have something custom going on, or an older version?

1 more reply

inetknght1y ago

I don't see a single visual. I don't use the web with javascript. Why not embed static images instead or in addition?

bddicken1y ago

The visuals add a lot to this article. A big theme throughout is latency, and the visual help the reader see why tape is slower than an hdd, which is slower than an ssd, etc. Also, its just plain fun!

I'm curious, what do you do on the internet without js these days?

1 more reply

vel0city1y ago

They're not just static images or animations, they're interactive widgets.

bob10291y ago· 15 in thread

I've been advocating for SQLite+NVMe for a while now. For me it is a new kind of pattern you can apply to get much further into trouble than usual. In some cases, you might actually make it out to the other side without needing to scale horizontally.

Latency is king in all performance matters. Especially in those where items must be processed serially. Running SQLite on NVMe provides a latency advantage that no other provider can offer. I don't think running in memory is even a substantial uplift over NVMe persistence for most real world use cases.

crazygringo1y ago

> I've been advocating for SQLite+NVMe for a while now.

Why SQLite instead of a traditional client-server database like Postgres? Maybe it's a smidge faster on a single host, but you're just making it harder for yourself the moment you have 2 webservers instead of 1, and both need to write to the database.

> Latency is king in all performance matters.

This seems misleading. First of all, your performance doesn't matter if you don't have consistency, which is what you now have to figure out the moment you have multiple webservers. And secondly, database latency is generally miniscule compared to internet round-trip latency, which itself is miniscule compared to the "latency" of waiting for all page assets to load like images and code libraries.

> Especially in those where items must be processed serially.

You should be avoiding serial database queries as much as possible in the first place. You should be using joins whenever possible instead of separate queries, and whenever not possible you should be issuing queries asynchronously at once as much as possible, so they execute in parallel.

bob10291y ago

The entire point is to avoid the network hop.

Application <-> SQLite <-> NVMe

has orders of magnitude less latency than

Application <-> Postgres Client <-> Network <-> Postgres Server <-> NVMe

> You should be avoiding serial database queries as much as possible in the first place.

I don't get to decide this. The business does.

4 more replies

conradev1y ago

Until you hit the single-writer limitation in SQLite, you do not need to spend more CPU cycles on Postgres

2 more replies

dangoodmanUT1y ago

The SQLite filesystem is laid out to hedge against HDD defragging. It wouldn't benefit as much as changing it to a more modern layout that's SSD-native, then using NVMe

cynicalsecurity1y ago

Sqlite doesn't work super well with parallelism in writing. It supports it, yes, but in a bit clunky way and it still can fail. To avoid problems with parallel writing besides setting a specific clunky mode of operations a trick of using a single thread for writing in an app can be used. Which usually makes the already complicated parallel code slightly more complicated.

If only one thread of writing is required, then SQLite works absolutely great.

bob10291y ago

> If only one thread of writing is required, then SQLite works absolutely great.

The whole point of getting your commands down to microsecond execution time is so that you can get away with just one thread of writing.

Entire financial exchanges operate on this premise.

1 more reply

jstimpfle1y ago

I still measure 1-2ms of latency with an NVMe disk on my Desktop computer, doing fsync() on a file on a ext4 filesystem.

Update: about 800us on a more modern system.

rbranson1y ago

Not so sure that's true. This is single-threaded direct I/O doing a fio randwrite workload on a WD 850X Gen4 SSD:

    write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4412MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=335, avg= 3.42, stdev= 1.65
    clat (nsec): min=932, max=24868k, avg=49188.32, stdev=65291.21
     lat (usec): min=29, max=24880, avg=52.67, stdev=65.73
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   35],
     | 30.00th=[   37], 40.00th=[   38], 50.00th=[   40], 60.00th=[   43],
     | 70.00th=[   53], 80.00th=[   60], 90.00th=[   70], 95.00th=[   84],
     | 99.00th=[  137], 99.50th=[  174], 99.90th=[  404], 99.95th=[  652],
     | 99.99th=[ 2311]

3 more replies

madisp1y ago

I'm not an expert, but I think an enterprise NVMe will have some sort of power loss protection so it can afford to fsync to ram/caches as they will be written down in a power loss. Consumer NVMe drives afaik lack this so fsync will force the file to be written.

dogben1y ago

I believe that's power saving in action. A single operation at idle is slow, the drive needs time to wake from idle.

dzr00011y ago

What drive is this and does it need a trim? Not all NVMe devices are created equal, especially in consumer drives. In a previous role I was responsible for qualifying drives. Any datacenter or enterprise class drive that had that sort of latency in direct IO write benchmarks after proper pre-conditioning would have failed our validation.

1 more reply

the84721y ago

I assume fsyncing a whole file does more work than just ensuring that specific blocks made it to the WAL which it can achieve with direct IO or maybe sync_file_range.

delamon1y ago

Enterprise NVMe can do fsync much faster than consumer hardware. This is because they can cheat and report successful fsync() before data actually had been flushed to flash. They have backup capacitors which allow them to flush caches in case of power loss, so no data loss.

Here PM983 doing `fio --name=fsync_test --ioengine=sync --rw=randwrite --bs=4k --size=1G --numjobs=1 --runtime=10s --time_based --fsync=1`

  Jobs: 1 (f=1): [w(1)][100.0%][w=183MiB/s][w=46.7k IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=11905: Fri Mar 14 13:34:34 2025
    write: IOPS=39.1k, BW=153MiB/s (160MB/s)(1527MiB/10001msec); 0 zone resets
      clat (nsec): min=1052, max=223288, avg=1606.69, stdev=2345.64
       lat (nsec): min=1082, max=223458, avg=1653.08, stdev=2346.58
      clat percentiles (nsec):
       |  1.00th=[  1128],  5.00th=[  1176], 10.00th=[  1240], 20.00th=[  1320],
       | 30.00th=[  1448], 40.00th=[  1496], 50.00th=[  1528], 60.00th=[  1576],
       | 70.00th=[  1640], 80.00th=[  1720], 90.00th=[  1816], 95.00th=[  1960],
       | 99.00th=[  2576], 99.50th=[  3376], 99.90th=[ 10816], 99.95th=[ 32640],
       | 99.99th=[124416]
     bw (  KiB/s): min=123168, max=190568, per=99.00%, avg=154788.63, stdev=19610.50, samples=19
     iops        : min=30792, max=47642, avg=38697.16, stdev=4902.62, samples=19
    lat (usec)   : 2=95.61%, 4=4.10%, 10=0.19%, 20=0.04%, 50=0.03%
    lat (usec)   : 100=0.02%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=13, max=1238, avg=23.08, stdev= 9.27
      sync percentiles (usec):
       |  1.00th=[   15],  5.00th=[   16], 10.00th=[   16], 20.00th=[   17],
       | 30.00th=[   18], 40.00th=[   25], 50.00th=[   26], 60.00th=[   26],
       | 70.00th=[   26], 80.00th=[   26], 90.00th=[   26], 95.00th=[   27],
       | 99.00th=[   34], 99.50th=[   79], 99.90th=[  101], 99.95th=[  126],
       | 99.99th=[  347]

The same test on SN850X

  Jobs: 1 (f=1): [w(1)][100.0%][w=22.9MiB/s][w=5859 IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=23328: Fri Mar 14 13:35:04 2025
    write: IOPS=5742, BW=22.4MiB/s (23.5MB/s)(224MiB/10001msec); 0 zone resets
      clat (nsec): min=400, max=110253, avg=797.80, stdev=1244.19
       lat (nsec): min=430, max=110273, avg=826.49, stdev=1248.86
      clat percentiles (nsec):
       |  1.00th=[  502],  5.00th=[  540], 10.00th=[  572], 20.00th=[  612],
       | 30.00th=[  644], 40.00th=[  668], 50.00th=[  708], 60.00th=[  748],
       | 70.00th=[  804], 80.00th=[  868], 90.00th=[ 1032], 95.00th=[ 1176],
       | 99.00th=[ 1560], 99.50th=[ 2224], 99.90th=[ 8384], 99.95th=[23424],
       | 99.99th=[66048]
     bw (  KiB/s): min=19800, max=24080, per=100.00%, avg=23004.21, stdev=1039.13, s  amples=19
     iops        : min= 4950, max= 6020, avg=5751.05, stdev=259.78, samples=19
    lat (nsec)   : 500=0.80%, 750=58.72%, 1000=29.04%
    lat (usec)   : 2=10.89%, 4=0.28%, 10=0.18%, 20=0.04%, 50=0.04%
    lat (usec)   : 100=0.01%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=136, max=28040, avg=172.88, stdev=195.00
      sync percentiles (usec):
       |  1.00th=[  145],  5.00th=[  149], 10.00th=[  151], 20.00th=[  151],
       | 30.00th=[  159], 40.00th=[  159], 50.00th=[  159], 60.00th=[  159],
       | 70.00th=[  159], 80.00th=[  161], 90.00th=[  198], 95.00th=[  202],
       | 99.00th=[  396], 99.50th=[  416], 99.90th=[  594], 99.95th=[ 1467],
     | 99.99th=[ 5145]

kev0091y ago

NVMe is just a protocol. There are drives that are absolute shit and others that cost as much as luxury automobiles. In either case not quite DRAM latency because it is expansion bus attached.

1 more reply

sergiotapia1y ago

I had a lot of fun with Coolify running my app and my database on the same machine. It was pretty cool to see zero latency in my SQL queries, just the cost of the engine.

__turbobrew__1y ago· 6 in thread

I think something about distributed storage which is not appreciated in this article:

1. Some systems do not support replication out of the box. Sure your cassandra cluster and mysql can do master slave replication, but lots of systems cannot.

2. Your life becomes much harder with NVME storage in cloud as you need to respect maintenance intervals and cloud initiated drains. If you do not hook into those system and drain your data to a different node, the data goes poof. Separating storage from compute allows the cloud operator to drain and move around compute as needed and since the data is independent from the compute — and the cloud operator manages that data system and draining for that system as well — the operator can manage workload placements without the customer needing to be involved.

rcrowley1y ago

Good points. PlanetScale's durability and reliability are built on replication - MySQL replication - and all the operational software we've written to maintain replication in the face of servers coming and going, network partitions, and all the rest of the weather one faces in the cloud.

Replicated network-attached storage that presents a "local" filesystem API is a powerful way to create durability in a system that doesn't build it in like we have.

__turbobrew__1y ago

Agreed, if you are a mature enough and well funded organization you probably should be using NVME and then run distributed systems on top of the NVMEs to manage replication yourself.

3921y ago

This is where s2.dev could in theory come to the rescue. Able to keep up with the streaming bandwidth, but durable.

wmf1y ago

I assume DRBD still exists although it's certainly easier to use EBS.

maayank1y ago

what do you mean by drains?

rcrowley1y ago

AWS, for one example, provide a feed of upcoming "events" in EC2 in which certain instances will need to be rebooted or terminated entirely due to whatever maintenance they're doing on the physical infrastructure.

If you miss a termination event you miss your chance to copy that data elsewhere. Of course, if you're _always_ copying the data elsewhere, you can rest easy.

ucarion1y ago· 4 in thread

Really, really great article. The visualization of random writes is very nicely done.

On:

> Another issue with network-attached storage in the cloud comes in the form of limiting IOPS. Many cloud providers that use this model, including AWS and Google Cloud, limit the amount of IO operations you can send over the wire. [...]

> If instead you have your storage attached directly to your compute instance, there are no artificial limits placed on IO operations. You can read and write as fast as the hardware will allow for.

I feel like this might be a dumb series of questions, but:

1. The ratelimit on "IOPS" is precisely a ratelimit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

2. Does this save me money? And if yes, is from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?

I see pretty clearly putting storage and compute on the same machine strictly a latency win, because you structurally have one less hop every time. But is it also a throughput-per-dollar win too?

rbranson1y ago

> 1. The ratelimit on "IOPS" is precisely a ratelimit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

The EBS volume itself has a provisioned capacity of IOPS and throughput, and the EC2 instance it's attached to will have its own limits as well across all the EBS volumes attached to it. I would characterize it more like a different model. An EBS volume isn't just just a slice of a physical PCB attached to a PCIe bus, it's a share in a large distributed system a large number of physical drives with its own dedicated network capacity to/from compute, like a SAN.

> 2. Does this save me money? And if yes, is from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?

It might. It's a set of trade-offs.

ucarion1y ago

That makes sense. The weirdness of https://docs.aws.amazon.com/ebs/latest/userguide/ebs-io-char... makes more sense now. Reminds me of DynamoDB capacity units.

the84721y ago

For network-attached storage IOPS limits packets per second, not bandwidth, since IO operations can happen at different sizes (e.g. 4K vs. 16K blocks).

rbranson1y ago

More specific details for EC2 instances can be seen in the docs here: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html...

gozzoo1y ago· 3 in thread

Can someeone share their expirience in creating such diagrams. What libraries and tools can be useful for such interactive diagrams?

bddicken1y ago

For this particular one I used d3.js, but honestly this isn't really the type of thing it's designed for. I've also used GSAP for this type of thing on this article I wrote about database sharding.

https://planetscale.com/blog/database-sharding

Nezteb1y ago

Your diagrams are fantastic; I'd enjoy a short blog post or snippet just on how you used D3 for this!

As someone who has also use GSAP a decent amount, these days I usually have a better experience with SVG.js [1].

[1] https://github.com/svgdotjs/svg.js

Joel_Mckay1y ago

Do you mean something for data visualization, or tricks condensing large data sets with cursors?

https://d3js.org/

Best of luck =3

CSDude1y ago· 2 in thread

For years, I just didn't get why replicated databases always stick with EBS and deal with its latency. Like, replication is already there, why not be brave and just go with local disks? At my previous orgs, where we ran Elasticsearch for temporary logs/metrics storage, I proposed we do exactly that since we didn't even have major reliability requirements. But I couldn't convince them back then, we ended up with even worse AWS Elasticsearch.

I get that local disks are finite, yeah, but I think the core/memory/disk ratio would be good enough for most use cases, no? There are plenty of local disk instances with different ratios as well, so I think a good balance could be found. You could even use local hard disk ones with 20TB+ disks for implementing hot/cold storage.

Big kudos to the PlanetScale team, they're like, finally doing what makes sense. I mean, even AWS themselves don't run Elasticsearch on local disks! Imagine running ClickHouse, Cassandra, all of that on local disks.

jiggawatts1y ago

I looked into this with an idea of running SQL Server Availability Groups on the Azure Las_v3 series VMs, which have terabytes of local SSD.

The main issue was that after a stop-start event, the disks are wiped. SQL Server can’t automatically handle this, even if the rest of the cluster is fine and there are available replicas. It won’t auto repair the node that got reset. The scripting and testing required to work around this would be unsupportable in production for all but the bravest and most competent orgs.

hodgesrm1y ago

There are a number of axes of performance that aren't covered in this [wonderful] article on storage performance. One of these is that EBS allows you to scale the VM up / down to change the amount of CPU & RAM available to process data on disk. We run several hundred ClickHouse clusters on this model. Rescaling to address performance issues is far more common than failures.

Example; you get a tenant performance issue on Sunday morning US time. The simplest fix is often rescale to a larger VM for the weekend, then get the A team working on the root cause first thing Monday. The incremental cost is minimal and avoids far more costly staff burnout.

pjdesno1y ago· 2 in thread

I love the visuals, and if it's ok with you will probably link them to my class material on block devices in a week or so.

One small nit: > A typical random read can be performed in 1-3 milliseconds.

Um, no. A 7200 RPM platter completes a rotation in 8.33 milliseconds, so rotational delay for a random read is uniformly distributed between 0 and 8.33ms, i.e. mean 4.16ms.

>a single disk will often have well over 100,000 tracks

By my calculations a Seagate IronWolf 18TB has about 615K tracks per surface given that it has 9 platters and 18 surfaces, and an outer diameter read speed of about 260MB/s. (or 557K tracks/inch given typical inner and outer track diameters)

For more than you ever wanted to know about hard drive performance and the mechanical/geometrical considerations that go into it, see https://www.msstconference.org/MSST-history/2024/Papers/msst...

bddicken1y ago

Whoah, thanks for sharing the paper.

pjdesno1y ago

I reviewed it three times for different conferences :-)

I’m still annoyed they didn’t include the drain time equation I used for calculating track width, which falls out of one of their equations.

Oh, and I’m very glad you showed differing track sizes across the platter. (BTW, did you know track sizes differ between platters? Google “disks are like snowflakes”)

bloopernova1y ago· 2 in thread

Fantastic article, well explained and beautiful diagrams. Thank you bddicken for writing this!

bddicken1y ago

You are welcome!

TechDebtDevin1y ago

Probably the best diagrams I've ever seen in a blog post.

robotguy1y ago· 1 in thread

Seeing the disk IO animation reminded me of Melvin Kaye[0]:

  Mel never wrote time-delay loops, either, even when the balky Flexowriter
  required a delay between output characters to work right.
  He just located instructions on the drum
  so each successive one was just past the read head when it was needed;
  the drum had to execute another complete revolution to find the next instruction.

[0] https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The...

Thoreandan1y ago

I was reminded of Mel as well! If you haven't seen it, Usagi Electric on YouTube has gotten a drum-memory system from the 1950s nearly fully-functional again.

jhgg1y ago· 1 in thread

Metal looks super cool, however at my last job when we tried using instance local SSD's on GCP, there were serious reliability issues (e.g. blocks on the device losing data). Has this situation changed? What machine types are you using?

Our workaround was this: https://discord.com/blog/how-discord-supercharges-network-di...

rcrowley1y ago

Neat workaround! We only started working with GCP Local SSDs in 2024 and can report we haven't experienced read or write failures due to bad sectors in any of our testing.

That said, we're running a redundant system in which MySQL semi-sync replication ensures every write is durable to two machines, each in a different availability zone, before that write's acknowledged to the client. And our Kubernetes operator plus Vitess' vtorc process are working together to aggressively detect and replace failed or even suspicious replicas.

In GCP we find the best results on n2d-highmem machines. In AWS, though, we run on pretty much all the latest-generation types with instance storage.

gz091y ago· 1 in thread

Nice blog. There is also a problem that generally cloud storage is "just unusually slow" (this has been noted by others before, but here is a nice summary of the problem http://databasearchitects.blogspot.com/2024/02/ssds-have-bec...)

Having recently added support for storing our incremental indexes in https://github.com/feldera/feldera on S3/object storage (we had NVMe for longer due to obvious performance advantages mentioned in the previous article), we'd be happy for someone to disrupt this space with a better offering ;).

bddicken1y ago

That database architects blog is a great read.

_1tem1y ago· 1 in thread

If this is true, then how do "serverless" database providers like Neon advertise "low latency" access? They use object storage like S3, which I imagine is an order of magnitude worse than networked storage for latency.

edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.

hodgesrm1y ago

> edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.

I can't speak to Neon specifically but I've worked a lot with analytic databases, which often use NVMe SSD caches to operate efficiently on S3 data. For time-ordered datasets like observability (e.g., metrics) most queries go to recent data which in the steady state is not just in NVMe SSD storage but generally RAM as well if you are properly tuned. For example, indexes and other metadata are permanently cached.

In realistic tests of the above scenario the effect of nVME SSD can be surprisingly muted. That's especially true if you can use clusters that spread processing across multiple compute nodes, which gives you more RAM to play with and also multiplies storage bandwith.

There are downsides to S3 of course like restarts, which require management to avoid performance issues.

vessenes1y ago· 1 in thread

Great nerdbaiting ad. I read all the way to the bottom of it, and bookmarked it to send to my kids if I feel they are not understanding storage architectures properly. :)

bddicken1y ago

The nerdbaiting will now provide generational benefit!

aftbit1y ago· 1 in thread

Hrm "unlimited IOPS"? I suppose contrasted against the abysmal IOPS available to Cloud block devs. A good modern NVMe enterprise drive is specced for (order of magnitude) 10^6 to 10^7 IOPS. If you can saturate that from database code, then you've got some interesting problems, but it's definitely not unlimited.

bddicken1y ago

Technically any drive has a finite IOPS capacity. We have found that no matter how hard we tried, we could not get MySQL to exhaust the max IOPS of the underlying hardware. You hit CPU limits long before hitting IOPS limits. Thus "infinite IOPS."

magicmicah851y ago

Can I just say that I love how informative this was that I completely forgot it was to promote a product? Excellent visuals and interactivity.

tonyhb1y ago

This is really cool, and PlanetScale Metal looks really solid, too. Always a huge sucker for seeing latency huge latency drops on releases: https://planetscale.com/blog/upgrading-query-insights-to-met....

jgalt2121y ago

Disk latency, and one's aversion to it, is IMHO the only way Hetzner costs can run up on you. You want to keep the database on local disk, and not their very slow attached Volumes (Hetzner EBS). In short, you can have relatively light work-loads that will be on sort of expensive VMs because you need 500GB, or more, of local disk. 1TB local disk is the biggest VM they offer in the US. 300 EUR a month.

rsanheim1y ago

That great infographic at the top illustrates one big reason why 'dev instances in the cloud' is a bad idea.

cmurf1y ago

Plenty of text but also many cool animations. I'm a sucker for visual aids. It's a good balance.

carderne1y ago

I'm always curious about latency for all these newdb offerings like PlanetScale/Neon/Supabase.

It seems like they don't emphasise strongly enough _make sure you colocate your server in the same cloud/az/region/dc as our db. I suspect a large fraction of their users don't realise this, and have loads of server-db traffic happening very slowly over the public internet. It won't take many slow db reads (get session, get a thing, get one more) to trash your server's response latency.

cynicalsecurity1y ago

That was a cool advertisement, I must give them that.

anonymousDan1y ago

Nice article, but the replicated approach isn't exactly comparing like with like. To achieve the same semantics you'd need to block for a response from the remote backup servers which would end up with the same latency as the other cloud providers...

SAI_Peregrinus1y ago

> The next major breakthrough in storage technology was the hard disk drive.

There were a few storage methods in between tape & HDDs, notably core memory & magnetic drum memory.

samwho1y ago

Gosh, this is beautiful. Fantastic work, Ben. <3

TheAnkurTyagi1y ago

Very nice animations.

r3tr01y ago

We are working on a platform that lets you measure this stuff with pretty high precision in real time.

You can check out our sandbox here:

https://yeet.cx/play

liweixin1y ago

Amazing! The visualizations are so great!

dangoodmanUT1y ago

what local nvme is getting 20us? Nitro?

j / k navigate · click thread line to collapse

153 comments

96 comments · 28 top-level

bddicken1y ago· 28 in thread

tealpod1y ago

Your style of explaination and animation are exceptional.

jasonthorsness1y ago

The visuals are awesome; the bouncing-box is probably the best illustration of relative latency I've seen.

Thought I would note this because one-in-a-million is not great if you have a million customers ;)

bddicken1y ago

> Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right?

Absolutely. Our actual durability is far, far, far higher than this. We believe that nobody should ever worry about losing their data, and thats the peace of mind we provide.

1 more reply

the_arun1y ago

1 more reply

mixermachine1y ago

1 in a million is the probability that all three servers die in one months, without swapping out the broken ones. So at some point in the month all the data is gone.

(i hope this calculation is correct)

If 1% probability per month then 1%/(43800/30) = (1/1460)% probability per 30 min.

For three instances: (1/1460)% * (1/1460)% * (1/1460)% = (1/3112136000)% probability per 30 min that all go down.

Calculated for one month (1/3112136000)% * (43800/30) = (1/2131600)%

So one in 213 160 000 that all three servers go down in a 30 minute time span somewhere in one month. After the 30 minutes another replica will already be available, making the data safe.

I'm happy to be corrected. The probability course was some years back :)

1 more reply

b0rbb1y ago

Edit: And for real, fantastic work, this is awesome.

bddicken1y ago

Thank you! The visuals definitely add something special to this post specifically since time is a big element in explaining latencies.

1 more reply

zalebz1y ago

The level of your effort really shows through. If you had to ballpark guess, how much time do you think you put in? and I realize keyboard time vs kicking around in your head time are quite different

bddicken1y ago

Thank you! I started this back in October, but of course have worked on plenty of other things in the meantime. But this was easily 200+ hours of work spread out over that time.

If this helps as context, the git diff for merging this into our website was: +5,820 −1

dormando1y ago

Half on topic: what libs/etc did you use for the animations? Not immediately obvious from the source page.

Thanks!

bddicken1y ago

I heavily, heavily abused d3.js to build these.

1 more reply

anymouse1234561y ago

I love this kind of datavis.

We are generally bad at internalizing comparisons at these scales. The visualizations make a huge difference in building more detailed intuitions.

Really nice work, thank you!

bddicken1y ago

Yeah I think the visuals really add to this one, especially given the time element of explaining latencies.

hakaneskici1y ago

Great work! Thank you for making this.

This is beautiful and brilliant, and also is a great visual tool to explain how some of the fundamental algorithms and data structures originate from the physical characteristics of storage mediums.

AlphaWeaver1y ago

Were you at all inspired by the work of Bartosz Ciechanowski? My first thought was that you all might have hired him to do the visuals for this post :)

bddicken1y ago

Bartosz Ciechanowski is incredible at this type of stuff. Sam Rose has some great interactive blogs too. Both have had big hits here on HN.

hodgesrm1y ago

I was delighted to see your models of tape operations as I used it a lot in the COBOL days.

logsr1y ago

Amazing presentation. It really helps to understand the concepts.

It is great to see PlanetScale moving in this direction and sharing the knowledge.

bddicken1y ago

Thank you for the info! Do you have any good references on this for those who want to learn more?

1 more reply

alexellisuk1y ago

lizztheblizz1y ago

tombert1y ago

The visualizations are excellent, very fun to look at and play with, and they go along with the article extremely well. You should be proud of this, I really enjoyed it.

bddicken1y ago

Thank you!

layer81y ago

bddicken1y ago

Interesting! Any errors you can report? Should work in safari but maybe you have something custom going on, or an older version?

1 more reply

inetknght1y ago

I don't see a single visual. I don't use the web with javascript. Why not embed static images instead or in addition?

bddicken1y ago

I'm curious, what do you do on the internet without js these days?

1 more reply

vel0city1y ago

They're not just static images or animations, they're interactive widgets.

bob10291y ago· 15 in thread

crazygringo1y ago

> I've been advocating for SQLite+NVMe for a while now.

> Latency is king in all performance matters.

> Especially in those where items must be processed serially.

bob10291y ago

The entire point is to avoid the network hop.

Application <-> SQLite <-> NVMe

has orders of magnitude less latency than

Application <-> Postgres Client <-> Network <-> Postgres Server <-> NVMe

> You should be avoiding serial database queries as much as possible in the first place.

I don't get to decide this. The business does.

4 more replies

conradev1y ago

Until you hit the single-writer limitation in SQLite, you do not need to spend more CPU cycles on Postgres

2 more replies

dangoodmanUT1y ago

The SQLite filesystem is laid out to hedge against HDD defragging. It wouldn't benefit as much as changing it to a more modern layout that's SSD-native, then using NVMe

cynicalsecurity1y ago

If only one thread of writing is required, then SQLite works absolutely great.

bob10291y ago

> If only one thread of writing is required, then SQLite works absolutely great.

The whole point of getting your commands down to microsecond execution time is so that you can get away with just one thread of writing.

Entire financial exchanges operate on this premise.

1 more reply

jstimpfle1y ago

I still measure 1-2ms of latency with an NVMe disk on my Desktop computer, doing fsync() on a file on a ext4 filesystem.

Update: about 800us on a more modern system.

rbranson1y ago

Not so sure that's true. This is single-threaded direct I/O doing a fio randwrite workload on a WD 850X Gen4 SSD:

    write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4412MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=335, avg= 3.42, stdev= 1.65
    clat (nsec): min=932, max=24868k, avg=49188.32, stdev=65291.21
     lat (usec): min=29, max=24880, avg=52.67, stdev=65.73
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   35],
     | 30.00th=[   37], 40.00th=[   38], 50.00th=[   40], 60.00th=[   43],
     | 70.00th=[   53], 80.00th=[   60], 90.00th=[   70], 95.00th=[   84],
     | 99.00th=[  137], 99.50th=[  174], 99.90th=[  404], 99.95th=[  652],
     | 99.99th=[ 2311]

3 more replies

madisp1y ago

dogben1y ago

I believe that's power saving in action. A single operation at idle is slow, the drive needs time to wake from idle.

dzr00011y ago

1 more reply

the84721y ago

I assume fsyncing a whole file does more work than just ensuring that specific blocks made it to the WAL which it can achieve with direct IO or maybe sync_file_range.

delamon1y ago

Here PM983 doing `fio --name=fsync_test --ioengine=sync --rw=randwrite --bs=4k --size=1G --numjobs=1 --runtime=10s --time_based --fsync=1`

  Jobs: 1 (f=1): [w(1)][100.0%][w=183MiB/s][w=46.7k IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=11905: Fri Mar 14 13:34:34 2025
    write: IOPS=39.1k, BW=153MiB/s (160MB/s)(1527MiB/10001msec); 0 zone resets
      clat (nsec): min=1052, max=223288, avg=1606.69, stdev=2345.64
       lat (nsec): min=1082, max=223458, avg=1653.08, stdev=2346.58
      clat percentiles (nsec):
       |  1.00th=[  1128],  5.00th=[  1176], 10.00th=[  1240], 20.00th=[  1320],
       | 30.00th=[  1448], 40.00th=[  1496], 50.00th=[  1528], 60.00th=[  1576],
       | 70.00th=[  1640], 80.00th=[  1720], 90.00th=[  1816], 95.00th=[  1960],
       | 99.00th=[  2576], 99.50th=[  3376], 99.90th=[ 10816], 99.95th=[ 32640],
       | 99.99th=[124416]
     bw (  KiB/s): min=123168, max=190568, per=99.00%, avg=154788.63, stdev=19610.50, samples=19
     iops        : min=30792, max=47642, avg=38697.16, stdev=4902.62, samples=19
    lat (usec)   : 2=95.61%, 4=4.10%, 10=0.19%, 20=0.04%, 50=0.03%
    lat (usec)   : 100=0.02%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=13, max=1238, avg=23.08, stdev= 9.27
      sync percentiles (usec):
       |  1.00th=[   15],  5.00th=[   16], 10.00th=[   16], 20.00th=[   17],
       | 30.00th=[   18], 40.00th=[   25], 50.00th=[   26], 60.00th=[   26],
       | 70.00th=[   26], 80.00th=[   26], 90.00th=[   26], 95.00th=[   27],
       | 99.00th=[   34], 99.50th=[   79], 99.90th=[  101], 99.95th=[  126],
       | 99.99th=[  347]

The same test on SN850X

  Jobs: 1 (f=1): [w(1)][100.0%][w=22.9MiB/s][w=5859 IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=23328: Fri Mar 14 13:35:04 2025
    write: IOPS=5742, BW=22.4MiB/s (23.5MB/s)(224MiB/10001msec); 0 zone resets
      clat (nsec): min=400, max=110253, avg=797.80, stdev=1244.19
       lat (nsec): min=430, max=110273, avg=826.49, stdev=1248.86
      clat percentiles (nsec):
       |  1.00th=[  502],  5.00th=[  540], 10.00th=[  572], 20.00th=[  612],
       | 30.00th=[  644], 40.00th=[  668], 50.00th=[  708], 60.00th=[  748],
       | 70.00th=[  804], 80.00th=[  868], 90.00th=[ 1032], 95.00th=[ 1176],
       | 99.00th=[ 1560], 99.50th=[ 2224], 99.90th=[ 8384], 99.95th=[23424],
       | 99.99th=[66048]
     bw (  KiB/s): min=19800, max=24080, per=100.00%, avg=23004.21, stdev=1039.13, s  amples=19
     iops        : min= 4950, max= 6020, avg=5751.05, stdev=259.78, samples=19
    lat (nsec)   : 500=0.80%, 750=58.72%, 1000=29.04%
    lat (usec)   : 2=10.89%, 4=0.28%, 10=0.18%, 20=0.04%, 50=0.04%
    lat (usec)   : 100=0.01%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=136, max=28040, avg=172.88, stdev=195.00
      sync percentiles (usec):
       |  1.00th=[  145],  5.00th=[  149], 10.00th=[  151], 20.00th=[  151],
       | 30.00th=[  159], 40.00th=[  159], 50.00th=[  159], 60.00th=[  159],
       | 70.00th=[  159], 80.00th=[  161], 90.00th=[  198], 95.00th=[  202],
       | 99.00th=[  396], 99.50th=[  416], 99.90th=[  594], 99.95th=[ 1467],
     | 99.99th=[ 5145]

kev0091y ago

NVMe is just a protocol. There are drives that are absolute shit and others that cost as much as luxury automobiles. In either case not quite DRAM latency because it is expansion bus attached.

1 more reply

sergiotapia1y ago

I had a lot of fun with Coolify running my app and my database on the same machine. It was pretty cool to see zero latency in my SQL queries, just the cost of the engine.

__turbobrew__1y ago· 6 in thread

I think something about distributed storage which is not appreciated in this article:

1. Some systems do not support replication out of the box. Sure your cassandra cluster and mysql can do master slave replication, but lots of systems cannot.

rcrowley1y ago

Replicated network-attached storage that presents a "local" filesystem API is a powerful way to create durability in a system that doesn't build it in like we have.

__turbobrew__1y ago

Agreed, if you are a mature enough and well funded organization you probably should be using NVME and then run distributed systems on top of the NVMEs to manage replication yourself.

3921y ago

This is where s2.dev could in theory come to the rescue. Able to keep up with the streaming bandwidth, but durable.

wmf1y ago

I assume DRBD still exists although it's certainly easier to use EBS.

maayank1y ago

what do you mean by drains?

rcrowley1y ago

If you miss a termination event you miss your chance to copy that data elsewhere. Of course, if you're _always_ copying the data elsewhere, you can rest easy.

ucarion1y ago· 4 in thread

Really, really great article. The visualization of random writes is very nicely done.

On:

> If instead you have your storage attached directly to your compute instance, there are no artificial limits placed on IO operations. You can read and write as fast as the hardware will allow for.

I feel like this might be a dumb series of questions, but:

1. The ratelimit on "IOPS" is precisely a ratelimit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

2. Does this save me money? And if yes, is from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?

I see pretty clearly putting storage and compute on the same machine strictly a latency win, because you structurally have one less hop every time. But is it also a throughput-per-dollar win too?

rbranson1y ago

> 1. The ratelimit on "IOPS" is precisely a ratelimit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

> 2. Does this save me money? And if yes, is from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?

It might. It's a set of trade-offs.

ucarion1y ago

That makes sense. The weirdness of https://docs.aws.amazon.com/ebs/latest/userguide/ebs-io-char... makes more sense now. Reminds me of DynamoDB capacity units.

the84721y ago

For network-attached storage IOPS limits packets per second, not bandwidth, since IO operations can happen at different sizes (e.g. 4K vs. 16K blocks).

rbranson1y ago

More specific details for EC2 instances can be seen in the docs here: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html...

gozzoo1y ago· 3 in thread

Can someeone share their expirience in creating such diagrams. What libraries and tools can be useful for such interactive diagrams?

bddicken1y ago

For this particular one I used d3.js, but honestly this isn't really the type of thing it's designed for. I've also used GSAP for this type of thing on this article I wrote about database sharding.

https://planetscale.com/blog/database-sharding

Nezteb1y ago

Your diagrams are fantastic; I'd enjoy a short blog post or snippet just on how you used D3 for this!

As someone who has also use GSAP a decent amount, these days I usually have a better experience with SVG.js [1].

[1] https://github.com/svgdotjs/svg.js

Joel_Mckay1y ago

Do you mean something for data visualization, or tricks condensing large data sets with cursors?

https://d3js.org/

Best of luck =3

CSDude1y ago· 2 in thread

jiggawatts1y ago

I looked into this with an idea of running SQL Server Availability Groups on the Azure Las_v3 series VMs, which have terabytes of local SSD.

hodgesrm1y ago

pjdesno1y ago· 2 in thread

I love the visuals, and if it's ok with you will probably link them to my class material on block devices in a week or so.

One small nit: > A typical random read can be performed in 1-3 milliseconds.

Um, no. A 7200 RPM platter completes a rotation in 8.33 milliseconds, so rotational delay for a random read is uniformly distributed between 0 and 8.33ms, i.e. mean 4.16ms.

>a single disk will often have well over 100,000 tracks

For more than you ever wanted to know about hard drive performance and the mechanical/geometrical considerations that go into it, see https://www.msstconference.org/MSST-history/2024/Papers/msst...

bddicken1y ago

Whoah, thanks for sharing the paper.

pjdesno1y ago

I reviewed it three times for different conferences :-)

I’m still annoyed they didn’t include the drain time equation I used for calculating track width, which falls out of one of their equations.

Oh, and I’m very glad you showed differing track sizes across the platter. (BTW, did you know track sizes differ between platters? Google “disks are like snowflakes”)

bloopernova1y ago· 2 in thread

Fantastic article, well explained and beautiful diagrams. Thank you bddicken for writing this!

bddicken1y ago

You are welcome!

TechDebtDevin1y ago

Probably the best diagrams I've ever seen in a blog post.

robotguy1y ago· 1 in thread

Seeing the disk IO animation reminded me of Melvin Kaye[0]:

  Mel never wrote time-delay loops, either, even when the balky Flexowriter
  required a delay between output characters to work right.
  He just located instructions on the drum
  so each successive one was just past the read head when it was needed;
  the drum had to execute another complete revolution to find the next instruction.

[0] https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The...

Thoreandan1y ago

I was reminded of Mel as well! If you haven't seen it, Usagi Electric on YouTube has gotten a drum-memory system from the 1950s nearly fully-functional again.

jhgg1y ago· 1 in thread

Our workaround was this: https://discord.com/blog/how-discord-supercharges-network-di...

rcrowley1y ago

Neat workaround! We only started working with GCP Local SSDs in 2024 and can report we haven't experienced read or write failures due to bad sectors in any of our testing.

In GCP we find the best results on n2d-highmem machines. In AWS, though, we run on pretty much all the latest-generation types with instance storage.

gz091y ago· 1 in thread

bddicken1y ago

That database architects blog is a great read.

_1tem1y ago· 1 in thread

edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.

hodgesrm1y ago

> edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.

There are downsides to S3 of course like restarts, which require management to avoid performance issues.

vessenes1y ago· 1 in thread

Great nerdbaiting ad. I read all the way to the bottom of it, and bookmarked it to send to my kids if I feel they are not understanding storage architectures properly. :)

bddicken1y ago

The nerdbaiting will now provide generational benefit!

aftbit1y ago· 1 in thread

bddicken1y ago

magicmicah851y ago

Can I just say that I love how informative this was that I completely forgot it was to promote a product? Excellent visuals and interactivity.

tonyhb1y ago

jgalt2121y ago

rsanheim1y ago

That great infographic at the top illustrates one big reason why 'dev instances in the cloud' is a bad idea.

cmurf1y ago

Plenty of text but also many cool animations. I'm a sucker for visual aids. It's a good balance.

carderne1y ago

I'm always curious about latency for all these newdb offerings like PlanetScale/Neon/Supabase.

cynicalsecurity1y ago

That was a cool advertisement, I must give them that.

anonymousDan1y ago

SAI_Peregrinus1y ago

> The next major breakthrough in storage technology was the hard disk drive.

There were a few storage methods in between tape & HDDs, notably core memory & magnetic drum memory.

samwho1y ago

Gosh, this is beautiful. Fantastic work, Ben. <3

TheAnkurTyagi1y ago

Very nice animations.

r3tr01y ago

We are working on a platform that lets you measure this stuff with pretty high precision in real time.

You can check out our sandbox here:

https://yeet.cx/play

liweixin1y ago

Amazing! The visualizations are so great!

dangoodmanUT1y ago

what local nvme is getting 20us? Nitro?

j / k navigate · click thread line to collapse