Why do you think that is? Are there possibly other projects out there that I'm not familiar with?
- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"
- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."
- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"
Also https://www.snia.org/sites/default/files/ESF/Key-Value-Stora...
The device support section of Samsung's uNVMe evaluation guide (from 2019) just states:
Guide Version: uNVMe2.0 SDK Evaluation Guide ver 1.2
Supported Product(s): NVMe SSD (Block/KV)
Interface(s): NVMe 1.2
https://github.com/OpenMPDK/uNVMe/blob/master/doc/uNVMe2.0_S...
I can't find spec sheets detailing which NVMe command sets are supported, even for their enterprise drives.
Good overview: https://www.mydistributed.systems/2020/07/towards-building-h...
[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Stora... but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...
16 bytes is long enough that collisions will be super rare, and while you obviously need to write code to support that case, it should have no performance impact.
If so, that is probably the reason for a 16 byte key - there is just no way anybody needs a key bigger than 16 bytes for an address anytime soon.
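To put "super rare" in numbers, a quick birthday-bound estimate (a sketch, assuming keys are uniformly hashed into the 128-bit space):

```python
import math

def collision_probability(n_keys: int, key_bits: int) -> float:
    # Birthday bound: P(any collision) ~= 1 - exp(-n^2 / 2^(b+1));
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(n_keys ** 2) / 2 ** (key_bits + 1))

# Even a trillion uniformly hashed 16-byte (128-bit) keys leave the
# chance of even one collision somewhere around 1e-15.
p = collision_probability(10 ** 12, 128)
```

So the "handle the collision case" code path is essentially dead weight you still have to write.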
The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.
Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/las...
You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:
https://github.com/OpenMPDK/xNVMe
Though I'm not super knowledgeable about it. I think Redfish/Swordfish are maybe meant for this sort of thing:
https://www.snia.org/forums/smi/swordfish
There's a video on NVMe and NVMe-oF management for instance:
https://www.youtube.com/watch?v=56VoD_1iGIs&list=PLH_ag5Km-Y...
> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.
https://github.com/aerospike/aerospike-server/blob/master/cf...
There are other occurrences in the codebase, but that is the most prominent one.
I’m also curious whether different, more performant data structures could be leveraged; if so, there may be downstream improvements for garbage collection, retrieval, and request parallelism.
The exact semantics vary per protocol but it’s a feature of most protocols at least in the currently used revisions: https://en.wikipedia.org/wiki/Native_Command_Queuing
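From the host side, that queuing only pays off if you keep many commands in flight. A simplified sketch using a thread pool over positional reads on a scratch file (a stand-in for the real thing; high-performance engines would use io_uring or SPDK against the raw device):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096
QUEUE_DEPTH = 32   # NCQ tops out at 32 commands; NVMe allows far deeper queues

# A scratch file standing in for a block device.
fd, path = tempfile.mkstemp()
data = os.urandom(BLOCK * 256)
os.pwrite(fd, data, 0)

def read_block(lba: int) -> bytes:
    # os.pread is positional, so many workers can safely share one fd.
    return os.pread(fd, BLOCK, lba * BLOCK)

# Keep QUEUE_DEPTH reads in flight; the kernel and device are free to
# complete them out of order, which is exactly what command queuing exploits.
with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
    blocks = list(pool.map(read_block, range(256)))
```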
But that's about it. And the latency is still worse than in-memory solutions.
Between that and the non-trivial effort needed to make this work in any sort of cloud setup (be it self-hosted k8s or AWS), it's a hard sell. If I really need latency above all, AWS gives me instances with 24TB RAM, and if I don't… why not just use existing kv-stores and accept the couple of µs extra latency?
Given, however, that most of the world has shifted to VMs, I don't think KV storage is accessible for that reason alone: the disks are often split out to multiple users. So the overall demand for this would be low.
Some U.2 drives even support thin provisioning, like how a hypervisor treats a sparse disk file but for physical hardware.
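The sparse-file analogy is easy to see from userspace (a sketch; thin-provisioned drives do the equivalent at the media level):

```python
import os
import tempfile

# A sparse file: the logical size is 1 GiB, but the filesystem allocates
# no blocks until something is actually written.
fd, path = tempfile.mkstemp()
os.truncate(fd, 1 << 30)        # logical size: 1 GiB, nothing written
st = os.stat(path)
logical = st.st_size            # reports the full 1 GiB
allocated = st.st_blocks * 512  # typically ~0 bytes actually allocated
os.close(fd)
os.unlink(path)
```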
One thing they don't tell you about NVMe is you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.
I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS.
You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.
You don't get to do both simultaneously.
Embedded is a feature for performance-aware software, not a bug.
Utilizing: https://memcached.org/blog/nvm-caching/, https://github.com/m...
TL;DR: Grafana Cloud needed tons of caching, and it was expensive, so they used extstore in memcached to hold most of it on NVMe disks. This massively reduced their costs.
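The core idea behind extstore can be sketched in a few lines (a toy model, not memcached's actual implementation): small hot values stay in RAM, large values get appended to a flash-backed file, and RAM keeps only (offset, length) metadata for them.

```python
import os
import tempfile

class TieredCache:
    """Toy extstore-style tiering: small values in RAM, large values on
    'flash' (a file), with only (offset, length) metadata kept in RAM."""

    def __init__(self, spill_path, ram_limit=64):
        self.ram = {}        # key -> bytes, for small/hot values
        self.index = {}      # key -> (offset, length) in the spill file
        self.ram_limit = ram_limit
        self.fd = os.open(spill_path, os.O_RDWR | os.O_CREAT, 0o600)
        self.tail = 0        # append-only write head

    def set(self, key, value):
        if len(value) <= self.ram_limit:
            self.ram[key] = value
        else:
            os.pwrite(self.fd, value, self.tail)
            self.index[key] = (self.tail, len(value))
            self.tail += len(value)

    def get(self, key):
        if key in self.ram:
            return self.ram[key]
        if key in self.index:
            offset, length = self.index[key]
            return os.pread(self.fd, length, offset)
        return None

cache = TieredCache(tempfile.mkstemp()[1])
cache.set("small", b"hi")          # stays in RAM
cache.set("large", b"x" * 1024)    # spills to the file
```

The RAM savings come from the index entries being a fixed handful of bytes regardless of value size, which is why it pays off most for large cached objects.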
> High-performance storage engines. There are a number of storage engines and key-value stores optimized for flash. RocksDB [36] is based on an LSM-Tree that is optimized for low write amplification (at the cost of higher read amplification). RocksDB was designed for flash storage, but at the time of SATA SSDs, and therefore cannot saturate large NVMe arrays.
From this slightly tangent mention, I am guessing not.
https://web.archive.org/web/20230624195551/https://www.vldb....
I mean, using a merkle tree or something like that to make sense of the underlying data.
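The merkle-tree idea in miniature (a sketch, assuming SHA-256 and a non-empty list of data blocks):

```python
import hashlib

def merkle_root(leaves):
    """Compute a Merkle root over a non-empty list of byte strings."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Changing any block changes the root, so two replicas can compare roots
# (and then subtrees) to pinpoint which blocks of underlying data diverge.
```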
(yes it's fashionable, but it's still terrible for random read performance)
I mean, what's the trick NVMe can do to be meaningfully faster?
Who could afford to develop and maintain such a niche thing, in today’s economy, without either a universal basic income or a “non-free” license to guarantee revenue?
Otherwise though…you have the file system. Is that not enough?
https://github.com/rails/solid_cache didn't include anything about NVMe that I could find.
So Solid Cache and Solid Queue just use the database (MySQL), which uses NVMe.
So now, in addition to: "You don't need a queue, just use Postgres/MySQL", we have "You don't need a cache, just use Postgres/MySQL"
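What "just use the database as a cache" boils down to is a table with a key, a value, and an expiry; a minimal sketch, with SQLite standing in for the MySQL/Postgres that Solid Cache actually targets:

```python
import sqlite3
import time

# A minimal database-backed cache table, in the spirit of Solid Cache;
# SQLite keeps this self-contained.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE cache (
    key TEXT PRIMARY KEY, value BLOB, expires_at REAL)""")

def cache_set(key, value, ttl=300):
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
               (key, value, time.time() + ttl))

def cache_get(key):
    row = db.execute(
        "SELECT value FROM cache WHERE key = ? AND expires_at > ?",
        (key, time.time())).fetchone()
    return row[0] if row else None
```

On an NVMe-backed database this is slower than RAM but fast enough for many workloads, which is the whole argument.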
Even more complex when you want any kind of redundancy, as you'd essentially need to build some kind of RAID-like layer into your database.
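The simplest form of that RAID-like layer is mirroring every write; a toy sketch (RAID-1-style, with two files standing in for two devices):

```python
import os
import tempfile

class MirroredStore:
    """Toy RAID-1-style mirroring: every write goes to two backing files;
    reads fall back to the mirror if the primary is unreadable."""

    def __init__(self, path_a, path_b):
        self.fds = [os.open(p, os.O_RDWR | os.O_CREAT, 0o600)
                    for p in (path_a, path_b)]

    def write(self, offset, data):
        for fd in self.fds:                # both replicas, same offset
            os.pwrite(fd, data, offset)

    def read(self, offset, length):
        for fd in self.fds:
            try:
                return os.pread(fd, length, offset)
            except OSError:
                continue                   # replica failed: try the next one
        raise IOError("all replicas failed")

store = MirroredStore(tempfile.mkstemp()[1], tempfile.mkstemp()[1])
store.write(0, b"hello")
```

Real redundancy also needs failure detection, resync, and write ordering, which is exactly the non-trivial part the comment above is pointing at.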
Also, a few terabytes of NVMe in RAID 10 plus PostgreSQL or similar covers about 99% of companies' needs for speed.
So you're left with the 1% needing that kind of speed.