Examples of some questions I'd like to be able to answer, or at least make reasonable decisions about (note: I don't actually want answers to these right now; they're just examples of the sort of thing I'd like to read about in depth to build up some background knowledge):
* how to ensure data's been safely written (e.g. when to flush, fsync, what
guarantees that gives, using WAL)
* block sizes to read/write for different purposes, tradeoffs, etc.
* considerations for writing to different media/filesystems (e.g. disk, ssd, NFS)
* when to rely on OS disk cache vs. using own cache
* when to use/not use mmap
* performance considerations (e.g. multiple small files vs. few larger ones,
concurrent readers/writers, locking, etc.)
* OS specific considerations
I recall reading some posts (about Redis/SQLite/Postgres) on this, which made me realise that it's a fairly complex topic, but not one I've found a good entry point for. Any pointers to books, documentation, etc. on the above would be much appreciated.
(Of course, FIO's options aren't quite exhaustive, but I can only think of two or three things where I've had to wrap or extend FIO to achieve what I needed.)
* Bigger blocks = better performance. The bigger you can make it the faster you'll go. Your limiting factor is usually the desired resolution of the user (i.e. aggregation will inevitably result in under-utilized space).
* Disk, SSD and NFS don't all belong to the same category. Most modern products in storage are developed with the expectation that the media is SSD. Virtually nobody wants to enter the market of HDDs. The performance gap is just too big, and the existing products that still use HDDs rely on fast caching in something like flash memory anyways. NFS is a hopelessly backwards and outdated technology. It's the least common denominator, and that's why various storage products do support it, but if you want to go fast, forget about it. The tradeoff here usually is between writing your own client (usually, a kernel module) to do I/O efficiently, or spare users the need for installing a custom kernel module (often a security audit issue) and let them go slow...
* OS disk cache is somewhat of a misnomer. There are also two things that might get confused here. OS doesn't cache data written to disk -- the disk does. OS provides mechanism to talk to the disk and instruct it to flush the cache. There's also filesystem cache -- that's what OS does. It caches in the memory it manages the file contents of recently accessed files.
* I/O through mmap is a gimmick. Just one of the ways to abuse system API to do something it's not really intended to do. You can safely ignore it. If you are looking into making I/O more efficient, look into io_uring.
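The cache bullet above separates three layers: the application's own buffer, the OS file cache, and the device. A minimal Python sketch of a write that tries to push data through all of them might look like this. It assumes a POSIX system such as Linux (fsync on a directory does not work on Windows), and the function name `durable_write` and the `.tmp` suffix are hypothetical choices made for illustration only.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and ask the OS to push it to stable storage."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming convention

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # application buffer -> OS file cache
        os.fsync(f.fileno())  # OS file cache -> device, as far as the OS can tell

    # Swap the finished file into place; readers see the old or the new file.
    os.replace(tmp_path, path)

    # Persist the rename itself by syncing the containing directory (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether the device actually honours the flush request is outside the program's control, so this is a sketch of the usual pattern, not a guarantee.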
* Big distributed storage systems still use hdd's, usually within a tiered system including SSDs and NVMe.
* A good nfs server implementation will beat the pants off all the cloud vendors. It's still highly relevant in physical datacenters.
* Mmap is used heavily in a ton of software for good reason. On top of that it's part of the POSIX API.
* While block size is one of those things where it usually doesn't matter until it does, just saying bigger blocks is faster is a bit misleading.
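For readers who have not used mmap, here is a small Python sketch of what reading a file through a mapping looks like. It takes no side in the argument about whether this is a good idea for a database; the function name `count_occurrences_mmap` is made up for the example.

```python
import mmap
import os


def count_occurrences_mmap(path: str, needle: bytes) -> int:
    """Count non-overlapping occurrences of needle using a read-only mapping."""
    if not needle:
        raise ValueError("needle must not be empty")
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be mapped
        # Length 0 maps the whole file; pages are faulted in as they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(needle)
            while pos != -1:
                count += 1
                pos = m.find(needle, pos + len(needle))
            return count
```

There is no explicit read call here: the kernel loads pages on demand, which is the convenience people like and also the loss of control people object to.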
> Big distributed storage systems still use hdd's
So what? Did you read what I wrote? I wrote about developing new, not supporting old...
> A good nfs server implementation will beat the pants off all the cloud vendors.
What are you even talking about? What cloud vendors have to do with this? Did you read what you replied to?
> Mmap is used heavily in a ton of software for good reason
So what? OP is asking in the context of writing a database / disk I/O. It's the wrong system API to do that. It's intended for applications to "easily" save their in-memory data. If that's what it's used for, then it's fine. If it's used to implement a filesystem, then the filesystem authors don't understand what they are doing. Also, being part of POSIX or any other standard doesn't grant magical immunity from being bad functionality... just look at the history of UNIX / Linux repeatedly failing to come up with an interface for asynchronous I/O, and sure enough, all these iterations made it into the standard.
> just saying bigger blocks is faster is a bit misleading.
It's not misleading. For the one paragraph answer, it's perfectly correct. And, no, block size is a very important aspect of any storage system, it's not something that may not matter.
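One part of the block size question that is easy to see for yourself is the number of read calls. The Python sketch below (the helper name `read_call_count` is hypothetical) opens a file unbuffered and counts how many reads it takes to consume it at a given block size. It measures call count only, not throughput, so it illustrates one side of the tradeoff and settles nothing about the other.

```python
def read_call_count(path: str, block_size: int) -> int:
    """Count the reads that return data when consuming path in block_size chunks."""
    calls = 0
    # buffering=0 gives a raw file object, so each read() maps to one OS read.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break  # end of file
            calls += 1
    return calls
```

For a 1 MiB regular file this gives 256 calls at 4 KiB blocks and 16 calls at 64 KiB blocks.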
---
As an aside: you sound pretentious, and try to pass as knowledgeable by saying things that have a drop of truth, but are mostly fancily dressed nonsense. Just stop. It's embarrassing.
Books can't get you very far. All they're really good for is informing that exploration.
The key is to hold yourself accountable. It's easy to build sandcastles and think you're a brilliant architect if you don't try jumping on top of them. The iteration loop is what forces you to learn: pick something about your thing you can test objectively, and then make it better.
One of my favorite things to do is to implement something boring and standard, but with an unusual arbitrary design constraint that forces me to rethink the normal approach.
Redis from Scratch will walk you through the very basics.
Want to know how to build compilers? Crafting Interpreters (while very wordy) will walk you (painstakingly) through the very basics.
Want to build a basic server? Beej on Linux Networking.
Build a time-series database to handle back-testing for automated trading systems.
Slap on a bare-bones SQL interpreter onto it.
Now add networking so you can deploy it somewhere.
Now figure out what’s wrong with it (is the performance merely slow or are there serious pitfalls a la MongoDB?)
How do you handle multi user environments?
How do you optimize for filesystem throughput while maintaining ACID? Are you like Mongo, where you just queue everything and return an ACK, or do you only ACK back when you’ve successfully written to disk?
What’s your protocol for communicating with your DB?
How about sharding or distributed storage?
Hot/cold data swapping?
Execution engine or hand-crafted data retrieval semantics a la q (lang)?
How about remote direct memory access (RDMA) to get past the kernel? How about regular old kernel bypass?
How do you handle a catastrophic failure where I physically pull the plug on your machine?
Are you using SIMD/vectorization?
There’s so much you could do. Pick whatever interests you the most.
An SSD is just a completely different beast from a spinning disk. Spinning disks are much slower and really want to read things linearly.
Most of your other questions - block sizes, caching, mmapping, file size, concurrency - the answers will be completely different on SSD vs spinning disk.
SSD/NVMe/etc drives are great for tiny amounts of storage, but if you need to process something substantial, like petabytes of data, you need to know how to write code to operate on spinning HDs.
Most people in ML or data science should default to writing for spinning disks, as they often have to deal with >>4TB of data.
In particular, making good use of SSDs requires your application to be able to issue many simultaneous IO requests, which hard drives are relatively bad at handling and many IO APIs from before the SSD era make difficult or impossible.
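As a rough illustration of "many simultaneous IO requests", here is a Python sketch that issues several positioned reads at once from a thread pool. It assumes a Unix-like system (`os.pread` is not available on Windows), and the function name `read_blocks_concurrently` and the default of 8 workers are arbitrary choices for the example. Threads are only one way to keep multiple requests in flight.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_blocks_concurrently(path, offsets, block_size, workers=8):
    """Read block_size bytes at each offset, with several reads in flight at once."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pread takes an explicit offset, so threads never share a file position.
            futures = [pool.submit(os.pread, fd, block_size, off) for off in offsets]
            # Results come back in the same order as the offsets.
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```

On a spinning disk the same access pattern tends to cause seeking, which is the difference the comment above is pointing at.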
I highly recommend Gregg's Systems Performance (2nd edition came out in 2020). While the book is focused on performance rather than development, Gregg does a great job explaining a huge number of concepts without going too deep, specifically related to memory, fs, and block I/O.
Unfortunately, in terms of many of the things you care about, books tend to be outdated. Kerrisk's Linux Programming Interface is over 10 years old, and covers only ext2. Robert Love's great books on the kernel are hugely useful (though less intended for application developers) but also slightly outdated.
As far as books are concerned:
- Database Internals
- Designing Data Intensive Applications
- Disk-Based Algorithms for Big Data
- Database Systems by Ullman et al (http://infolab.stanford.edu/~ullman/pub/dscbtoc.txt) Part IV covers implementation details of a database system.
I'm in the middle of reading Designing Data Intensive Applications, and Database Internals is next on my list, but hadn't come across the other two yet - have added to my reading list now.
https://stackoverflow.com/questions/75697877/why-is-liburing...
If you use io_uring (such as via liburing) on Linux, it forces you to split your IO in half: you submit a request and later wait for its completion, and you can do other things in between. You can submit multiple writes or reads in parallel and handle them when they're ready.
This white paper talks about writing to disk in S3 in a scheduled order to avoid corruption of concurrent requests in the event of a crash.
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3"
Apache Kafka supposedly doesn't need fsync due to the recovery protocol. So you might want to investigate why that is the case and whether or not you can create the same behaviour.
A good start on device drivers (old 2.6.10 kernel but still good): https://lwn.net/Kernel/LDD3/
Operating Systems: Three Easy Pieces - Files and directories: https://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf (Another great book from an OS perspective with some userspace interactions)
A thorough Linux internal engineering book is: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul... (The bibliography has tons of links on topics you might be interested in, Chapter 7 on locks is great)
I recommend implementing a basic key/value single table "database" in C/C++ and then add threading/multi-process interfaces so you can mentally figure out all the pros/cons. It's not technically "hard" and you'll learn a lot.
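To give a sense of scale for that exercise, here is a minimal single-file sketch in Python (the comment suggests C/C++; Python is used here only to keep it short). Everything about it is an assumption made for illustration: the class name `TinyKV`, the record layout (two little-endian 32-bit lengths followed by key and value), and the choice to fsync on every put. It has no checksums, deletes, compaction, or locking, which are exactly the parts worth adding yourself.

```python
import os
import struct

_HEADER = struct.Struct("<II")  # key length, value length (hypothetical format)


class TinyKV:
    """Append-only key/value store: puts are logged to disk, the index is in memory."""

    def __init__(self, path: str) -> None:
        self._index = {}
        good = self._replay(path) if os.path.exists(path) else 0
        self._log = open(path, "ab")
        self._log.truncate(good)  # drop a torn record left by a crash, if any

    def _replay(self, path: str) -> int:
        """Rebuild the index from the log; return the offset of the last good record."""
        good = 0
        with open(path, "rb") as f:
            while True:
                header = f.read(_HEADER.size)
                if len(header) < _HEADER.size:
                    break  # clean end of log or incomplete header
                klen, vlen = _HEADER.unpack(header)
                body = f.read(klen + vlen)
                if len(body) < klen + vlen:
                    break  # incomplete record
                self._index[body[:klen]] = body[klen:]  # later records win
                good = f.tell()
        return good

    def put(self, key: bytes, value: bytes) -> None:
        self._log.write(_HEADER.pack(len(key), len(value)) + key + value)
        self._log.flush()
        os.fsync(self._log.fileno())  # acknowledge only after this returns
        self._index[key] = value

    def get(self, key: bytes, default=None):
        return self._index.get(key, default)

    def close(self) -> None:
        self._log.close()
```

Reopening the same path replays the log, so `TinyKV(path).get(b"k")` returns whatever was last put before the previous close or crash.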
- "Practical Filesystem Design": http://www.nobius.org/dbg/practical-file-system-design.pdf
- "Robert Love: Linux Kernel Development (chapters 13-14)" https://www.amazon.com/Linux-Kernel-Development-Robert-Love/...
- "The Linux Programming Interface (File I/O chapters)": https://www.amazon.com/Linux-Programming-Interface-System-Ha...
https://www.youtube.com/watch?v=oeYBdghaIjc&list=PLSE8ODhjZX...
What I did to learn the lower-level APIs, and perform initial testing on the driver, was write a "mirror" drive. The user-mode code pointed to a folder on disk, the driver made a virtual disk drive, and all reads and writes in the virtual disk drive went to the mirror folder. All of our (cough) unit tests for virtual drive handling used the mirror drive. ("cough" because the tests fit into that happy area that truly is a unit test but drives people nuts about splitting hairs about the semantics between unit and integration tests.)
On Windows, you can implement something like that using Dokany, Dokan, or WinFsp. On Linux, there's the FUSE API. On Mac, there's macFUSE.
Even if you don't do a "mirror" drive, understanding the callbacks that libraries like Dokany, Dokan, WinFsp, and FUSE expose helps you understand how IO happens in the driver. Many IO methods in popular languages are abstractions above what the OS does. (For example, the Windows kernel has no concept of the "Stream" that's in your C# program. The "Stream"'s Position property is purely a construct within the .NET framework.)
https://github.com/dokan-dev/dokany
Another place to start is the OS's documentation itself. For example, you can start with Windows' CreateFileA function. This typically is what gets called "under the hood" in most programming languages when you open or create a file: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/...
Too many times have I seen some data scientist trying to parse and write 6 Petabytes of data with multiple cores, while the disk is thrashing about.
Spinning disks are still the backbone of most data science operations because they deal with >>4TB datasets, which can't be stored in SSD drives without breaking some serious bank.
So yes, how to use producer/consumer multiprocessing queues correctly should be taught to everyone who does computing, as the standard template.
Disk thrash is a threat.
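A minimal sketch of that producer/consumer shape in Python: one thread reads the file sequentially, so the disk sees a single linear stream, and several workers process chunks from a bounded queue. The comment talks about multiprocessing queues; this sketch uses threads and `queue.Queue` to stay short, and the function name `checksum_file_chunks` and the CRC32 "work" are stand-ins chosen for the example.

```python
import queue
import threading
import zlib


def checksum_file_chunks(path, chunk_size=1 << 20, workers=4):
    """Return the CRC32 of each chunk, reading with one thread and hashing with many."""
    chunks = queue.Queue(maxsize=workers * 2)  # bounded: the reader cannot run far ahead
    results = {}
    lock = threading.Lock()

    def producer():
        # The only code that touches the disk, so reads stay sequential.
        with open(path, "rb") as f:
            index = 0
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                chunks.put((index, data))
                index += 1
        for _ in range(workers):
            chunks.put(None)  # one stop marker per consumer

    def consumer():
        while True:
            item = chunks.get()
            if item is None:
                break
            index, data = item
            value = zlib.crc32(data)  # stand-in for real per-chunk work
            with lock:
                results[index] = value

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]
```

The bounded queue is the important design choice: it applies back-pressure to the reader instead of letting memory fill up when the workers are slower than the disk.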
https://blog.koehntopp.info/2023/05/05/50-years-in-filesyste...
https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-in...
https://databasearchitects.blogspot.com/2021/06/what-every-p...
https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap...