Whether it's a good idea depends on your goal, and what alternative building blocks you have available.
(Eg if you are building your distributed relational database to run on top of lots of computers with spinning hard disks, you might want to expose some more characteristics of the hard disk directly to your database, so you can manage them, instead of trying to hide them behind an abstraction.)
I think I'd prefer to stop calling _large_ resources that are only K:V a 'database' though.
A 'database' shouldn't require SQL, but a distributed filesystem, however similar, isn't quite a database.
Perhaps the distinction is more pragmatic than fundamentally technical. We typically use the term "database" to describe systems designed primarily for structured data management with query capabilities, while filesystems optimize for hierarchical storage of opaque binary objects.
Therefore a KV is not a database either.
I think 'database' is a term with multiple (related) meanings depending on context.
Another example is the term 'colour'. Depending on context, it sometimes makes sense to call black and white and grey 'colours', and sometimes it's better to treat them as something else.
k->v is a data store (using disk|inmem|networked storage engines).
A database is a complete system for management of data. They come (or used to come) in various data model flavors: hierarchical, graph, relational, etc.
Filesystems present a durable way to store hierarchical binary/textual data. They normally have a very well-defined api used to provide a primitive query language. Sounds a lot like a database, no?
Even internally they are very similar: journalling, paging, tree indexes are normally present in typical popular implementations.
In some classic OSes there is no separation at all between the concepts of a database and a filesystem.
In a way, a generic durable database can be thought of as a special kind of filesystem. And vice versa.
It's certainly a base for data and can be used to implement all core concepts of relational calculus but it isn't designed for such and doesn't do so with performance in mind. Conversely, filesystems are often implemented using B-trees as many RDBMSes are but aren't designed for many of the operations one might typically ascribe to a database.
Nomenclature is tricky... how does that saying about the two hardest problems in CS go again?
Filesystems are not a relational database, sure, but the word "database" in the context of computer systems, computer science, IT and technology in general doesn't mean "relational database".
Filesystems are definitely a hierarchical database, which is different to a relational database.
Think of KV databases as a persistent associative mapping/hash map that needs to store data in a safe and secure way; then we can build advanced stuff on top of it. Take TiDB for example: it is a distributed database that is compatible with MySQL (its own query language can be considered a subset of MySQL's), but actually most of the heavy lifting is handled by TiKV, which is a distributed KV datastore with Raft distributed consensus.
And then SurrealDB also leveraged TiKV to build their own graph-document hybrid database product... as one of the data transports. P.S.: used to be a contributor for SurrealDB.
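To make the "build advanced stuff on top" idea concrete, here is a minimal sketch of a table with one secondary index layered on a plain key-value map. The class name, the key layout and the use of a Python dict as the store are all assumptions for illustration; this is not how TiDB or SurrealDB actually encode their data, and indexed values containing "/" would need escaping.

```python
# Hypothetical sketch: a table with a secondary index layered on a plain KV map.
# The dict stands in for a durable, replicated store; the key layout is made up.
import json


class TableOnKV:
    def __init__(self, kv, table, indexed_column):
        self.kv = kv                      # any mutable mapping of str -> str
        self.table = table
        self.indexed_column = indexed_column

    def _row_key(self, pk):
        return f"t/{self.table}/r/{pk}"

    def _index_key(self, value, pk):
        return f"t/{self.table}/i/{self.indexed_column}/{value}/{pk}"

    def insert(self, pk, row):
        old = self.get(pk)
        if old is not None:
            # Remove the stale index entry before overwriting the row.
            self.kv.pop(self._index_key(old[self.indexed_column], pk), None)
        self.kv[self._row_key(pk)] = json.dumps(row)
        self.kv[self._index_key(row[self.indexed_column], pk)] = ""

    def get(self, pk):
        raw = self.kv.get(self._row_key(pk))
        return None if raw is None else json.loads(raw)

    def find_by_index(self, value):
        # An ordered KV store would do a range scan over this prefix;
        # walking every key of the dict is only for illustration.
        prefix = f"t/{self.table}/i/{self.indexed_column}/{value}/"
        pks = sorted(k[len(prefix):] for k in self.kv if k.startswith(prefix))
        return [self.get(pk) for pk in pks]
```

In a real system the row write and the index write would have to go through one transaction, which is exactly the part the underlying KV layer is supposed to provide.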
If your workload has even a whiff of analytics to it, operational or slow-time, KV databases are almost the pathological architecture in theory. Their intrinsically poor locality exacts a steep performance price.
These database architectures are all equivalent in the same sense that almost everything is a Turing Machine. Some manifestations and implementations are much more efficient than others in the real world. While I am not as emotionally invested in it as the article’s author seems to be, he is generally correct that KV databases have poor properties for most applications.
The first one I encountered was DrFTPD circa 2004. But these days, any object storage system qualifies because they all support varying replication schemes and reading from any valid in-sync replica.
In my book, a parallel filesystem is not just pooling together a bunch of nodes, but something that can actually support the synchronized accesses needed by a parallel workload. So not just decoupling between data and metadata, but scaling out of the metadata layer as well.
That and a hierarchical namespace (I could be sold on compromising some POSIX compliance for performance reasons, but it has to fundamentally be a hierarchical namespace with similar semantics). So object stores would not qualify.
Some generalisations are close enough to true to be worth it.
Eg I'm fairly confident to generalise and say that for most beginners in 2025 picking Python is a better choice than PHP or Cobol.
Of course, you can come up with some contrived scenarios where the beginner would be better served with Cobol.
Limiting choices to “pick one” seems contrived. Beginners should learn to program and think like programmers, which means learning multiple languages and tools. Programming languages have far more in common with each other than not and the sooner a beginner thinks of themselves as a problem solver rather than a Python programmer the better.
> Of course, you can come up with some contrived scenarios where the beginner would be better served with Cobol.
I don’t think looking at job opportunities and pay qualifies as a “contrived scenario.” PHP has a huge footprint in web applications, and every beginner steered away from PHP to the saturated Python world forgoes a lot of opportunity. And again, nothing prevents learning both. COBOL as usual gets trotted out as the dinosaur, but right now in the current tech job apocalypse knowing COBOL would get a lot more job offers than knowing Python.
IMO key value stores tend to live in the space between a third normal form ultra relational UML diagram database like the college textbooks assure you exist and a high chaos cowboy document storage system like mongodb.
They enable you to make a lot of things up as you go and iterate on your design. I like them because they remove a lot of ceremony around letting me get on with persisting things without having to ALTER TABLE or CREATE TABLE and all that entails. At the same time, they're constrained and often organized in a way that storing big ol' json blobs isn't. I like them for doing multiplayer gamedev things.
But you are right that in practice you sometimes want to deviate from them. And then it is still useful to be aware of what the normal form of your database _would_ be, and how you are deviating.
Similarly to how sometimes you might want to manually unroll a loop in your code, and it's still useful to keep in mind conceptually what the original loop would have looked like.
Fundamentally this isn't a theoretical limitation of the relational model, instead it is a historical artefact that "doesn't have to be that way".
Systems like Kubernetes or Azure Resource Manager show how this ought to have been implemented: Declarative resource definitions via an API with idempotency as a core paradigm.
I.e.: Instead of the web developer having to figure out the delta change between two schema versions, they should just "declare" that they want some specific schema. This could happen with every code deployment, with most such changes doing nothing. If the schema does change, then it's up to the database engine to figure out how to make it happen, much like how a Kubernetes cluster reconciles the cluster state with a deployment definition.
KV stores became popular because they do this implicitly, with no explicit schema at all.
Relational databases (or even KV stores!) could instead use an explicit schema with automatic reconciliation instead of manual delta changes implemented by developers.
TL;DR: The tooling is bad. KV is an over-reaction to this, but instead we just needed better tools.
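A rough sketch of what "declare the schema, let the engine reconcile" could look like, using Python's built-in sqlite3 module. `DESIRED` and `reconcile()` are hypothetical names invented here, not an existing tool's API, and the sketch only handles the additive case (new tables and new columns); drops, renames and type changes are deliberately left out.

```python
# Hypothetical sketch of declarative schema reconciliation using sqlite3.
# DESIRED and reconcile() are illustrative names, not an existing tool's API.
import sqlite3

DESIRED = {
    "users": {"id": "INTEGER PRIMARY KEY", "name": "TEXT", "email": "TEXT"},
}


def reconcile(conn, desired):
    """Bring the database up to the declared schema; return the DDL it ran."""
    executed = []
    for table, columns in desired.items():
        # PRAGMA table_info returns no rows when the table does not exist.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if not existing:
            cols = ", ".join(f"{name} {decl}" for name, decl in columns.items())
            ddl = [f"CREATE TABLE {table} ({cols})"]
        else:
            ddl = [
                f"ALTER TABLE {table} ADD COLUMN {name} {decl}"
                for name, decl in columns.items()
                if name not in existing
            ]
        for statement in ddl:
            conn.execute(statement)
            executed.append(statement)
    return executed
```

Calling `reconcile(conn, DESIRED)` on every deployment is idempotent: the first call issues whatever DDL is missing, and later calls with the same declaration do nothing.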
The changes aren't generated on the fly at runtime because it can prompt you for things it suspects or can't figure out, for example if you rename a column the naive way would be an add+delete, which erases data. If the types are the same, it checks whether you wanted that or a rename, and generates the appropriate migration.
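The rename check described above could be sketched roughly as follows. This is an illustration of the idea only, not the actual code of any migration tool; `plan_migration` and the operation tuples are made-up names, and column types are compared as plain strings.

```python
# Hypothetical sketch: diff two column maps and flag likely renames
# instead of silently emitting a destructive drop + add.
def plan_migration(old, new):
    """old/new map column name -> type. Returns a list of operation tuples."""
    removed = {c: t for c, t in old.items() if c not in new}
    added = {c: t for c, t in new.items() if c not in old}
    ops = []
    for new_name, new_type in list(added.items()):
        candidates = [c for c, t in removed.items() if t == new_type]
        if len(candidates) == 1:
            # Same type, one column gone, one appeared: ask, don't guess.
            old_name = candidates[0]
            ops.append(("confirm_rename", old_name, new_name))
            del removed[old_name]
            del added[new_name]
    ops += [("drop", c) for c in removed]
    ops += [("add", c, t) for c, t in added.items()]
    return ops
```

A `confirm_rename` entry is where a tool would stop and prompt the developer, which is why this step runs at development time and not on the fly in production.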
I make my own tools. KV makes it really easy to stop having to care about the database layer really early in a project and live entirely in code. It's fine.
Does that imply that you should give up on KV datastores today, when this product category he's asking for barely exists? No, obviously not.
KVs give you this behavior, they just drop everything else with it.
What is sometimes needed is query plan stability, lack of surprises, and influencing the planner. This is very attainable in the existing SQL databases and is a core feature of the older ones, like Oracle.
Or maybe there is a higher level DSL that you could apply to create query plans (something like MongoDB aggregation pipelines maybe?), but it quickly becomes basically the same as SQL.
https://github.com/permazen/permazen/blob/master/README.md
It's a bit like the record layer in FoundationDB but more advanced. You specify query plans manually, so you can't accidentally forget an index for example.
I think it is because most people can make something work with SQL.
I don't want the dynamic nature of the planner. I don't want to send SQL over the wire, I want to send the already completed plan that I either generated or wrote by hand. So many annoying performance bugs are because the planner did the slow thing. Just let me write/adjust it.
If your use-case is a data warehouse, then you absolutely want more than a K/V database and likely dynamic query plans because the point is dynamic usage. If your use-case is the serving frontend for a >1m requests per second API, then sure, you probably don't want the complexity of a relational database and query planner.
Most things are somewhere in the middle and need to give serious consideration to this.
The poster complains about query plans being unnecessarily dynamic; for certain queries, the plan should be pinned, and only changed in a controlled way. Compare it to something like pip or npm; not being able to pin versions of certain packages could be a source of endless frustrations.
Pinning a query plan to a query could very well be a feature of a relational DB, and it is. Postgres (pg_hint_plan extension), Oracle (a bunch), MS SQL (somehow), they all have ways to pin the query plan. Not sending SQL is calling a stored procedure, also a long-standing feature of relational databases.
Knowing your tech stack goes a long way in battling frustration.
That's not what I mean, I don't want to bother with the SQL layer at all. I want to generate the query plan from the client side and send it off to be executed.
And to my knowledge the hinting extensions don't actually allow you to skip the ceremony and supply your own plan, just (in a very hacky way) adjust it as it runs.
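For illustration, here is a toy version of "send the plan, not the SQL": the client builds a physical plan as plain data, serializes it, and a small executor runs it exactly as written. The operator names, the plan format and the in-memory table layout are all assumptions made up for this sketch; no existing database accepts this format.

```python
# Hypothetical sketch: the client builds a physical plan as plain data and a
# tiny executor runs it exactly as written, with no planner in between.
# Operator names and the in-memory "storage" layout are assumptions.
import json

TABLES = {
    "customer": {
        "rows": {1: {"id": 1, "country": "Sweden", "city": "Lund"},
                 2: {"id": 2, "country": "Norway", "city": "Oslo"},
                 3: {"id": 3, "country": "Sweden", "city": "Malmo"}},
        "indexes": {"country": {"Sweden": [1, 3], "Norway": [2]}},
    }
}


def execute(plan, tables):
    op = plan["op"]
    if op == "index_lookup":
        table = tables[plan["table"]]
        ids = table["indexes"][plan["index"]].get(plan["value"], [])
        return [table["rows"][i] for i in ids]
    if op == "full_scan":
        return list(tables[plan["table"]]["rows"].values())
    if op == "filter":
        rows = execute(plan["input"], tables)
        return [r for r in rows if r[plan["column"]] == plan["equals"]]
    if op == "project":
        rows = execute(plan["input"], tables)
        return [{c: r[c] for c in plan["columns"]} for r in rows]
    raise ValueError(f"unknown operator: {op}")


plan = {
    "op": "project", "columns": ["city"],
    "input": {"op": "index_lookup", "table": "customer",
              "index": "country", "value": "Sweden"},
}
wire_format = json.dumps(plan)  # what would be sent instead of SQL text
result = execute(json.loads(wire_format), TABLES)
```

Because the plan names the index explicitly, a missing index shows up as an immediate error instead of a silent fallback to a slow scan.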
Let the database enforce the serialization format (JSON, BSON, MessagePack, protobuf... anything really) + create and maintain indices, using this fancy crash-proof logic it has. That'll cover 95% of all my database needs.
(OP also asks for row-based layout, types, and non-trivial language. I think those parts are entirely optional)
You can attain what you desire by using an RDBMS, and having all tables with one key column, and a TEXT column with your serialized non-key fields; it's going to be a fun approximation of 6NF. Realistically, you can have all joinable columns as normal columns, indexed as you desire, and the rest of the columns as a serialized blob.
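A minimal sketch of that layout with Python's built-in sqlite3 module: the joinable field is a real, indexed column and everything else goes into one serialized TEXT column. The table and field names (`player`, `guild_id`, `body`) are invented for the example.

```python
# Hypothetical sketch with sqlite3: joinable fields are real, indexed columns;
# everything else lives in one serialized TEXT column. Names are made up.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player (id INTEGER PRIMARY KEY, guild_id INTEGER, body TEXT)"
)
conn.execute("CREATE INDEX player_guild ON player (guild_id)")


def save_player(conn, player_id, guild_id, **fields):
    # Non-key fields are serialized as one JSON blob; no ALTER TABLE needed
    # when a new field appears.
    conn.execute(
        "INSERT OR REPLACE INTO player (id, guild_id, body) VALUES (?, ?, ?)",
        (player_id, guild_id, json.dumps(fields)),
    )


def players_in_guild(conn, guild_id):
    rows = conn.execute(
        "SELECT id, body FROM player WHERE guild_id = ? ORDER BY id", (guild_id,)
    )
    return [{"id": pid, **json.loads(body)} for pid, body in rows]


save_player(conn, 1, 10, name="Ada", level=7)
save_player(conn, 2, 10, name="Bo", inventory=["rope"])
save_player(conn, 3, 20, name="Cy")
```

The trade-off is the usual one: anything inside `body` is opaque to the query planner unless it is promoted to a real column later.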
When you want high parallelism for guaranteed independent segments of data, use sharding.
Indexes, triggers (very good abstraction covering everything from computed fields to dependent fields), transactions.
In Fox, you write more or less `physical query plans` as syntax:
    USE customer    && Opens Customer table
    CLEAR
    SCAN FOR UPPER(country) = 'SWEDEN'
        ? contact, company, city
    ENDSCAN
And what makes this even better is that you can also write `SQL`, so you can have the best of both worlds. BTW, I think this idea can be taken even further, and my take is at https://tablam.org