With SQLite, you would instead have one database (one file) per user as close to the user as possible that has all of the user's data and you would just read/write to that database. If your application needs to aggregate multiple user's data, then you use something like Litestream to routinely back it up to S3, then when you need to aggregate data you can just access it all there and use a distributed system to do the aggregation on the SQLite database files.
For example, one database per retail store with sharded customers, orders and inventory (most transactions involve one order of one customer fulfilled from that store) and a shared one with item descriptions and prices (modified, say, daily).
No it doesn't. What gave you that idea? You still have all "ACID guarantees" within each database.
>You can’t do cross-database transactions (to my knowledge),
That's true of most databases. If you have two apps and they use two different databases you won't have transactions across those two apps.
>which means you can end up with corrupted data during aggregations.
No, aggregations within each database work as you would expect.
>What am I missing?
As others have said, you use this pattern only if you don't intend to cross databases. By the way, in NoSQL databases like MongoDB, every document is its own little database so having a per user database is a massive upgrade over NoSQL.
- I was talking about ACID guarantees across databases (ie across users) - I was talking about aggregations across databases (ie across users)
Of course working inside one database works as you would expect it to. My point was that this pattern of database-per-user seems to be a totally different design than people have used with traditional n-tier designs.
Good point about NoSQL! But, wasn’t part of the reason MongoDB fell out of favor because it was lacking consistency?
With ActorDB, each shard (what it calls actors) is its own dedicated SQLite database (but efficiently colocated in the same block storage as all the others, so essentially a virtual database). The intention with the sharding is to allow data that logically belongs together to be stored and replicated together as a shard; for example, all of a single user's data. When you want to run an SQL query you tell it which shard to access (as part of the query), and it routes the request to the right server.
It has some pretty clever stuff to allow querying across multiple such "actors", so you can still get a complete view of all your data.
Sadly, I don't think it's maintained anymore. Last commit was in 2019.
For a project I currently have, I am parititioning not only the database but also the software for each "customer" (let's call them that for now), so basically I have my single-threaded Python application that uses SQLite and put that in a container with Litestream and run one of those containers for each customer on one or more kubernetes clusters or on anything else that runs a container (or you can run the software directly). Then you can take the data from one to N of those customers and aggregate it and process it for monitoring and central configuration control and backup with another bit of software that does a distributed query on the S3 data that Litestream is saving. I can also control several of the systems in a coordinated way, if needed, by linking them together in a configuration file and letting them elect a leader that will be the central point of contact for a source of "work" (let's call it that for now) and then that leader can notify the other instances of the work to be done and help to dole out that work in a reasonable manner based on the resources and capabilities of the aggregation.
A few similar projects I worked on that were centralized were a huge mess in both the database and the codebase due to trying to do the same thing for thousands of customers with several dozen horizontal-scaling services that had to be coordinated. In my system, I just write one single-threaded bit of software that does the whole thing, but just for one customer at a time, so the whole system is horizontally scaleable instead of just each service within the system. I can still do the same things to aggregate data from the individual customer software units, and as described above I can still add coordination features for groups of individual customer units, so there is no loss in features, but the whole system is simple to reason and think about with none of the usual synchronization pitfalls you get in the usual distributed system.