The speaker presents measurements showing how much overhead the wire protocols of various DBs have. MySQL is the best; PostgreSQL is orders of magnitude worse due to a very inefficient binary format design. Even the best is still 10x worse than netcat.
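A toy sketch (not the talk's actual benchmark) of where some of that overhead comes from: framing every value individually, the way row-oriented wire protocols do, versus shipping one contiguous binary buffer the way netcat would.

```python
import struct

# Toy illustration: ship 100k int32s either as one contiguous binary
# buffer ("netcat" style) or as a chatty per-value text protocol with
# a 4-byte length prefix per field, and compare bytes on the wire.
values = list(range(100_000))

# Raw binary: 4 bytes per value, no per-field framing.
raw = struct.pack(f"<{len(values)}i", *values)

# Per-field framing: each value as ASCII text behind a length prefix,
# roughly how row-oriented protocols frame every field of every row.
chatty = b"".join(
    struct.pack("<I", len(s)) + s
    for s in (str(v).encode() for v in values)
)

print(len(raw), len(chatty))
```

Even this toy framing more than doubles the bytes on the wire; real protocols add handshakes, per-row type metadata, and round trips on top of that.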
Apache Arrow is trying to design a universal protocol for DB access that's more efficient than what's out there currently.
Speaker asserts that scale-out is usually not needed in data analytics, no need to use Spark etc unless you want it on your CV.
Audience member asks "what about multi-user/multi-process access", speaker admits DuckDB basically doesn't do that.
Speaker pitches for using embedded in-proc DBs inside AWS Lambda functions. Not practical to install Oracle RDBMS in something that only runs for 100msec.
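The in-process pattern being pitched, sketched with Python's stdlib sqlite3 as a stand-in (chosen here only because it's installed everywhere; DuckDB's Python API is nearly identical, with duckdb.connect() in place of sqlite3.connect()):

```python
import sqlite3

# In-process, zero-install database: nothing to provision, opens in
# microseconds, dies with the process -- a good fit for a Lambda
# invocation that only lives for ~100ms.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts INTEGER, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "start"), (2, "work"), (3, "stop")])
(count,) = con.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 3
con.close()
```

No server process, no client protocol, no connection string beyond ":memory:" -- which is exactly the contrast being drawn with installing Oracle RDBMS in a function that runs for 100msec.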
A web shell for DuckDB is demonstrated, it uses WASM.
Decentralization is pitched as a reason to avoid 2-tier architecture (separate db engine w/ client protocol).
It's not only impractical, it's hard to get done. I recently tried to run Postgres in an AWS Lambda to create an anonymized DB dump. It was so painful that I gave up and created an access-restricted database to do the anonymization instead. An in-memory mode for Postgres that was as easy to run as SQLite or DuckDB would be so useful for the cases where neither of them can replace it (SQL dumps, testing).
https://github.com/zonkyio/embedded-postgres-binaries
I've been using this for test runners in Node and Go for a while now and it's been quite painless. Would be nice to have wider language support though
What we really want is to store some data somewhere, and later be able to retrieve it, without necessarily knowing what it was we stored or where or how. And we don't want to think about what server it's on, or what hard drive, or what folder. And we don't want to think about client protocols or query languages.
All of that would be possible if we reinvented I/O. Basically, just imagine what you want your experience to be, and then start making up names for functions that do that. Stuff that into a kernel, or a standard library. Now you have I/O that's based on how you really want to use data. The backend implementation can vary, but the point is to make the user experience what we actually want rather than what somebody else thinks is practical. Make the data interface you want to use, and make it a standard.
What happened is that we discovered files are really useful because you don't need to declare the format of the data that goes into the file. The operating system handles reading and writing, and the application organises how it wants to keep the data in the file.
The same really goes for sockets. It is really useful to have somebody transfer the data for you as a stream while you, the application, only worry about the format of the data.
Bytes are a "narrow waist", and in fact DBs actually use that byte-oriented system for storage. By supporting bytes, anything that can be serialized can be stored by the next layer up, and the contract is very simple.
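The "bytes contract" can be sketched in a few lines: the storage layer only ever moves opaque bytes, and the application alone knows the format (JSON here, purely for illustration -- it could just as well be Parquet pages or a B-tree):

```python
import json
import os
import tempfile

# Application-level format -> opaque bytes.
record = {"email": "foo@bar.domain", "visits": 3}
payload = json.dumps(record).encode()

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:        # storage layer: bytes in
    f.write(payload)
with open(path, "rb") as f:           # storage layer: bytes out
    restored = json.loads(f.read())   # application restores its format
os.remove(path)

print(restored == record)
```

The file layer needed no schema, no query language, and no knowledge of what was stored -- which is exactly why it makes such a simple waist for everything above it.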
Different types of data are legal in different jurisdictions (for example, what counts as PII differs), so the physical location of the hard drive matters.
When medical data is stored, where and how it is stored is important. When handling data that legally needs an audit trail, abstractions won't do.
When data is needed at low latency, the details matter. When cost is important (egress charges per operation, or billed by the size of data transferred), details matter.
Not exactly: what matters is the legal designation of the data storage device. The location of that device is one of many factors that "matter", but not to the application, or developer, or user. They only "matter" to the law. We aren't going to start writing UnitedStatesFileWrite() functions, now, are we?
Instead of considering the physical location of a hard drive, what we should be doing is querying a data storage object which has the properties we want:
io_construct = DataStorage()
storage_search = io_construct.DataStorageSearch({
    "contains": [
        {
            "legal": {
                "jurisdiction": {
                    "location": [
                        {"country": "US", "state": "California"}
                    ]
                }
            }
        },
        {"record": [{"email": "foo@bar.domain"}]}
    ]
})
with io_construct.AttachDataStorage(device=storage_search) as io_object:
    io_object.read()
We should never have to think about what building a hard drive is located in, much less the complexities of dealing with specific data laws. The IO construct should deal with that.

1. Price
2. Brand of whoever is providing the storage (matters because it's a proxy for lots of other details)
3. General physical location
Once you've made those decisions, services like S3 abstract the rest. There are tools that let you access these via FUSE (in which case client protocols don't matter).
the paper is here: https://15721.courses.cs.cmu.edu/spring2023/papers/15-networ...
it's a super-contrived example that's not using any of the functionality of the database and is just using it as "cat"
basically just doing cat over localhost. Well, what a surprise: if you add a layer of serialisation, of course it's slower than just doing memcpy()
if you're using your database to store files... maybe don't do that
There are other things wrong with the talk; it takes way too long to get to the point, for one thing. DuckDB is cool and all, but most of data management is getting the data into the right format/place and handling security and the like, not running some query.
not for a database
If that document is just three small fields, then you effectively succeeded in receiving maybe a couple of packets before the server gave up. Pitiful.
Change the batch size to maybe 2 or 20 thousand, enable network compression, increase the client read buffer size from its ridiculously low default, and this could start looking more like the data transfer we expect.
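A toy illustration (stdlib zlib, not the actual client settings being discussed) of why the batching-plus-compression advice matters: a compressor fed one row at a time can't exploit redundancy across rows, while a batch of thousands of similar rows compresses dramatically.

```python
import json
import zlib

# 2000 near-identical rows, like a typical result set.
rows = [{"id": i, "status": "ok", "region": "us-west"} for i in range(2000)]

# Compressing each row as its own message: zlib overhead per row, and
# no chance to find repetition across rows.
per_row = sum(len(zlib.compress(json.dumps(r).encode())) for r in rows)

# Compressing one batched message: the repeated structure across rows
# is visible to the compressor.
batched = len(zlib.compress(json.dumps(rows).encode()))

print(per_row, batched)  # batched is far smaller
```

The same effect is why a sane batch size and network compression turn a "pitiful" transfer into one that actually saturates the link.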
The database can't always do the data reduction and analysis you want to do quickly, and even in many of the cases where it can, trying to tell it about them in SQL and stored procedures can be pretty gross.
I say this as a huge proponent of SQL, stored procedures, and doing lots of work in the database.