The speaker presents measurements showing how much overhead the wire protocols of various DBs have. MySQL is the best; PostgreSQL is orders of magnitude worse due to a very inefficient binary format design. Even the best is still 10x worse than netcat.
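A toy sketch (not the talk's actual benchmark) of where some of that overhead comes from: framing every value individually, the way row-oriented wire protocols do, versus shipping one contiguous binary buffer the way netcat would.

```python
import struct

# Toy illustration: ship 100k int32s either as one contiguous binary
# buffer ("netcat" style) or as a chatty per-value text protocol with
# a 4-byte length prefix per field, and compare bytes on the wire.
values = list(range(100_000))

# Raw binary: 4 bytes per value, no per-field framing.
raw = struct.pack(f"<{len(values)}i", *values)

# Per-field framing: each value as ASCII text behind a length prefix,
# roughly how row-oriented protocols frame every field of every row.
chatty = b"".join(
    struct.pack("<I", len(s)) + s
    for s in (str(v).encode() for v in values)
)

print(len(raw), len(chatty))
```

Even this toy framing more than doubles the bytes on the wire; real protocols add handshakes, per-row type metadata, and round trips on top of that.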
Apache Arrow is trying to design a universal protocol for DB access that's more efficient than what's out there currently.
Speaker asserts that scale-out is usually not needed in data analytics, no need to use Spark etc unless you want it on your CV.
Audience member asks "what about multi-user/multi-process access", speaker admits DuckDB basically doesn't do that.
Speaker pitches for using embedded in-proc DBs inside AWS Lambda functions. Not practical to install Oracle RDBMS in something that only runs for 100msec.
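The in-process pattern being pitched, sketched with Python's stdlib sqlite3 as a stand-in (chosen here only because it's installed everywhere; DuckDB's Python API is nearly identical, with duckdb.connect() in place of sqlite3.connect()):

```python
import sqlite3

# In-process, zero-install database: nothing to provision, opens in
# microseconds, dies with the process -- a good fit for a Lambda
# invocation that only lives for ~100ms.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts INTEGER, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "start"), (2, "work"), (3, "stop")])
(count,) = con.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 3
con.close()
```

No server process, no client protocol, no connection string beyond ":memory:" -- which is exactly the contrast being drawn with installing Oracle RDBMS in a function that runs for 100msec.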
A web shell for DuckDB is demonstrated, it uses WASM.
Decentralization is pitched as a reason to avoid 2-tier architecture (separate db engine w/ client protocol).
It's not only impractical, it's hard to get done. I recently tried to run Postgres in an AWS Lambda to create an anonymized DB dump. It was so painful that I gave up and created an access-restricted database to do the anonymization instead. An in-memory mode for Postgres that was as easy to run as SQLite or DuckDB would be so useful for the cases where neither of them can replace it (SQL dumps, testing).
https://github.com/zonkyio/embedded-postgres-binaries
I've been using this for test runners in Node and Go for a while now and it's been quite painless. Would be nice to have wider language support though
What we really want is to store some data somewhere, and later be able to retrieve it, without necessarily knowing what it was we stored or where or how. And we don't want to think about what server it's on, or what hard drive, or what folder. And we don't want to think about client protocols or query languages.
All of that would be possible if we reinvented I/O. Basically, just imagine what you want your experience to be, and then start making up names for functions that do that. Stuff that into a kernel, or a standard library. Now you have I/O that's based on how you really want to use data. The backend implementation can vary, but the point is to make the user experience what we actually want rather than what somebody else thinks is practical. Make the data interface you want to use, and make it a standard.
What happened is that we discovered files are really useful because you don't need to declare the format of the data that goes into the file. The operating system handles reading and writing, and the application organises how it wants to keep the data in the file.
The same really goes for sockets. It is really useful to have somebody transfer the data for you as a stream while you, the application, only worry about the format of the data.
Bytes are a "narrow waist", and in fact DBs actually use that byte-oriented system for storage. By supporting bytes, anything that can be serialized can be stored by the next layer up, and the contract is very simple.
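The "bytes contract" can be sketched in a few lines: the storage layer only ever moves opaque bytes, and the application alone knows the format (JSON here, purely for illustration -- it could just as well be Parquet pages or a B-tree):

```python
import json
import os
import tempfile

# Application-level format -> opaque bytes.
record = {"email": "foo@bar.domain", "visits": 3}
payload = json.dumps(record).encode()

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:        # storage layer: bytes in
    f.write(payload)
with open(path, "rb") as f:           # storage layer: bytes out
    restored = json.loads(f.read())   # application restores its format
os.remove(path)

print(restored == record)
```

The file layer needed no schema, no query language, and no knowledge of what was stored -- which is exactly why it makes such a simple waist for everything above it.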
Different types of data are legal in different jurisdictions (for example, what counts as PII differs), so the physical location of the hard drive matters.
When medical data is stored, where and how it is stored is important. When handling data that legally needs an audit trail, abstractions won't do.
When data is needed at low latency, the details matter. When cost is important (egress charges per operation, or billed by the size of data transferred), details matter.
Not exactly: what matters is the legal designation of the data storage device. The location of that device is one of many factors that "matter", but not to the application, or developer, or user. They only "matter" to the law. We aren't going to start writing UnitedStatesFileWrite() functions, now, are we?
Instead of considering the physical location of a hard drive, what we should be doing is querying a data storage object which has the properties we want:
io_construct = DataStorage()
storage_search = io_construct.DataStorageSearch({
    "contains": [
        {
            "legal": {
                "jurisdiction": {
                    "location": [
                        {"country": "US", "state": "California"}
                    ]
                }
            }
        },
        {"record": [{"email": "foo@bar.domain"}]}
    ]
})
with io_construct.AttachDataStorage(device=storage_search) as io_object:
    io_object.read()
We should never have to think about what building a hard drive is located in, much less the complexities of dealing with specific data laws. The IO construct should deal with that.

1. Price
2. Brand of whoever is providing the storage (matters because it's a proxy for lots of other details)
3. General physical location
Once you've made those decisions, services like S3 abstract the rest. There are tools that let you access these via FUSE (in which case client protocols don't matter).
the paper is here: https://15721.courses.cs.cmu.edu/spring2023/papers/15-networ...
it's a super-contrived example that's not using any of the functionality of the database and is just using it as "cat"
basically just doing cat over localhost. Well, what a surprise: if you add a layer of serialisation, of course it's slower than just doing memcpy()
if you're using your database to store files... maybe don't do that
There are other things wrong with the talk; it takes way too long to get to the point, for one thing. DuckDB is cool and all, but most of data management is getting the data into the right format/place and handling security and the like, not running some query.
not for a database
If that document is just three small fields, then you effectively succeeded in receiving maybe a couple of packets before the server gave up. Pitiful.
Change the batch size to maybe 2 or 20 thousand, enable network compression, increase the client read buffer size from its ridiculously low default, and this could start looking more like the data transfer we expect.
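A toy illustration (stdlib zlib, not the actual client settings being discussed) of why the batching-plus-compression advice matters: a compressor fed one row at a time can't exploit redundancy across rows, while a batch of thousands of similar rows compresses dramatically.

```python
import json
import zlib

# 2000 near-identical rows, like a typical result set.
rows = [{"id": i, "status": "ok", "region": "us-west"} for i in range(2000)]

# Compressing each row as its own message: zlib overhead per row, and
# no chance to find repetition across rows.
per_row = sum(len(zlib.compress(json.dumps(r).encode())) for r in rows)

# Compressing one batched message: the repeated structure across rows
# is visible to the compressor.
batched = len(zlib.compress(json.dumps(rows).encode()))

print(per_row, batched)  # batched is far smaller
```

The same effect is why a sane batch size and network compression turn a "pitiful" transfer into one that actually saturates the link.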
The database can't always do the data reduction and analysis you want to do quickly, and even in many of the cases where it can, trying to tell it about them in SQL and stored procedures can be pretty gross.
I say this as a huge proponent of SQL, stored procedures, and doing lots of work in the database.