The Inner Workings of Distributed Databases

(questdb.io)

175 points · bluestreak · 3y ago · 27 comments

23 comments · 7 top-level

MuffinFlavored · 3y ago · 12 in thread

When is the right time to "level up" from "I'm good with just plain old Postgres" to QuestDB, InfluxDB, Patroni, etc.?

> Unfortunately, automatic failover is solved neither by PostgreSQL nor TimescaleDB, but there are 3rd-party solutions like Patroni that add support for that functionality. PostgreSQL describes the process of failover as STONITH (Shoot The Other Node In The Head), meaning that the primary node has to be shot down once it starts to misbehave.

Does QuestDB do "Raft consensus"? I don't see Raft mentioned in the article.

Aren't all distributed databases basically really clever wrappers around write-ahead log + really tight timestamp/clock syncing?

diarrhea · 3y ago

> Aren't all distributed databases basically really clever wrappers around write-ahead log + really tight timestamp/clock syncing?

As far as I know, the second requirement is often solved differently. Google’s Spanner has tight clock synchronisation via GPS and/or atomic clocks, and will even report uncertainties. Knowing these uncertainties allows it to simply wait them out before committing, for example.
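To make "wait them out" concrete, here is a minimal sketch of a commit-wait loop. It is not Spanner's actual API: `UncertainClock`, `TimeInterval`, and `commit_with_wait` are hypothetical names, and the fixed `epsilon_s` stands in for whatever uncertainty bound the real clock service would report.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # true time is assumed to be no earlier than this
    latest: float    # true time is assumed to be no later than this


class UncertainClock:
    """Hypothetical clock that reports local time plus/minus a fixed uncertainty."""

    def __init__(self, epsilon_s, source=time.monotonic):
        self.epsilon_s = epsilon_s
        self._source = source

    def now(self):
        t = self._source()
        return TimeInterval(t - self.epsilon_s, t + self.epsilon_s)


def commit_with_wait(clock, sleep=time.sleep):
    # Choose the commit timestamp at the upper bound of the current interval.
    commit_ts = clock.now().latest
    # Wait until even the earliest possible true time is past commit_ts,
    # so no other node can later pick a smaller timestamp for a later event.
    while clock.now().earliest <= commit_ts:
        sleep(clock.epsilon_s / 10)
    return commit_ts
```

The wait is roughly twice the uncertainty, which is why a tight bound matters so much for commit latency.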

But in general, exact timekeeping and clock syncing is often too hard and costly. Luckily, it’s often not required and one can make do with logical clocks, such as version vectors or Lamport timestamps. These order events by causality (A before B, B before A, A and B happened concurrently), which eventually allows the WAL to be sorted deterministically.
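A rough sketch of Lamport timestamps, assuming a toy in-memory setup. `LamportNode` and `merged_order` are hypothetical names, and breaking ties by node id is just one common way to turn the counters into a total order.

```python
from dataclasses import dataclass, field


@dataclass
class LamportNode:
    node_id: str
    counter: int = 0
    log: list = field(default_factory=list)

    def local_event(self, payload):
        # Every local event advances the logical clock by one.
        self.counter += 1
        stamp = (self.counter, self.node_id)
        self.log.append((stamp, payload))
        return stamp

    def receive(self, remote_stamp, payload):
        # On receipt, jump past the sender's counter so the receive
        # is ordered after the event that caused it.
        remote_counter, _ = remote_stamp
        self.counter = max(self.counter, remote_counter) + 1
        stamp = (self.counter, self.node_id)
        self.log.append((stamp, payload))
        return stamp


def merged_order(*nodes):
    # Sort all entries by (counter, node_id); the node id breaks ties,
    # so every node that sees the same entries computes the same order.
    entries = [entry for node in nodes for entry in node.log]
    return [payload for _, payload in sorted(entries, key=lambda e: e[0])]


a, b = LamportNode("a"), LamportNode("b")
sent = a.local_event("a: write x")
b.local_event("b: write y")
b.receive(sent, "b: saw x")
a.local_event("a: write z")
print(merged_order(a, b))
```

A Lamport timestamp on its own cannot tell you whether two events were concurrent; it only gives an order that respects causality. Version vectors carry enough information to detect concurrency.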

Things like multi-leader with async replication will inevitably run into conflicts though. These will need some sort of resolution (manually or automatically via CRDTs). There’s no way around it due to the built-in, inherent possibility of concurrent writes.
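As an illustration of automatic resolution, here is a sketch of one of the simplest CRDTs, a grow-only counter. The class name `GCounter` is my own; the point is only that `merge` is commutative, associative, and idempotent, so replicas converge regardless of the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Take the per-node maximum; merging the same state twice is harmless.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


left, right = GCounter("left"), GCounter("right")
left.increment(2)    # concurrent write on one leader
right.increment(3)   # concurrent write on the other leader
left.merge(right)
right.merge(left)
print(left.value(), right.value())
```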

Note that concurrent in these scenarios has essentially nothing to do with time. It’s not about “happened at the same time”. It’s a question of “did A know about B?”. No? Then A can’t be causally dependent on B and they are concurrent events. Exactly like two “parallel” branches in git. They’ll need to be merged later on, and conflicts will need to be resolved.
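The "did A know about B?" question can be sketched with version vectors. This is a toy comparison function, with vectors represented as plain dicts from node id to counter; `compare` is a hypothetical helper, not from any specific library.

```python
def compare(a, b):
    """Compare two version vectors (dicts of node id -> counter)."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything a has seen
    if b_le_a:
        return "after"       # a has seen everything b has seen
    return "concurrent"      # neither knew about the other: needs a merge


print(compare({"n1": 1}, {"n1": 2}))                    # one knew about the other
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # two "parallel branches"
```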

Lastly, if we can deterministically order events, every node can reach the same conclusions. This is equivalent to consensus.

So my take would be: distributed databases are often about a log of (write) events, and some consensus mechanism to agree upon the exact order in that log. Logical clocks are a good solution for that, but physical clocks can be made to work as well (Google Spanner).

This is all taken from the book “Designing Data Intensive Applications”, a great read!

moomoo11 · 3y ago

You don’t choose a database to “level up”. It’s a tool.

Use the right tool for the right job.

I’ve migrated rdbms to wide column databases like Cassandra or dynamo because we had specific requirements that rdbms were not fulfilling.

I’ve also migrated from document database to rdbms because the document store didn’t meet our specific requirements.

I wouldn’t just use any random database because I want to appear cool (?) because I know Cassandra or how to use a vector database. That’s not the point.

jimbokun · 3y ago

I want to thank you for the advice to “use the right tool for the job” because it’s certainly not banal, prosaic advice that is invoked in every technology discussion.

Dylan16807 · 3y ago

> I wouldn’t just use any random database because I want to appear cool

Obviously! "level up" does not imply you used the wrong thing on purpose and you're switching to another solution that's better in every way, as you seem to have read it.

tabtab · 3y ago

> I’ve migrated rdbms to wide column databases like Cassandra or dynamo because we had specific requirements that rdbms were not fulfilling.

I'm curious about a common situation. Could RDBMS be improved in that area, or do they inherently lack some necessary property?

I will agree that current RDBMS tend to lack dynamism, and that should be remedied: https://www.reddit.com/r/CRUDology/comments/12ari2l/dynamic_...

fzliu · 3y ago

Here's a page on the architecture of a distributed vector database (Milvus) for anybody interested: https://milvus.io/docs/architecture_overview.md

omneity · 3y ago

I wouldn't necessarily call it a level up.

There are a lot of use cases for which Postgres works very well at scale, and the main benefit of these specialized solutions is more of a convenience layer.

hinkley · 3y ago

> failover as STONITH (Shoot The Other Node In The Head)

What functional consensus protocol doesn't mandate attempted murder? When a node becomes incoherent it can't be relied upon to notice that it has done so and bow out gracefully. Like cancer, there is always a chance that 'cell death' will fail and leave you in a pathological state.

grogers · 3y ago

If your consensus protocol requires that, it is probably broken. If you can't rely on a node to shut itself down then you almost certainly can't rely on an external trigger to do it. Paxos, Raft, etc. work just fine as long as failures are non-Byzantine. Achieving non-Byzantine failures is definitely not always possible (e.g. someone hacking your server and reprogramming it to subvert the protocol) but checksums on disk and network go most of the way.
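A small sketch of the checksum idea, assuming a made-up record framing (4-byte length, 4-byte CRC32, then the payload). `encode_record` and `decode_record` are hypothetical helpers; real systems use their own formats and often stronger checksums. The effect is that silent corruption turns into a detectable error, which is a failure the protocol can handle.

```python
import struct
import zlib

# Assumed frame layout: big-endian payload length, then CRC32 of the payload.
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_record(frame: bytes) -> bytes:
    if len(frame) < HEADER.size:
        raise ValueError("truncated header")
    length, expected_crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) != expected_crc:
        # Corrupted on disk or in transit: refuse it rather than act on garbage.
        raise ValueError("checksum mismatch")
    return payload
```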

olluk · 3y ago

Perhaps the multi-master approach is an example of a system where incoherent does not mean terminal illness.

remram · 3y ago

Most consensus algorithms assume that misbehaving nodes are limited to some subset of possible behaviors. The algorithms that don't make that assumption are called "Byzantine" (they handle, e.g., a node that lies and maliciously tries to misinform other nodes about the state of the system), and they are a very short list.

If you can tell that a node failed, there are usually other opportunities for circuit-breaking than shooting it, such as at the hypervisor, load-balancer, or even clients.
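For the client-side option, a minimal circuit-breaker sketch. `CircuitBreaker` and its parameters are hypothetical, and the thresholds are arbitrary; the idea is just that the client stops sending traffic to a node after repeated failures and probes it again after a cool-down.

```python
import time


class CircuitBreaker:
    """Client-side breaker: stop calling a node after repeated failures."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a probe request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each request to the node and report the outcome with `record_success()` or `record_failure()`.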

Andys · 3y ago

CRDB is almost a drop-in replacement at this point. I personally found it easier to run locally than postgres.

Raminj95 · 3y ago · 3 in thread

Is there any book, textbook, or course out there that goes through how to write a database or DBMS from scratch up to something useful, in the style of Nand to Tetris? I have been looking, but I feel like there is not much on this topic out there.

gavinray · 3y ago

There is a book that is exactly this, "Database Design and Implementation" by Edward Sciore.

You write a database in Java while having the principles explained along the way.

https://www.amazon.com/Database-Design-Implementation-Data-C...

Raminj95 · 3y ago

This looks amazing, thank you for the recommendation!

_georgesim_ · 3y ago

You can take CMU's 15-445, which is available online with lectures uploaded to YouTube: https://15445.courses.cs.cmu.edu/fall2022/

hartem_ · 3y ago · 1 in thread

What was the most interesting thing that you learned while implementing the WAL? Have you thought about how WAL is going to work in the multi-master setup?

olluk · 3y ago

We write to the WAL and then register the transaction in the transaction sequence registry. If a concurrent transaction was registered between the start and the end of our transaction, we update the current uncommitted transaction data with the concurrent transactions and retry registering it in the sequencer. To scale to multi-master, we will move the transaction sequence registry to a service with a consensus algorithm.
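A rough sketch of that write-then-register loop as an optimistic compare-and-set. This is not QuestDB's code: `Sequencer`, `commit`, the dict-based `wal`, and the `rebase` callback are all hypothetical stand-ins used only to show the shape of the retry.

```python
import threading


class Sequencer:
    """Hypothetical transaction sequence registry (single process, in memory)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._registered = []  # transaction ids in registration order

    def latest(self):
        with self._lock:
            return len(self._registered)

    def try_register(self, txn_id, expected_latest):
        # Succeeds only if nothing was registered since expected_latest.
        with self._lock:
            if len(self._registered) != expected_latest:
                return None
            self._registered.append(txn_id)
            return len(self._registered)

    def since(self, seq):
        with self._lock:
            return list(self._registered[seq:])


def commit(sequencer, wal, txn_id, data, started_at, rebase):
    wal[txn_id] = data  # write to the WAL first
    seen = started_at
    while True:
        seq = sequencer.try_register(txn_id, seen)
        if seq is not None:
            return seq, data
        # Someone registered concurrently: fold their transactions into
        # our uncommitted data, update the WAL entry, and try again.
        concurrent = sequencer.since(seen)
        data = rebase(data, concurrent)
        wal[txn_id] = data
        seen += len(concurrent)
```

In a multi-master setup the `Sequencer` would presumably be the part that moves behind a consensus service, as described above; the retry loop on the writer side could stay the same shape.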

franckpachot · 3y ago

YugabyteDB (open source) uses the Postgres code, to provide all PostgreSQL features, plugged on top of a Spanner-like distributed storage and transactions to scale horizontally: https://docs.yugabyte.com/preview/architecture/layered-archi...

gregwebs · 3y ago

TiDB and CRDB handle all these scenarios. They are designed for synchronized distributed replication from the ground up and a tremendous amount of engineering work has gone into these systems.

marsupialtail_2 · 3y ago

In case people are interested, I wrote a post about fault tolerance strategies of data systems like Spark and Flink: https://github.com/marsupialtail/quokka/blob/master/blog/fau...

The key difference here is that these systems don't store data, so fault tolerance means recovering within a query instead of not losing data.

foodoos · 3y ago

> we chose our goal to be achieving multi-master replication with Async consistency. We believe that this approach strikes the best balance of fault tolerance and transaction throughput.

"SLOG: Serializable, Low-latency, Geo-replicated Transactions"

https://par.nsf.gov/servlets/purl/10126332
