anarazel 9mo ago

> There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.

What are you referencing here?

jorangreef 9mo ago

The scenarios described in our QCon London talk linked above.

This surveys the excellent storage fault research from UW-Madison, and in particular:

  “Can Applications Recover from fsync Failures?”

  “Protocol-Aware Recovery for Consensus-Based Storage”

Finally, I'd recommend watching “Consensus and the Art of Durability”, our talk from SD24 in NYC last year:

https://www.youtube.com/watch?v=tRgvaqpQPwE

kiitos 9mo ago

    [disks are] somewhere between non-Byzantine fault tolerance and
    Byzantine fault tolerance ... you expect the disk to be almost 
    an active adversary ...
    ...
    so you start to see just a single disk as a distributed system

My goodness, not at all! If you can't trust the interface to a local disk, then you're lost at a fundamental level. And even ignoring that, a disk is an implementation detail of a node in a distributed system. Whatever properties that disk may have for the local node are irrelevant in the context of the broader system; they are the responsibility of the local node to manage before it communicates anything to other nodes in that broader system.

Combined with https://www.youtube.com/watch?v=tRgvaqpQPwE, it seems like the author/presenter is conflating local, disk-related properties and details with distributed, system-level requirements and guarantees. If consensus requires a node to have durably persisted some bit of state before it sends a particular message to other nodes in the distributed system, then it doesn't matter how that persistence is implemented; it only matters how that persistence is observable. Disks, FS caches, etc. aren't requirements; they're just one of many possible implementation choices.
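To make the "persist before you send" point concrete, here is a minimal sketch, not anything taken from the talk or from TigerBeetle. The names `DurableStore`, `FileStore`, and `persist_then_send` are hypothetical, made up for illustration. The only assumption is that `persist()` returns without raising only once the record is durable; the file-plus-fsync version is just one possible implementation behind that interface.

```python
import os
from typing import Callable, Protocol


class DurableStore(Protocol):
    """Hypothetical interface: persist() returns only once the record is durable."""

    def persist(self, record: bytes) -> None: ...


class FileStore:
    """One possible implementation (an assumption): append to a file, then fsync."""

    def __init__(self, path: str) -> None:
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def persist(self, record: bytes) -> None:
        view = memoryview(record)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(self._fd, view)
            view = view[written:]
        os.fsync(self._fd)  # raises OSError if the flush fails

    def close(self) -> None:
        os.close(self._fd)


def persist_then_send(
    store: DurableStore, record: bytes, send: Callable[[bytes], None]
) -> bool:
    """Send the acknowledgement only if persist() returned without error."""
    try:
        store.persist(record)
    except OSError:
        # Not durable, so no message leaves the node. No retry is attempted here.
        return False
    send(b"ack:" + record)
    return True
```

The caller only ever sees whether `persist()` succeeded; swapping `FileStore` for any other object with the same method leaves `persist_then_send` unchanged.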

jorangreef 9mo ago

Recommend you first read the FAST18-winning “Protocol-Aware Recovery for Consensus-Based Storage”.

It’s a mindbender of a paradigm-shift for how to think about local recovery actions in the context of the global consensus protocol!
