Hermitage: Testing the “I” in ACID (opens in new tab)

(martin.kleppmann.com)

110 pointsmartinkl11y ago33 comments

33 comments

27 comments · 10 top-level

jeffdavis11y ago· 6 in thread

"...able withdraw more money than they had in their account...Most so-called ACID databases — for example Postgres, MySQL, Oracle or MS SQL Server — would not have prevented this race condition in their default configuration."

The author didn't really show that this was true. He obviously understands isolation well, so I would tend to believe him, but it would be nice to see an example.

In a simple case (not sure that it matches the exchange's case), postgresql in any configuration will prevent this problem:

    CREATE TABLE account(
      id int8,
      balance numeric, check (balance >= 0)
    );

No matter what concurrent activity you have going on, it's impossible (as far as I know) to end up seeing a balance less than 0. That's actually true in any isolation mode that postgres supports (read committed, snapshot isolation, and truly serializable).

Does someone have a counterexample, or more details about the case at the exchange that would not be solved by postgres?

I'd also like to point out that the postgres's implementation of true serializability performs quite well and there isn't much of a cost to using it over snapshot isolation.

jrapdx311y ago

The key phrase is the line quoted is "default configuration". According to PostgreSQL docs, the default isolation level is "READ COMMITTED".

Your example check constraint indeed appears to work as advertised. Trying

    UPDATE account SET balance = -1.00 where id = 1;

gives an ERROR message re: new row for relation "account" violates check constraint ...

Still it's not clear exactly what the author sees re: where postgres fails. I'll have to read the article again. With any luck someone more knowledgeable than I am will shed light on the question.

jasonwatkinspdx11y ago

In short: prior to version 9.2, postgres's "serializable" level still allowed some transaction patterns that would be rejected by a strictly serial execution. Later versions prevent it.

This paper: http://dl.acm.org/citation.cfm?id=1376690 outlines the general issue. This paper discusses Postgres's solution: http://drkp.net/papers/ssi-vldb12.pdf

jasonwatkinspdx11y ago

It's typically called Write Skew. Wikipedia discusses it on the snapshot isolation page (http://en.wikipedia.org/wiki/Snapshot_isolation).

Also, Peter Bailis's blog has info and links to papers if you want more detail (http://www.bailis.org/blog/when-is-acid-acid-rarely/)

jeffdavis11y ago

I know, in general, about various kinds of serialization anomalies, and I know (in moderate detail) how all of the isolation modes are implemented in postgres.

What I don't know is about the particular problem the exchange encountered, and why postgres's isolation would not have prevented it. Based on the very brief descriptions of the problem that I've seen, it seems like postgres would have prevented it, but I don't have enough information to say for sure.

1 more reply

wbkang11y ago

I think this could happen if you do not use 'compare-and-swap' and simply update account set balance=balance-9 twice at the same time when the account balance is 10. The account balance won't go below zero but you ended up withdrawing twice.

jeffdavis11y ago

But that will not happen in postgres on any version in any configuration with any isolation level. It just can't happen.

1 more reply

jeffdavis11y ago· 4 in thread

I strongly rebut the following claim, which is central to the article:

"Internet commenters, in their infinite wisdom, were quick to point out that if you’re dealing with money, you had better use an ACID database. But there was a major flaw in their argument. Most so-called ACID databases — for example Postgres, MySQL, Oracle or MS SQL Server — would not have prevented this race condition in their default configuration."

I hesitate because I don't really understand the details of the situation the exchange faced. But, going by the linked references here:

https://bitcointalk.org/index.php?topic=499580 http://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_...

it appears that the pattern in question, if translated very unnaturally to SQL, is something like:

    CREATE TABLE account(id int8, balance numeric);
    ...
    BEGIN;
    SELECT balance FROM account WHERE id = 123;
    -- application sees 100, subtracts 90, sees that
    -- it's still positive and does:
    UPDATE account SET balance = 10 WHERE id = 123;
    COMMIT;

Technically speaking, running that in postgres in the default configuration (read committed) is prone to a race, and you'd need to use repeatable read or serializable mode to protect you.

But that's ridiculous. Anyone using SQL would instead do:

    CREATE TABLE account(id int8, balance numeric, check(balance >= 0));
    ...
    UPDATE account SET balance = balance - 90 WHERE id = 123;

And that is not prone to a race. Try it in any version of postgres, in any configuration. You can't get double withdrawls (where only one takes effect), and you can't get it to go below zero.

So, the author is technically right: (a) if you translate the NoSQL-isms into SQL in an unnatural way; and (b) don't bother to use SERIALIZABLE mode, which costs very little in most situations.

I agree with the author that isolation is tricky, and developers should not be expected to understand the nuances. And I applaud the development of a testing framework to really understand the various kinds of isolation and how they apply to different products. But the example is a bad one, because it actually does work just fine in postgres, and probably many other systems.

martinklOP11y ago

Your first ("very unnatural") example is what I had in mind. And since read committed is the default isolation level in most RDBMS, it is prone to the lost update anomaly. (In MySQL, repeatable read is the default, but its implementation of repeatable read doesn't prevent lost updates.) Note I did point out that I'm referring to the default configuration, not the strongest supported isolation level.

Although your second example is probably what a human would write, an ORM framework would very likely generate a transaction looking like your first example.

Another example would be inserting a transaction into a table, and summing the transactions in the account in order to calculate the account balance. Making that safe requires preventing phantom reads, which means requiring serializability.

You say serializable costs very little in most situations. I can't claim to know what most situations are like, but all I know is that I've seen many people who have tried serializable and found it too slow for them. User a-priori gives an example elsewhere on this thread: http://www.michaelmelanson.net/2014/03/20/transactions/

My point is that weak isolation is very subtle, easy to get wrong, and you don't know that you got it wrong until it's too late. We need better understanding and better tools so that concurrency is less easy to screw up.

jeffdavis11y ago

Your example of poor SERIALIZABLE performance is from a MySQL user. Have you tried SERIALIZABLE in PostgreSQL? The implementation is based on fairly recent research, and performs quite well for a lot more use cases.

And the point about ORMs is valid, but still not enough to back up your unqualified claim that using postgres would not have solve the problem. It might not have, but postgres offers a lot of tools to solve this problem (we didn't even discuss SELECT ... FOR UPDATE), and even inexperienced users are at least more likely to have stumbled into one of those solutions.

I very much agree that isolation issues are subtle traps for many users, even in SQL. I would like to see SERIALIZABLE become more common, and eventually the default, in postgres. I also like the fact that you're writing real tools to check up on these in a formal way.

But please be a little careful when making statements like that, because it can turn people away from the systems most likely to help them.

bmm6o11y ago

I strongly rebut the following claim, which is central to the article

I would disagree that offering an alternate solution that avoids the problem is a rebuttal of the fact that there are quirks in isolation level implementations across RDBMSs that can surprise even seasoned developers.

jeffdavis11y ago

The author specifically claimed that using a safer system like postgres would not have prevented the problem. But the evidence only says that it may not have prevented the problem.

It still looks to me like they would have stood a better chance with poatgres.

jrullmann11y ago· 3 in thread

Very cool - would love to see tests of FoundationDB!

FYI I'm an engineer at FDB, happy to help.

martinklOP11y ago

Awesome, I'd love a pull request! I've been looking at FoundationDB, but haven't had time to test it. Porting the tests to another database is (hopefully) a mostly mechanical exercise.

wwilson11y ago

Here's one way you could run this test vs. FoundationDB:

https://gist.github.com/MMcM/f00108c5943e919f73d1

jrullmann11y ago

Just submitted a pull request for FoundationDB: https://github.com/ept/hermitage/pull/1

Animats11y ago· 2 in thread

A key concept here is that real databases have transactions, there's some concurrency between transactions, and that in the more efficient modes, transactions can deadlock, fail, and be rolled back.

In MySQL, a statement within a transaction can return "ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction". This can be forced by the following sequence of events. (This assumes an InnoDB table in Repeatable Read mode.)

    Process A does a START TRANSACTION.
    Process B does a START TRANSACTION.
    Process A does a SELECT which reads row X.
    Process B does a SELECT which reads row X.  
    Process A does an UPDATE which writes row X.
    Process B does an UPDATE which writes row X.  - Deadlock error.

Process B gets a report that the transaction failed, and everything done in B's transaction is rolled back. The entire transaction has to be retried. Transactions are atomic if they commit, but can fail in a deadlock situation.

A SELECT does lock parts of the database. "Repeatable read" means that if you read the same data item twice within the same transaction, you get the same result, even if someone else is changing the data. This requires locking. If you try to update the data in conflict with another process, you'll get a deadlock error, but if you COMMIT a select-only transaction, you won't. You do have to COMMIT select-only transactions, or you'll fill memory with locks and stall out updates.

Ref: http://dev.mysql.com/doc/refman/5.1/en/innodb-lock-modes.htm...

martinklOP11y ago

What you describe is the Lost Update (P4) anomaly. It's in the test suite: https://github.com/ept/hermitage/blob/master/mysql.md#lost-u...

MySQL in repeatable read mode actually doesn't prevent this anomaly (it doesn't deadlock). You need to be in serializable mode to get the deadlock.

Animats11y ago

You're right. MS SQL Server does reject that on repeatable read, but MySQL does not.

Should MySQL have the same behavior? It detects the problem and blocks, but then does the update when the other transaction finishes, losing one update. I tried this for different values of "value" in each process, and it still fails. (The test sets the same value from each process, so you can't see who wins the race or if the database treated the update as a no-change transaction.)

1 more reply

a-priori11y ago· 2 in thread

In case it's interesting to people, here's a blog post I made earlier this year on the topic of what exactly transaction isolation means:

http://www.michaelmelanson.net/2014/03/20/transactions/

martinklOP11y ago

Nice example of a bug caused by weak isolation. FWIW, Postgres has an interesting implementation of "serializable" which takes far fewer locks than MySQL, so may give you better performance while retaining the same isolation level.

fdr11y ago

Yes: it is the relatively new approach by Cahill et al in 2008, known as SSI: https://courses.cs.washington.edu/courses/cse444/08au/544M/R...

Which, as far as production-common database implementations go, is lightspeed for implementing new academic work (9.1, the first version with the feature, was released in 2011).

wpietri11y ago

Interesting! One minor quibble:

> The idea of isolation is that we want our database to be able to process several transactions at the same time (otherwise it would be terribly slow)

Not necessarily true. Things like Prevayler and LMAX provide isolation by processing transactions one at a time, and they're very fast. They manage this by keeping everything relevant hot in RAM. LMAX, for example, can do 6 million TPS for a financial trading platform. You can read Martin Fowler writing about LMAX here: http://martinfowler.com/articles/lmax.html

fintler11y ago

I wonder what approach you could use for global safe timestamp snapshot transaction checking.

https://static.googleusercontent.com/media/research.google.c...

talles11y ago

There's been so much drugs articles in HN lately that I literally expected some LSD story here...

jackhulsom11y ago

Nice ! Very interesting.

amixofpersons11y ago

An interesting read.

j / k navigate · click thread line to collapse

33 comments

27 comments · 10 top-level

jeffdavis11y ago· 6 in thread

The author didn't really show that this was true. He obviously understands isolation well, so I would tend to believe him, but it would be nice to see an example.

In a simple case (not sure that it matches the exchange's case), postgresql in any configuration will prevent this problem:

    CREATE TABLE account(
      id int8,
      balance numeric, check (balance >= 0)
    );

Does someone have a counterexample, or more details about the case at the exchange that would not be solved by postgres?

I'd also like to point out that the postgres's implementation of true serializability performs quite well and there isn't much of a cost to using it over snapshot isolation.

jrapdx311y ago

The key phrase is the line quoted is "default configuration". According to PostgreSQL docs, the default isolation level is "READ COMMITTED".

Your example check constraint indeed appears to work as advertised. Trying

    UPDATE account SET balance = -1.00 where id = 1;

gives an ERROR message re: new row for relation "account" violates check constraint ...

Still it's not clear exactly what the author sees re: where postgres fails. I'll have to read the article again. With any luck someone more knowledgeable than I am will shed light on the question.

jasonwatkinspdx11y ago

In short: prior to version 9.2, postgres's "serializable" level still allowed some transaction patterns that would be rejected by a strictly serial execution. Later versions prevent it.

This paper: http://dl.acm.org/citation.cfm?id=1376690 outlines the general issue. This paper discusses Postgres's solution: http://drkp.net/papers/ssi-vldb12.pdf

jasonwatkinspdx11y ago

It's typically called Write Skew. Wikipedia discusses it on the snapshot isolation page (http://en.wikipedia.org/wiki/Snapshot_isolation).

Also, Peter Bailis's blog has info and links to papers if you want more detail (http://www.bailis.org/blog/when-is-acid-acid-rarely/)

jeffdavis11y ago

I know, in general, about various kinds of serialization anomalies, and I know (in moderate detail) how all of the isolation modes are implemented in postgres.

1 more reply

wbkang11y ago

jeffdavis11y ago

But that will not happen in postgres on any version in any configuration with any isolation level. It just can't happen.

1 more reply

jeffdavis11y ago· 4 in thread

I strongly rebut the following claim, which is central to the article:

I hesitate because I don't really understand the details of the situation the exchange faced. But, going by the linked references here:

https://bitcointalk.org/index.php?topic=499580 http://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_...

it appears that the pattern in question, if translated very unnaturally to SQL, is something like:

    CREATE TABLE account(id int8, balance numeric);
    ...
    BEGIN;
    SELECT balance FROM account WHERE id = 123;
    -- application sees 100, subtracts 90, sees that
    -- it's still positive and does:
    UPDATE account SET balance = 10 WHERE id = 123;
    COMMIT;

Technically speaking, running that in postgres in the default configuration (read committed) is prone to a race, and you'd need to use repeatable read or serializable mode to protect you.

But that's ridiculous. Anyone using SQL would instead do:

    CREATE TABLE account(id int8, balance numeric, check(balance >= 0));
    ...
    UPDATE account SET balance = balance - 90 WHERE id = 123;

And that is not prone to a race. Try it in any version of postgres, in any configuration. You can't get double withdrawls (where only one takes effect), and you can't get it to go below zero.

So, the author is technically right: (a) if you translate the NoSQL-isms into SQL in an unnatural way; and (b) don't bother to use SERIALIZABLE mode, which costs very little in most situations.

martinklOP11y ago

Although your second example is probably what a human would write, an ORM framework would very likely generate a transaction looking like your first example.

jeffdavis11y ago

But please be a little careful when making statements like that, because it can turn people away from the systems most likely to help them.

bmm6o11y ago

I strongly rebut the following claim, which is central to the article

jeffdavis11y ago

The author specifically claimed that using a safer system like postgres would not have prevented the problem. But the evidence only says that it may not have prevented the problem.

It still looks to me like they would have stood a better chance with poatgres.

jrullmann11y ago· 3 in thread

Very cool - would love to see tests of FoundationDB!

FYI I'm an engineer at FDB, happy to help.

martinklOP11y ago

Awesome, I'd love a pull request! I've been looking at FoundationDB, but haven't had time to test it. Porting the tests to another database is (hopefully) a mostly mechanical exercise.

wwilson11y ago

Here's one way you could run this test vs. FoundationDB:

https://gist.github.com/MMcM/f00108c5943e919f73d1

jrullmann11y ago

Just submitted a pull request for FoundationDB: https://github.com/ept/hermitage/pull/1

Animats11y ago· 2 in thread

A key concept here is that real databases have transactions, there's some concurrency between transactions, and that in the more efficient modes, transactions can deadlock, fail, and be rolled back.

    Process A does a START TRANSACTION.
    Process B does a START TRANSACTION.
    Process A does a SELECT which reads row X.
    Process B does a SELECT which reads row X.  
    Process A does an UPDATE which writes row X.
    Process B does an UPDATE which writes row X.  - Deadlock error.

Ref: http://dev.mysql.com/doc/refman/5.1/en/innodb-lock-modes.htm...

martinklOP11y ago

What you describe is the Lost Update (P4) anomaly. It's in the test suite: https://github.com/ept/hermitage/blob/master/mysql.md#lost-u...

MySQL in repeatable read mode actually doesn't prevent this anomaly (it doesn't deadlock). You need to be in serializable mode to get the deadlock.

Animats11y ago

You're right. MS SQL Server does reject that on repeatable read, but MySQL does not.

1 more reply

a-priori11y ago· 2 in thread

In case it's interesting to people, here's a blog post I made earlier this year on the topic of what exactly transaction isolation means:

http://www.michaelmelanson.net/2014/03/20/transactions/

martinklOP11y ago

fdr11y ago

Yes: it is the relatively new approach by Cahill et al in 2008, known as SSI: https://courses.cs.washington.edu/courses/cse444/08au/544M/R...

Which, as far as production-common database implementations go, is lightspeed for implementing new academic work (9.1, the first version with the feature, was released in 2011).

wpietri11y ago

Interesting! One minor quibble:

> The idea of isolation is that we want our database to be able to process several transactions at the same time (otherwise it would be terribly slow)

fintler11y ago

I wonder what approach you could use for global safe timestamp snapshot transaction checking.

https://static.googleusercontent.com/media/research.google.c...

talles11y ago

There's been so much drugs articles in HN lately that I literally expected some LSD story here...

jackhulsom11y ago

Nice ! Very interesting.

amixofpersons11y ago

An interesting read.

j / k navigate · click thread line to collapse