Databases at 14.4Mhz (opens in new tab)

(blog.foundationdb.com)

297 pointschton11y ago81 comments

81 comments

54 comments · 15 top-level

hendzen11y ago· 9 in thread

This is very impressive, however...

See this tweet by @aphyr: https://twitter.com/aphyr/status/542755074380791809

(All credit for the idea in this comment is due to @aphyr)

Basically because the transactions modified keys selected from a uniform distribution, the probability of contention was extremely low. AKA this workload is basically a data-parallel problem, somewhat lessening the impressiveness of the high throughput. Would be interesting to see it with a Zipfian distribution (or even better, a Biebermark [0])

[0] - http://smalldatum.blogspot.co.il/2014/04/biebermarks.html

Dave_Rosenthal11y ago

Actually, @aphyr's analysis is wrong. In fact it's actually worse that he says. There is an exactly 0% chance of conflicts between two transactions that each only write 20 random keys (as they do in this test). This is perhaps counterintuitive, but, because there are no reads, any serialization order is possible and therefore there is no chance of conflict.

Equally wrong is his assumption that this has any bearing on the performance of FoundationDB. In fact FoundationDB will perform the same whether the conflict rate is high or low. This isn't to say that FoundationDB somehow cheats the laws of transaction conflicts, just that it has to do all the work in either case. There is no trick or cheat on this test--this same performance will hold on a variety of workloads with varying conflict rates as well as those including reads.

fry_the_guy11y ago

Your definition of performance is a little weird.

Presumably you will want to retry conflicting transactions, so you generally would not count them towards your throughput.

For example if I commit 100 transactions per second, and 90% of them return conflicts, I am only successfully committing 10 transactions per second.

Dave_Rosenthal11y ago

That's true: a large conflict rate would eat into your budget because clients would be retrying.

Of course most real world workloads are (hopefully!) no where near 90% conflicts. If you had a 90% transaction conflict rate you could expect FoundationDB performance to drop by about 30-60% due to retries (and, worse news, you would need to rethink your architecture).

dmitrig0111y ago

I'm not an expert in this, but:

because there are no reads, any serialization order is possible and therefore there is no chance of conflict

I don't understand how this is the case. If clients A and B try to write to the same key, it can be serialized as {A,B} or {B,A}, but in either case, there is some kind of conflict... no?

Dave_Rosenthal11y ago

Nope. Not unless someone actually read that same key in their transaction. Then, with optimistic concurrency at least, you might get a conflict at commit time that basically says "Hey, someone changed that key you read in the meantime so your write might not be correct anymore".

1 more reply

wmf11y ago

Unfortunately this problem isn't specific to FoundationDB; the old "industry-standard" TPC-C benchmark has a similar low-contention design which has led to years of unrepresentative performance tuning and benchmarketing.

imanaccount24711y ago

How is it unrepresentative? TPC-C involves several queries simulating new orders being entered, being updated, filled, paid, etc. It is very representative of the workload it is simulating. If you want to simulate a different workload, pick one of the other benchmarks instead of C.

jrallison11y ago

Absolutely. Choosing the proper data model based on your access patterns is one of the best ways of keeping FDB performance high by reducing the likelihood of contention.

vosper11y ago

Yeah, this would have been more interesting if they could tell us what the throughput is given various levels of contention.

lttlrck11y ago· 7 in thread

"Or, as I like to say, 14.4Mhz."

Sorry, I don't like that at all.

ctdonath11y ago

Inclined to agree. Title made me think it was a retro-computing analysis of running dBASE II on an IBM PC/AT or some such.

simcop238711y ago

I would agree. While 14.4 million writes per second is impressive, it definitely doesn't fit the definition of the Hz unit (cycles per second).

jrochkind111y ago

"Hertz" just means "per second". It can be anything per second, there are no units in the numerator. It just a measure of frequency, anything per second.

http://en.wikipedia.org/wiki/Hertz

If anything you are counting is happening 14.4 million times a second, then that thing is happening at 14 megahertz. It's just a unit. If there are 14.4 million writes per second, then the writes are occuring at 14.4 mega hertz. That's just the definition of hertz.

Haha, Google agrees, check it out: https://www.google.com/?q=4+per+second#q=4+per+second

ctdonath11y ago

"Hertz" just means "cycles per second".

The reference is to frequency, not average or typical rate; there's a difference between a tone and pink noise.

1 more reply

SpeckledJim191711y ago

Right, because a database write is not generally a periodic event. The unit they want is the becquerel - 14.4MBq might even sound more impressive because not so many people are familiar with that unit!

dj-wonk11y ago

I agree that Hz seems to imply a periodic event.

But I disagree that the Bq is appropriate for this example; the becquerel quantifies radioactivity:

> The becquerel (symbol Bq) (pronounced: 'be-kə-rel) is the SI derived unit of radioactivity. One Bq is defined as the activity of a quantity of radioactive material in which one nucleus decays per second. (https://en.wikipedia.org/wiki/Becquerel)

1 more reply

saeguaiga11y ago

I am still trying to push, pull or twist my concept of an ultralow-power mcu that runs 14.4MHz (or at nearly 14mA drain, 32MHz) in order to have it not only perform its optimal max of 1 I/O per cycle but do a simultaneous random read and write. Values are 1kb and keys 16 bytes, though... (As opposed to using the 1000 Amazon vCPUs each ~1.0 Xeon core with 4 hyperthreads (see http://www.pythian.com/blog/virtual-cpus-with-amazon-web-ser... )) That would be a pretty fat memory model (or HMC controller) for a micro...

dchichkov11y ago· 5 in thread

I'm only familiar with other key-value storage engines, not FoundationDB, but it seems like the goals are: "distributed key-value database, read latencies below 500 microseconds, ACID, scalability".

I remember evaluating a few low latency key-value storage solutions, and one of these was Stanford's RAMCloud, which is supposed to give 4-5 microseconds reads, 15 microseconds writes, scale up to 10,000 boxes and provide data durability. https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud Seems like, that would be "Databases at 2000Mhz".

I've actually studied the code that was handling the network and it had been written pretty nicely, and as far as I know, it should work both over 10Gbe and Infiniband with similar latencies. And I'm not at all surprised, they could get pretty clean looking 4-5us latency distribution, with the code like that.

How does it compare with FoundationDB? Is it completely different technology?

jrallison11y ago

Don't know much about RAMCloud, but from the description:

"Many more issues remain, such as whether we can provide higher-level features such as secondary indexes and multiple-object transactions without sacrificing the latency or scalability of the system. We are currently exploring several of these issues."

Sounds it doesn't provide multi-key ACID transactions at the very least.

1 more reply

imaginenore11y ago

But it doesn't store the data on disk.

It should be compared to Memcached, not to a real storage engine.

dchichkov11y ago

Unlike Memcached or Redis, RAMCloud provides durability:

  * Automatic replication, crash recovery, and fail-over 
    (no loss of availability if a server fails)
  * Durability guarantee: data is always replicated 
    and durable before operations return, without
    significant performance hit (subject to the 
    requirement for persistent buffers on backups).

grogers11y ago

You can't make a durable write to disk in 15us. SSDs are around 2 orders of magnitude slower than that. 15us is about right for in memory on two nodes in a single data center though. Most people wouldn't consider that durable though.

imaginenore11y ago

So? Power goes out in your data center, all the data is lost.

Nothing stops you from replicating memcached.

shortstuffsushi11y ago· 4 in thread

As someone who has no idea about the cost of high-scale computing like this, is $150/hr reasonable? It seems like an amount that's hard to sustain to me, but I have no idea if that's a steady, all the time rate, or a burst rate, or what. Or if it's a set up you'd actually ever even need -- seems like from the examples they mention (like the Tweets), they're above the need by a fair amount. Anyone else in this sort of situation care to chip in on that?

landryraccoon11y ago

Per the article, it's about 1/20th the cost that Google would charge you for the same number of writes.

shortstuffsushi11y ago

I had seen that as well, actually. Is this the kind of data or transaction level that might only be expected for a Google level throughput?

emn1311y ago

Well, do you do 14.4 million serializable writes per second or anything close to that?

1 more reply

primitivesuave11y ago

There are literally thousands of companies that each hand over millions of dollars to Oracle on a semi-regular basis, for enormous server setups to store enormous data sets. This is a major improvement.

jrallison11y ago· 3 in thread

We've been using FoundationDB in production for about 10 months now. It's really been a game changer for us.

We continue to use it for more and more data access patterns which require strong consistency guarantees.

We currently store ~2 terabytes of data in a 12 node FDB cluster. It's rock solid and comes out of the box with great tooling.

Excited about this release! My only regret is I didn't find it sooner :)

lclarkmichalek11y ago

Why do you need a 12 node cluster for 2 TB of data?

jrallison11y ago

With a replication factor of 2 (for fault tolerance), it's ~4.5 TB.

FoundationDB requires SSD drives, which the best we can get efficiently in our data center is ~670 GB of usable space (3x480GB raided).

12x670 = 8040 GB

We try to keep extra space available for node failures (FDB will immediately start replicating addition data if it notices data with less than the configured number of replicas).

Our dataset currently grows at a decent pace, so we over provision a bit as well.

jedberg11y ago

Why do you use RAID on your nodes if you have an RF==2?

2 more replies

bsaul11y ago· 3 in thread

Just watched the linked presentation about "flow" here : https://foundationdb.com/videos/testing-distributed-systems-...

Is it really the first Distributed DB project to have built a simulator ?

Because frankly, if that's the case, it seems revolutionary to me. Intuitively, it seems like bringing the same kind of quality improvement as unit testing did to regular software development.

PS : i should add that this talk is one of the best i've seen this year. The guy is extremely smart, passionate, and clear. (i just loved the The Hurst exponent part).

jchrisa11y ago

At Couchbase we have a plethora of simulators. Simulations of cluster topology changes, simulations of failure scenarios, simulations of workloads to estimate requried cluster sizes, etc. Here's one: https://github.com/couchbaselabs/cbfg

Dave_Rosenthal11y ago

I think the difference is that FoundationDB has only one, and it's not an external simulation. The actual code that runs in production can also deterministically simulate a cluster of itself.

I do believe this is unique among publically-available distributed databases.

anonymousDan11y ago

Why is it a good thing that it's not an external simulation?

2 more replies

felixgallo11y ago· 3 in thread

This looks very interesting and congratulations to the FoundationDB crew on some pretty amazing performance numbers.

One of the links leads to an interesting C++ actor preprocessor called 'Flow'. In that table, it lists the performance result of sending a message around a ring for a certain number of processes and a certain number of messages, in which Flow appears to be fastest with 0.075 sec in the case of N=1000 and M=1000, compared with, e.g. erlang @ 1.09 seconds.

My curiosity was piqued, so I threw together a quick microbenchmark in erlang. On a moderately loaded 2013 macbook air (2-core i7) and erlang 17.1, with 1000 iterations of M=1000 and N=1000, it averaged 34 microseconds per run, which compares pretty favorably with Flow's claimed 75000 microseconds. The Flow paper appears to maybe be from 2010, so it would be interesting to know how it's doing in 2014.

emn1311y ago

There's got to be some mistake there somewhere - on your part, or on theirs - because there's no way erlag improved from 1.09 seconds to 34 microseconds on pretty much any benchmark between 2010 and 2014. Even the factor 1000 (the message count) isn't enough to account for that difference - something's fishy.

Dave_Rosenthal11y ago

1M messages passes in 34 microseconds would be ~0.1 CPU clocks per message pass. Maybe an optimizer is stepping in and short-circuiting?

felixgallo11y ago

heh, not only that, but my fingers accidentally made my test massively concurrent! Such are the perils of erlang. When I defeat the optimizer and also force message passing to be serialized both inside and outside a trial, I get an average of 525000 microseconds, which is more in line with the other results.

ChuckMcM11y ago· 2 in thread

This is Foundation DB's announcement they are doing full ACID databases with a 14.4M writes per second capability. That is insanely fast in the data base world. Running in AWS with 32 c3.8xlarge configured machines. So basically NSA level data base capability for $150/hr. But perhaps more interesting is that those same machines on the open market are about $225,000. That's two rack, one switch and a transaction rate that lets you watch every purchase made at every Walmart store in the US, in real time. That is assuming the stats are correct[1], and it wouldn't even be sweating (14M customers a day vs 14M transactions per second). Insanely fast.

I wish I was an investor in them.

[1] http://www.statisticbrain.com/wal-mart-company-statistics/

vilda11y ago

Not all ACID transactions are equal. This is just a key-value store-like test. It shows the potential to scale, yet nothing regarding performance in real word.

32 c3.2xlarge instances have 1920GB memory. Given 1 billion 16B+ 8..100B values the whole dataset fits just into memory.

The Cassandra test mentioned [1] sustained loss of 1/3 instances. That's very impressive! Would love to see how F-DB handles this type of real-life situation (hint hint for follow up blog post).

[1] http://googlecloudplatform.blogspot.cz/2014/03/cassandra-hit...

jbergens11y ago

They do seem to talk about writes per second. That means you must flush to disk, not read from cache.

[Edit, more info] They seem to run with multiple clients which also stresses the system. From their explanation * The clients simulate 320,000 concurrent sessions

maliki11y ago· 2 in thread

This sounds great compared to my anecdotal experience with DB write performance; but is there a collection of database performance benchmarks that this can be easily compared to?

The best source for DB benchmarking I know of is http://www.tpc.org/. The methodology is more complicated there, but the top results are around 8 million transactions per minute on $5 million systems. This FoundationDB result is more like 900 million transactions per minute on a system that costs $1.5 million a year to rent (so, approx $5 million to buy?).

The USD/transactions-per-minute metric is clear, but without a standard test suite (schema, queries, client count, etc.), comparing claims of database performance makes my head hurt.

Dave_Rosenthal11y ago

I think you mean "900 million transactions per minute". Of course that overstates things since each TPC-C transaction entails a lot more than one write. TPC-C is about 2/3 read and 1/3 write, and each TPC transaction might do 20 low-level read+write operations (I'm actually not sure, but I think that's in the ballpark.)

In the NoSQL world many people have converged on a workload of 90% reads/10% writes to individual keys. We show 90/10 results on our performance page [1] but in this test we do 100% writes to stress the "transaction engine", which processes writes.

Since we have our SQL Layer [2] as well, we will run some more-comparable SQL tests in the future.

[1] https://foundationdb.com/key-value-store/performance

[2] https://foundationdb.com/layers/sql

maliki11y ago

(Ah, yes, thank you, 900 million per minute. I typo'd.)

Nice performance page. It pre-answers my follow up question, which was how linear is scaling with more cores? Looks solid.

imanaccount24711y ago· 1 in thread

Why the deliberately misleading comparisons? If you are doing something genuinely impressive, then you should be able to be honest about it and have it still seem impressive. One tweet is not one write. Comparing tweets per second to writes per second is complete nonsense. How many writes a tweet causes depends on how many followers the person who is tweeting has. The 100 writes per second nonsense is even worse. Do you just think nobody is old enough to have used a database 15 years ago? 10,000 writes per second was no big deal on of the shelf hardware of the day, nevermind on an actual server.

jrochkind111y ago

it's just a comparison to give you a sense of what those numbers look like, order of magnitude comparisons. I didn't find it misleading, and didn't think it meant that you'd be able to run twitter on their db on one commodity server. But it's an order of magnitude estimate to give you a sense of the scale.

w8rbt11y ago

I thought it was DB connections over radio waves just above 20 meters. Also, it's MHz, not Mhz.

illumen11y ago

Impressive.

However I think there's still plenty of room to grow.

320,000 concurrent sessions isn't that much by modern standards. You can get 12 million concurrent connections on one linux machine, and push 1gigabit of data.

Also, 167 megabytes per second (116B * 14.4 million) is not pushing the limits of what one machine can do. I've been able to process 680 megabytes per second of data into a custom video database, plus write it to disk on one 2010 machine. That's doing heavy processing at the same time on the video with plenty of CPU to spare.

PCIe over fibre can do many transactions messages per second. You can fit 2TB memory machines in 1U (and more).

Since this is a memory + eventually dump to disk database, I think there is still a lot of room to grow.

mariusz7911y ago

MHz not Mhz.

tuyguntn11y ago

I wish I would work with such great professionals as a Junior for 10years, right after school!!!

oconnor66311y ago

Any chance of an open source release for the database core? :)

j / k navigate · click thread line to collapse

81 comments

54 comments · 15 top-level

hendzen11y ago· 9 in thread

This is very impressive, however...

See this tweet by @aphyr: https://twitter.com/aphyr/status/542755074380791809

(All credit for the idea in this comment is due to @aphyr)

[0] - http://smalldatum.blogspot.co.il/2014/04/biebermarks.html

Dave_Rosenthal11y ago

fry_the_guy11y ago

Your definition of performance is a little weird.

Presumably you will want to retry conflicting transactions, so you generally would not count them towards your throughput.

For example if I commit 100 transactions per second, and 90% of them return conflicts, I am only successfully committing 10 transactions per second.

Dave_Rosenthal11y ago

That's true: a large conflict rate would eat into your budget because clients would be retrying.

dmitrig0111y ago

I'm not an expert in this, but:

because there are no reads, any serialization order is possible and therefore there is no chance of conflict

I don't understand how this is the case. If clients A and B try to write to the same key, it can be serialized as {A,B} or {B,A}, but in either case, there is some kind of conflict... no?

Dave_Rosenthal11y ago

1 more reply

wmf11y ago

imanaccount24711y ago

jrallison11y ago

Absolutely. Choosing the proper data model based on your access patterns is one of the best ways of keeping FDB performance high by reducing the likelihood of contention.

vosper11y ago

Yeah, this would have been more interesting if they could tell us what the throughput is given various levels of contention.

lttlrck11y ago· 7 in thread

"Or, as I like to say, 14.4Mhz."

Sorry, I don't like that at all.

ctdonath11y ago

Inclined to agree. Title made me think it was a retro-computing analysis of running dBASE II on an IBM PC/AT or some such.

simcop238711y ago

I would agree. While 14.4 million writes per second is impressive, it definitely doesn't fit the definition of the Hz unit (cycles per second).

jrochkind111y ago

"Hertz" just means "per second". It can be anything per second, there are no units in the numerator. It just a measure of frequency, anything per second.

http://en.wikipedia.org/wiki/Hertz

Haha, Google agrees, check it out: https://www.google.com/?q=4+per+second#q=4+per+second

ctdonath11y ago

"Hertz" just means "cycles per second".

The reference is to frequency, not average or typical rate; there's a difference between a tone and pink noise.

1 more reply

SpeckledJim191711y ago

dj-wonk11y ago

I agree that Hz seems to imply a periodic event.

But I disagree that the Bq is appropriate for this example; the becquerel quantifies radioactivity:

1 more reply

saeguaiga11y ago

dchichkov11y ago· 5 in thread

I'm only familiar with other key-value storage engines, not FoundationDB, but it seems like the goals are: "distributed key-value database, read latencies below 500 microseconds, ACID, scalability".

How does it compare with FoundationDB? Is it completely different technology?

jrallison11y ago

Don't know much about RAMCloud, but from the description:

Sounds it doesn't provide multi-key ACID transactions at the very least.

1 more reply

imaginenore11y ago

But it doesn't store the data on disk.

It should be compared to Memcached, not to a real storage engine.

dchichkov11y ago

Unlike Memcached or Redis, RAMCloud provides durability:

  * Automatic replication, crash recovery, and fail-over 
    (no loss of availability if a server fails)
  * Durability guarantee: data is always replicated 
    and durable before operations return, without
    significant performance hit (subject to the 
    requirement for persistent buffers on backups).

grogers11y ago

imaginenore11y ago

So? Power goes out in your data center, all the data is lost.

Nothing stops you from replicating memcached.

shortstuffsushi11y ago· 4 in thread

landryraccoon11y ago

Per the article, it's about 1/20th the cost that Google would charge you for the same number of writes.

shortstuffsushi11y ago

I had seen that as well, actually. Is this the kind of data or transaction level that might only be expected for a Google level throughput?

emn1311y ago

Well, do you do 14.4 million serializable writes per second or anything close to that?

1 more reply

primitivesuave11y ago

jrallison11y ago· 3 in thread

We've been using FoundationDB in production for about 10 months now. It's really been a game changer for us.

We continue to use it for more and more data access patterns which require strong consistency guarantees.

We currently store ~2 terabytes of data in a 12 node FDB cluster. It's rock solid and comes out of the box with great tooling.

Excited about this release! My only regret is I didn't find it sooner :)

lclarkmichalek11y ago

Why do you need a 12 node cluster for 2 TB of data?

jrallison11y ago

With a replication factor of 2 (for fault tolerance), it's ~4.5 TB.

FoundationDB requires SSD drives, which the best we can get efficiently in our data center is ~670 GB of usable space (3x480GB raided).

12x670 = 8040 GB

We try to keep extra space available for node failures (FDB will immediately start replicating addition data if it notices data with less than the configured number of replicas).

Our dataset currently grows at a decent pace, so we over provision a bit as well.

jedberg11y ago

Why do you use RAID on your nodes if you have an RF==2?

2 more replies

bsaul11y ago· 3 in thread

Just watched the linked presentation about "flow" here : https://foundationdb.com/videos/testing-distributed-systems-...

Is it really the first Distributed DB project to have built a simulator ?

Because frankly, if that's the case, it seems revolutionary to me. Intuitively, it seems like bringing the same kind of quality improvement as unit testing did to regular software development.

PS : i should add that this talk is one of the best i've seen this year. The guy is extremely smart, passionate, and clear. (i just loved the The Hurst exponent part).

jchrisa11y ago

Dave_Rosenthal11y ago

I think the difference is that FoundationDB has only one, and it's not an external simulation. The actual code that runs in production can also deterministically simulate a cluster of itself.

I do believe this is unique among publically-available distributed databases.

anonymousDan11y ago

Why is it a good thing that it's not an external simulation?

2 more replies

felixgallo11y ago· 3 in thread

This looks very interesting and congratulations to the FoundationDB crew on some pretty amazing performance numbers.

emn1311y ago

Dave_Rosenthal11y ago

1M messages passes in 34 microseconds would be ~0.1 CPU clocks per message pass. Maybe an optimizer is stepping in and short-circuiting?

felixgallo11y ago

ChuckMcM11y ago· 2 in thread

I wish I was an investor in them.

[1] http://www.statisticbrain.com/wal-mart-company-statistics/

vilda11y ago

Not all ACID transactions are equal. This is just a key-value store-like test. It shows the potential to scale, yet nothing regarding performance in real word.

32 c3.2xlarge instances have 1920GB memory. Given 1 billion 16B+ 8..100B values the whole dataset fits just into memory.

The Cassandra test mentioned [1] sustained loss of 1/3 instances. That's very impressive! Would love to see how F-DB handles this type of real-life situation (hint hint for follow up blog post).

[1] http://googlecloudplatform.blogspot.cz/2014/03/cassandra-hit...

jbergens11y ago

They do seem to talk about writes per second. That means you must flush to disk, not read from cache.

[Edit, more info] They seem to run with multiple clients which also stresses the system. From their explanation * The clients simulate 320,000 concurrent sessions

maliki11y ago· 2 in thread

This sounds great compared to my anecdotal experience with DB write performance; but is there a collection of database performance benchmarks that this can be easily compared to?

The USD/transactions-per-minute metric is clear, but without a standard test suite (schema, queries, client count, etc.), comparing claims of database performance makes my head hurt.

Dave_Rosenthal11y ago

Since we have our SQL Layer [2] as well, we will run some more-comparable SQL tests in the future.

[1] https://foundationdb.com/key-value-store/performance

[2] https://foundationdb.com/layers/sql

maliki11y ago

(Ah, yes, thank you, 900 million per minute. I typo'd.)

Nice performance page. It pre-answers my follow up question, which was how linear is scaling with more cores? Looks solid.

imanaccount24711y ago· 1 in thread

jrochkind111y ago

w8rbt11y ago

I thought it was DB connections over radio waves just above 20 meters. Also, it's MHz, not Mhz.

illumen11y ago

Impressive.

However I think there's still plenty of room to grow.

320,000 concurrent sessions isn't that much by modern standards. You can get 12 million concurrent connections on one linux machine, and push 1gigabit of data.

PCIe over fibre can do many transactions messages per second. You can fit 2TB memory machines in 1U (and more).

Since this is a memory + eventually dump to disk database, I think there is still a lot of room to grow.

mariusz7911y ago

MHz not Mhz.

tuyguntn11y ago

I wish I would work with such great professionals as a Junior for 10years, right after school!!!

oconnor66311y ago

Any chance of an open source release for the database core? :)

j / k navigate · click thread line to collapse