Looking into the future with Cassandra (opens in new tab)

(blog.digg.com)

126 pointsbnmrrs16y ago40 comments

40 comments

27 comments · 8 top-level

justin_vanw16y ago· 6 in thread

SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045, 29183, 30220, 62511, 75212, 79006) AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4;

Ok, how do we optimize this query?

Step 1: Keep all dugg items in memcached for the last N days, where N is determined by when you run out of memory. Then, your query becomes:

SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045, 29183, 30220, 62511, 75212, 79006) AND digdate < now() - interval '5 days' AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4; /* Excuse the postgresql syntax /

If your database is properly clustered, this will mean you are only running the query against partitions holding old dugs, which is probably not as hot as the more recent stuff. Additionally, I strongly suspect that you see more recent articles more than old ones, if the article is less than 5 days old you need no SQL at all, just the memcache lookup. For example, if you are looking at the homepage, and there are 15 articles on it, you have to do a single memcached get request for all the pairs like (article_id, friend_id), so if you have 100 friends that is 100

15 keys to request. This is large, but who cares, you can add memcached servers and webservers until you puke and this will keep scaling without limit. When browsing old articles the db will get hit heavily, but only the partitions holding old data, and I would guess that this is a very very small fraction of their overall use.

Step 2: When a user is activly using the site, like they have viewed 2 pages in the last 10 minutes or something, shove all their old (article_id, friend_id) pairs into memcached as well. Once a user has reached the 'activity threshhold' and the cache is filled, no sql is necessary to find all their friend's dug articles. As a bonus, no weirdo software like 'cassandra' which may or may not continue to exist in 1 year is necessary.

For step 1 you need very little effort, just put a key into memcached every time a user digs something, and put a 5 day timeout on that key. This is 1 line of code in whatever code handles the http request representing a 'dig'. Then you have to build up the list of friends and keep it somewhere when a user logs in to the site (or returns with a cookie that has them logged in). This would take one memcache request when the user logs in/comes back to see if their friends list is in memcached, one sql statement if it is not, and a line in the are that handles adding friends to spoil the key if their friends list changes (you could try updating it, but why, just let it be regenerated on their next http request). Finally, you have to generate the keys for the (article_id, friend_id) pairs on each page view, and do a multi_get from memcached.

Step 2 would require an asynchronous process, so would be more complex.

I could implement step 1 in an hour or so if familiar with the digg codebase, and step 2 in perhaps 2 days, however if they have other async processes that occur when a user logs in that you could integrate this with it could take as little as an hour or two as well, since the logic is dead simple, it is the mechanics of running a process to do it that is time consuming.

Finally, you would have to figure out how much memory you would need to store N days of digs (users with no friends do not count in this). I believe it would not be very much.

ojbyrne16y ago

Since I wrote that query back in 2005, and it was actually somewhat more complex (had to deal with users banned after they dug the same story as you for starters) I assume Ian (whom I've never met) is simplifying for pedagogical reasons. And we had something similar to what you describe in no. 2 in 2006.

ojbyrne16y ago

Also I find it fascinating that the table is "Diggs" (it was originally "digs" and I suspect it still is). Gotta get the branding right, even in sql queries.

stephenjudkins16y ago

When a Memcached instance goes under, all Diggs stored on that instance in the last N days disappear. Not the end of the world for this application, but very undesirable for most things. You could potentially "fall back" to MySQL for this but a workable strategy for that probably isn't simple.

Further, step two is irrelevant. There's no way of knowing if a (article_id, friend_id) pair in memcached is from the most recent N days or whether it's been stuffed in due to a user being active. Therefore, searching the DB for older diggs is still necessary, and should take the exact same amount of load as if they weren't in memcached at all.

Memcached + MySQL makes great sense when the data set is small and simple. If all the content on the site fits within 1 GB you could probably easily push a hundred millions unique visitors a day. For an application like this, the relatively poor performance of MySQL and the inflexibility of Memcached cause problems.

It seems to me that using Cassandra, even in its current immature state, makes much more sense than the solution you're proposing.

justin_vanw16y ago

Ahh, you make a great point, that if an instance goes down you lose those keys. Cassandra doesn't necessarily solve this, however, unless you have enough failover that it is unthinkable that you could have cassandra lose data. If you can rebuild the cache in cassandra you can rebuild it in the identical way using my scheme, but my scheme only requires N days of data to be run through.

Step 2 is fine, memcached allows setting of timeouts on keys. If you always set a timeout of (dig_time + five_days) - now() you are set. Be careful, times > 30 days imply a unix timestamp not a timeout.

Mysql kindof sucks. The clustering is easy to set up at least.

Cassandra and other big on disk hash tables are pretty cool. I think once they have datastore-like indexing capabilities they will be totally usable. My qualm with using them is that there are 80 of them right now, and they are all pretty immature. The ones that are mature, like bsddb, and complicated to use.

My point was that cassandra is just a big memcached.

1 more reply

imajes16y ago

Sure.

but NoSQL kinda gives you memcache like awesomeness with some level of persistence. Think of cassandra as a more feature-rich, persistent memcache. :)

justin_vanw16y ago

No. NoSQL gives you memcached on a disk, which sucks a lot. SQL is good.

1 more reply

mcav16y ago· 5 in thread

Wow:

> For this feature, the fully denormalized Cassandra dataset weighs in at 3 terabytes and 76 billion columns.

breck16y ago

What's amazing is that 3 terabyes costs about $300 nowadays. Include good components, power, redundancy, maintenance, and it probably comes to $1,000 per year.

So this feature is actually pretty cheap to do, unless I'm missing something.

SwellJoe16y ago

Every time I read "terabyte", I have to pause, because my brain goes "That's gigantic!" and then a little voice says, "It's the same size as the hard disk in my desktop...it cost $89."

I have the same sort of weird mental disconnect with SD cards. It's just hard to fathom 32GB of data fitting into something about the size of a quarter.

2 more replies

jbellis16y ago

Serving up thousands of requests per seconds from 3 TB (while continually adding and updating more data!) is a lot different from just archiving it.

A _lot_. :)

To put it in perspective, 100GB is pretty much the limit of what you can serve from a single mysql machine, for instance. This will vary depending on exact workload but that is a reasonable ball park number.

2 more replies

joechung16y ago

3 TB of disk is easy. 3 TB flowing quickly from disk to RAM to CPU cache and back out again is another story.

1 more reply

pgebhard16y ago

I was thinking "WOW", as well, but breck's right. This shouldn't really be that surprising any more given the price of disk storage these days.

rarrrrrr16y ago· 2 in thread

In SQL systems, you can easily shift the computation cost from at read time to at write time. That's what triggers are for. When new data comes in that changes a result you know you need quickly, a trigger can automatically add to a work queue to update a result table.

Since the trigger has automatic access to the contents of the new data (and old data in case of an update or delete) the computation to update the results table can often be made much faster.

Every situation is different of course, but it's overreaching to say that SQL systems have no options beyond read-time computation of results.

staunch16y ago

This was my first thought too. Why not just denormalize the data in MySQL? What benefit is Cassandra really offering here over that option?

jbellis16y ago

Cassandra gives you two benefits.

First, Cassandra uses a disk layout similar to the one described in the Bigtable paper (http://labs.google.com/papers/bigtable.html sections 5.3 and 5.4); in particular it does no random writes. Relational databases like mysql pretty much all use btree-based storage which was great 20 years ago but is terrible today when seeks are your bottleneck.

I was talking to some people today who are struggling to get mysql to do ~100 insert/update operations per second. Cassandra will easily give you 10x that -- _per node_.

The second benefit is that Cassandra gives you real, scalable partitioning, invisible to the app, for when you do need to add nodes. When you have more than a handful of machines, not having to babysit replication + partitioning is a huge, huge deal.

3 more replies

mikeryan16y ago· 2 in thread

This is weird to me.

"We started thinking seriously about deploying Cassandra in production around three weeks ago. After looking at the site for something that would be a good fit, we settled on green badges."

It seems completely baffling to me that someone would go out and compare different db solutions, pick one and THEN try to find a way to fit it on their site architecture.

jbellis16y ago

I read that as, "which feature is easiest to port from the existing mysql solution?" Which is a reasonable question to ask, given how different the two databases are.

mikeryan16y ago

My process would be this.

1. Decide that mysql (or other RDB) is problematic.

2. Decide which features cause the most pain (List them with the most painful on top)

3. Discover which alternative db solutions would do the best in alleviating the top issues.

1 more reply

gcv16y ago· 2 in thread

I read in several blog posts that Cassandra has its share of data-corrupting bugs. http://blog.evanweaver.com/articles/2009/07/06/up-and-runnin... mentioned that even Facebook does not use it as a system of record. Wonder how Digg deals with that.

jbellis16y ago

What that article says is if you do something analogous to backing up a live mysql database with dumb cp, Bad Things Happen.

Yes, we do plan to make it harder to shoot yourself in the foot, but this isn't a data-corrupting bug of the kind you meant.

imajes16y ago

Note that they're using cassandra for _computed_ data- the original components still exist elsewhere.

imajes16y ago· 2 in thread

Anyone from digg here? I'd love to know how long it took to build that dataset the first time. In other words: what's the recovery window like to reset that size dataset?

ieure16y ago

It takes around 27 hours. 3 to load the MySQL data into Hadoop, then 24 to compute and insert into Cassandra.

imajes16y ago

thanks!

by16y ago

Why does the first step "Query Friends for all my friends" take 1.5 seconds? I am struggling to understand this. If this simple table of say 100,000,000 rows is indexed on userid and we are only reading and returning say 200 rows for a particular userid what makes it so slow?

Quarrelsome16y ago

But surely that's broken. If I dig something but then add a friend AFTERWARDS, they wont see the shield as the bucket for my digg didn't contain their user id at write time. Am I missing something?

j / k navigate · click thread line to collapse

40 comments

27 comments · 8 top-level

justin_vanw16y ago· 6 in thread

SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045, 29183, 30220, 62511, 75212, 79006) AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4;

Ok, how do we optimize this query?

Step 1: Keep all dugg items in memcached for the last N days, where N is determined by when you run out of memory. Then, your query becomes:

Step 2 would require an asynchronous process, so would be more complex.

Finally, you would have to figure out how much memory you would need to store N days of digs (users with no friends do not count in this). I believe it would not be very much.

ojbyrne16y ago

Also I find it fascinating that the table is "Diggs" (it was originally "digs" and I suspect it still is). Gotta get the branding right, even in sql queries.

stephenjudkins16y ago

It seems to me that using Cassandra, even in its current immature state, makes much more sense than the solution you're proposing.

justin_vanw16y ago

Mysql kindof sucks. The clustering is easy to set up at least.

My point was that cassandra is just a big memcached.

1 more reply

imajes16y ago

Sure.

but NoSQL kinda gives you memcache like awesomeness with some level of persistence. Think of cassandra as a more feature-rich, persistent memcache. :)

justin_vanw16y ago

No. NoSQL gives you memcached on a disk, which sucks a lot. SQL is good.

1 more reply

mcav16y ago· 5 in thread

Wow:

> For this feature, the fully denormalized Cassandra dataset weighs in at 3 terabytes and 76 billion columns.

breck16y ago

What's amazing is that 3 terabyes costs about $300 nowadays. Include good components, power, redundancy, maintenance, and it probably comes to $1,000 per year.

So this feature is actually pretty cheap to do, unless I'm missing something.

SwellJoe16y ago

Every time I read "terabyte", I have to pause, because my brain goes "That's gigantic!" and then a little voice says, "It's the same size as the hard disk in my desktop...it cost $89."

I have the same sort of weird mental disconnect with SD cards. It's just hard to fathom 32GB of data fitting into something about the size of a quarter.

2 more replies

jbellis16y ago

Serving up thousands of requests per seconds from 3 TB (while continually adding and updating more data!) is a lot different from just archiving it.

A _lot_. :)

2 more replies

joechung16y ago

3 TB of disk is easy. 3 TB flowing quickly from disk to RAM to CPU cache and back out again is another story.

1 more reply

pgebhard16y ago

I was thinking "WOW", as well, but breck's right. This shouldn't really be that surprising any more given the price of disk storage these days.

rarrrrrr16y ago· 2 in thread

Since the trigger has automatic access to the contents of the new data (and old data in case of an update or delete) the computation to update the results table can often be made much faster.

Every situation is different of course, but it's overreaching to say that SQL systems have no options beyond read-time computation of results.

staunch16y ago

This was my first thought too. Why not just denormalize the data in MySQL? What benefit is Cassandra really offering here over that option?

jbellis16y ago

Cassandra gives you two benefits.

I was talking to some people today who are struggling to get mysql to do ~100 insert/update operations per second. Cassandra will easily give you 10x that -- _per node_.

3 more replies

mikeryan16y ago· 2 in thread

This is weird to me.

"We started thinking seriously about deploying Cassandra in production around three weeks ago. After looking at the site for something that would be a good fit, we settled on green badges."

It seems completely baffling to me that someone would go out and compare different db solutions, pick one and THEN try to find a way to fit it on their site architecture.

jbellis16y ago

I read that as, "which feature is easiest to port from the existing mysql solution?" Which is a reasonable question to ask, given how different the two databases are.

mikeryan16y ago

My process would be this.

1. Decide that mysql (or other RDB) is problematic.

2. Decide which features cause the most pain (List them with the most painful on top)

3. Discover which alternative db solutions would do the best in alleviating the top issues.

1 more reply

gcv16y ago· 2 in thread

jbellis16y ago

What that article says is if you do something analogous to backing up a live mysql database with dumb cp, Bad Things Happen.

Yes, we do plan to make it harder to shoot yourself in the foot, but this isn't a data-corrupting bug of the kind you meant.

imajes16y ago

Note that they're using cassandra for _computed_ data- the original components still exist elsewhere.

imajes16y ago· 2 in thread

Anyone from digg here? I'd love to know how long it took to build that dataset the first time. In other words: what's the recovery window like to reset that size dataset?

ieure16y ago

It takes around 27 hours. 3 to load the MySQL data into Hadoop, then 24 to compute and insert into Cassandra.

imajes16y ago

thanks!

by16y ago

Quarrelsome16y ago

But surely that's broken. If I dig something but then add a friend AFTERWARDS, they wont see the shield as the bucket for my digg didn't contain their user id at write time. Am I missing something?

j / k navigate · click thread line to collapse