For each and every thing that Jason talked about...upgrading Cassandra, moving off EBS, embarking on self-heal and auto-scale projects...what took the reader a few seconds to read and cognise undoubtedly represented hours and hours of work on the part of the Reddit admins.
I guess it's just the nature of the human mind. I don't think I could ever fully appreciate the amount of work that goes into any project unless I've been through it myself (and even then, the brain is awesome at minimizing the memory of pain). So Reddit admins, if you're reading this, while I certainly can't fully appreciate the amount of labor and life-force you've dedicated to the site, I honestly do appreciate it, and I wish you guys nothing but success in the future!
I got to work with Riak a lot while I was at DotCloud, but the speed issue was pretty frustrating (it can be painfully slow).
Cassandra has enabled Reddit to manage a highly scalable distributed data store with a tiny staff. This is not to say it has been trouble free, but it has enabled them to do something that would have been infeasible without pioneers in this space (Cassandra, Riak, Voldemort, etc) making these tools available.
That said, they may be freaked out based on their growth curve and simply thinking ahead.
Balancing your cluster requires a little bit more handholding, and if something goes wrong or you fuck it up, it can be pretty challenging. But most of the time it's pretty painless.
There are a lot of other warts, though: the data model is slightly weird, secondary indexing is slow, and eventual consistency is hard to wrap your head around. But it doesn't require much effort to run and operate a large cluster, and if that's important to you and your application, you should check it out.
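A toy sketch of that last point, in plain Python with nothing to do with Cassandra's actual implementation: replicas accept writes independently and converge later via last-write-wins timestamps, so a read against the wrong replica can return stale data until the replicas have gossiped.

```python
class Replica:
    """Toy replica: stores (timestamp, value) per key, last-write-wins."""
    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        # Accept the write only if it's newer than what we already have.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Gossip every key to every replica; timestamps resolve conflicts."""
    for r in replicas:
        for key, (ts, value) in list(r.data.items()):
            for other in replicas:
                other.write(key, value, ts)


a, b = Replica(), Replica()
a.write("karma", 10, ts=1)   # the write lands on replica a only
stale = b.read("karma")      # None: b hasn't seen the write yet
anti_entropy([a, b])
fresh = b.read("karma")      # 10 once the replicas have converged
```

That window between `stale` and `fresh` is exactly what your application code has to be prepared for.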
The NoSQL space is pretty interesting, but there is no clear winner; each of the competing solutions has its own niche and its own specialties, so it's impossible to give general recommendations right now.
The two phases we've seen are:
1/ It's flexible and it works! Problem solved!
2/ The 21st century called; they want their performance back.
The problem with phase 2 is that you may not be able to solve it by throwing more computing power at it.
Unfortunately if you really need map-reduce, at the moment I don't know what to recommend. Riak isn't better performance-wise and our product doesn't support map-reduce (yet).
However, if you don't need map-reduce, I definitely recommend not using Cassandra. There are a lot of non-relational databases out there that are an order of magnitude faster.
However I have the gut feeling we're far from squeezing out all the juice from today's hardware.
Godspeed, reddit. You're on the right track.
I did the same for my site last year and it was great.
This is one of the reasons why I haven't moved my Postgres databases to EnterpriseDB or Heroku: they use EBS.
So our approach is to RAID-10 four local volumes together. We then replicate to at least 3 slaves, all of which are configured the same and can become master in the event of a failover.
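Roughly what the array setup looks like; the device names, filesystem, and mount point are assumptions and will differ per instance type:

```
# Stripe+mirror four ephemeral disks into one RAID-10 array:
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/postgresql
```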
We use WAL-E[0] to ship WAL logs to S3. WAL-E is totally awesome. Love it!
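For anyone curious what that looks like: WAL-E hooks into Postgres's archive_command, and full base backups are pushed with backup-push. A sketch; the envdir path and data directory follow WAL-E's README conventions and may differ in your setup:

```
# postgresql.conf: ship each completed WAL segment to S3
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
archive_timeout = 60

# from cron or by hand: push a full base backup
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.1/main
```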
Please send feedback or patches; I'm happy to work with anyone who has an itch in mind.
If one has a lot of data, EBS becomes much more attractive because swapping the disks in the case of a common failure (instance goes away) is so much faster than having to actually duplicate the data at the time, presuming no standby. Although a double-failure of ephemerals seems unlikely and the damage is hopefully mitigated by continuous archiving, the time to replay logs in a large and busy database can be punishing. I think there is a lot of room for optimization in wal-e's wal-fetching procedure (pipelining and parallelism come to mind).
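This is not wal-e's actual code, just a sketch of the pipelining/parallelism idea: keep several segment downloads in flight concurrently while still handing segments to recovery in order (the `fetch_segment` stand-in is hypothetical).

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_segment(name):
    # Stand-in for downloading one WAL segment from S3.
    return f"contents-of-{name}"

def prefetching_fetch(segments, depth=4):
    """Yield (name, data) in order while up to `depth` downloads overlap."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        futures = [pool.submit(fetch_segment, s) for s in segments]
        for name, fut in zip(segments, futures):
            # In-order delivery: recovery replays this segment while the
            # pool keeps fetching the ones behind it.
            yield name, fut.result()

segments = [f"{i:024X}" for i in range(8)]  # fake WAL segment names
fetched = list(prefetching_fetch(segments))
```

Since WAL replay consumes segments strictly in order, prefetching hides the per-segment S3 round-trip latency without changing the replay logic at all.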
Secondly, EBS seek times are pretty good: one can, in aggregate, control a lot more disk heads via EBS. The latency is a bit noisy, but last I checked (not recently) it was considerably better than what RAID-0 on ephemerals would allow on some instance types.
Thirdly, EBS volumes share one's slice of the network interface on the physical machine. That means larger instance sizes see fewer noisy-neighbor effects and get more bandwidth overall, while RAID 1/1+0 are going to be punishing, since mirroring doubles the write traffic over that same interface. I'm reasonably sure (but not 100% sure) that mdadm is not smart enough to let a disk with decayed performance "fall behind", i.e. demote it from the array and prefer its mirrored partner. Overall, use RAID-0 and archiving instead.
When an EBS volume suffers a crash/remirroring event they will get slow, though, and if you are particularly performance sensitive that would be a good time to switch to a standby that possesses an independent copy of the data.
[0]: http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs...
I'm not exactly sure how the failover system works in Postgres; the last time I set up replication on Postgres it would only ship a WAL segment after it was fully written, but I know they have a much more fine-grained system now.
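The fine-grained system is streaming replication (PostgreSQL 9.0+), which ships WAL records over a normal connection as they are generated rather than waiting for a full 16 MB segment. A minimal sketch; the hostname and user are made up:

```
# postgresql.conf on the master:
wal_level = hot_standby
max_wal_senders = 3

# recovery.conf on the standby (9.x):
standby_mode = 'on'
primary_conninfo = 'host=db-master port=5432 user=replicator'
```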
If you use SQL Server you can add a 3rd monitoring (witness) server, and your connections fail over to the new master pretty much automatically as long as you add the 2nd server to your connection string. The setup with a 3rd server can create some very strange failure modes, though.
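On the client side that's just the `Failover Partner` connection-string keyword; the server and database names here are made up:

```
Server=sql-primary;Failover Partner=sql-mirror;Database=app;Integrated Security=True;
```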
8760 is the number of hours in a year.
(8760 * $0.24 * 170) + (8760 * $0.12 * 70) = $430,992/yr in hourly fees
($1,820 * 170) + ($910 * 70) = $373,100/yr in reservation fees
$373,100 + $430,992 = $804,092/yr; $804,092 / 12 months = $67,007.67/mo
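The arithmetic above, redone in a few lines of Python (the per-hour rates, reservation fees, and instance counts are taken straight from the comment):

```python
HOURS_PER_YEAR = 8760  # hours in a year, as noted above

# 170 instances at $0.24/hr plus 70 instances at $0.12/hr
hourly = HOURS_PER_YEAR * 0.24 * 170 + HOURS_PER_YEAR * 0.12 * 70
# reservation fees: $1,820 x 170 plus $910 x 70
reserved = 1820 * 170 + 910 * 70
monthly = (hourly + reserved) / 12

print(f"hourly:   ${hourly:,.0f}/yr")
print(f"reserved: ${reserved:,.0f}/yr")
print(f"monthly:  ${monthly:,.2f}/mo")
```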
Reference for last years calculations: http://www.reddit.com/r/blog/comments/ctz7c/your_gold_dollar...
The internet is truly a wondrous thing.
Supposing that the cost increased linearly with the number of users (which sounds like a bad hypothesis, but it's a start), the cost at the end of 2011 could be around $1M/year... That's impressive, but nowhere near the $300K/month proposed by rdouble.
So I would say that the monthly cost of reddit's infrastructure is around $90K. Which is really impressive.
1: http://www.reddit.com/r/blog/comments/ctz7c/your_gold_dollar...
https://www.facebook.com/notes/facebook-engineering/building-timeline-scaling-up-to-hold-your-life-story/10150468255628920

I couldn't imagine why.
2 TB? OMG, that's almost a decent-sized SQL Server instance. Yeah, it should take about an hour or two to replicate. I'm assuming they have 10 GbE on their DB server.