"Amazon's EBSs are a barrel of laughs in terms of performance and reliability"

(reddit.com)

270 points · quilby · 15y ago · 152 comments

102 comments · 30 top-level

ck215y ago· 11 in thread

I firmly believe "the cloud" is a fad, unless for some reason you own and operate all the hardware yourself (i.e., Google).

Like other technical fads, everyone will probably come back to servers they can reach out and touch when needed, sooner or later.

jedsmith15y ago

The cloud significantly lowers capital expenditure to get into an Internet-enabled business, which cultivates the very startup ecology that Y Combinator exists to leverage and support. Those teenagers who started the Facebook Pokemon game would have never had the resources to build a scalable solution with hardware that they own. (That is, unless Y Combinator paid a lot more money as part of participating. They might also be a bad example, because I remember that one of them had a successful sale...it's true for a lot of other ideas, so work with the example.) The cloud lowers the barrier to entry enough that good ideas can be explored and built, with very little financial risk to those getting into it.

This was the role of shared hosting in the past. Several years ago, everybody realized that having root is better. Now, instead of colocating two servers and negotiating transit and dealing with remote hands, you can spin up two Linodes for $40 and have enough power to build anything. Critical mass? Add three more. You're not waiting for a shipment of servers to the datacenter to handle a sudden load from a positive mention on HN.

Saying that the cloud is a fad and we should all own our gear does two things: (a) increases humanity's carbon footprint, since most organizations never utilize hardware to their full potential, and (b) guarantees that only those with significant capital to buy a fleet, a cage, and power will ever compete in the Internet space, which is where we were many years ago. It is very arguable that the cloud is progress, and everybody sitting on the sidelines calling it a "fad" is scared by it.

Jeremy Edberg of Reddit had a good comment later in that thread, to someone who paralleled the cloud to electricity generation:

http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

What sucks is, my remarks really depend on what you define "cloud" as, which -- partially thanks to Microsoft television commercials -- is currently up in the air.

api15y ago

The cloud's real advantage is the ability to build out fast, but it is not cost. It is cheaper to build it yourself and run it yourself if you know exactly what you need, and have time to do so. If you don't, the cloud is cheaper.

So you're right that the cloud is great for startups. It is not so great for established stuff.

romaniv15y ago

You can do a lot of the things you describe here with VPS (Virtual Private Servers). You get root access, you don't manage hardware, you often receive some virtualization benefits (images, snapshots). Does that count as "Cloud Computing"?

ck215y ago

If you start with the cloud, you best formulate an "exit plan".

Reddit for example doesn't seem to have one and seems quite stuck.

powertower15y ago

Cloud = Marketing(VPS);

gnaritas15y ago

> I firmly believe "the cloud" is a fad

You are wrong.

> everyone will probably come back to servers they can reach out and touch when needed, sooner or later.

No they won't, because most of us don't want to be managing hardware, ever.

tomkarlo15y ago

If you've ever had to deal with the expense and overhead associated with running a business that has extensive production systems, you wouldn't say that. The cloud represents a huge decrease in the initial cost necessary to set up production systems, and it relieves businesses of all kinds of issues regarding long-term leases on equipment or depreciation / amortization of equipment. You don't have to worry about swapping out racks just because they've reached an arbitrary end-of-lease date. You don't have to worry about provisioning hardware months in advance to make sure it's available "if" you need it.

There are definitely hiccups, but I can't imagine many guys running an internet-heavy business going forward are seriously going to say "let's build out our own datacenter rather than solve the issues with the cloud" unless they're doing something really, really, specialized.

Duff15y ago

It's not a fad, it's shared services. Sharing comes at the cost of flexibility, which can be a pain in the butt.

Personally, if I'm going to be operating a large computing environment, I'd rather stick 80% of my workload in a cloud environment and pay someone to deal with utilities, buildings, hardware, etc.

The remaining 20% may require a "higher touch" setup at a colo or a facility that I control. The smaller I can make that 20%, the less I need to spend on setting up and maintaining infrastructure.

fuzzmeister15y ago

That is entirely wrong. With AWS, we've built a multi-AZ load balanced infrastructure for very little time and money. Getting an equivalent setup out of our own hardware would have been orders of magnitude more expensive and time consuming.

vdondeti15y ago

Can you please explain how you built a multi-AZ load-balanced infrastructure, given that Amazon's ELB only load balances within a given AZ? I assume you used some external service. Would you mind providing the details? Thanks.

gaius15y ago

You know, back in the 70s, there was the concept of a "computer bureau", these would be some people who had a mainframe and you would rent time on it by the hour, so if you had a payroll run, or a simulation, or whatever, you would upload it to them via a modem (or courier them the punchcards!), run it there, download the results (or get them delivered printed out). Early BBSs and MUDs often ran in spare capacity on these mainframes.

There ain't nothin' new under the sun...

mithaler15y ago· 8 in thread

We were bitten by EBS' slowness at my company recently, when moving an existing project to AWS. You effectively can't get decent performance off of a single EBS volume with PostgreSQL; you need to set up 10 or so of them and make a software RAID to remove the bottleneck. It's a fairly large time commitment to build and maintain, but it's pretty fast and reliable once it's up and running (cases like the recent downtime notwithstanding).

Can anyone tell me if MySQL fares any better than Postgres on a single EBS volume? I wouldn't assume it does but I shouldn't be making assumptions.

gpapilion15y ago

MySQL does not fare any better on a single EBS volume. The issues with EBS are systemic. Similarly, you have to RAID several volumes together to see decent performance, and this is the recommended AWS solution.

joevandyk15y ago

Did you use RAID 10? I would love to see a post on using PostgreSQL with EC2/EBS: how to set up RAID, etc.

grourk15y ago

Orion Henry at Heroku wrote about this and described different software RAID configurations and the performance characteristics of each a while back:

http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs...

gregburek15y ago

I found a benchmark from 2008 that details the problems with RAID10 and sourced it in a comment above [1]. These are just raw disk transfer numbers, though. I can only imagine how they would change as CPU usage/postgres load climbs. IIRC disk IO is network traffic and network traffic is CPU dependent, so as load increases, IO will suffer greatly.

[1] http://news.ycombinator.com/item?id=2341425

gregburek15y ago

Build-out Script for Postgres/PostGIS with RAID 10 on Amazon EBS volumes: http://sproke.blogspot.com/2010/12/build-out-script-for-post...

Vivtek15y ago

I second that.

saurik15y ago

Did you do any performance tweaking to PostgreSQL with respect to EBS? You have an insanely deep write buffer and quite good random read performance with EBS, which is nothing like the disks people normally deploy PostgreSQL to.

bmurphy15y ago

I tuned the hell out of our big PostgreSQL instance a year ago, but I'll be damned if I can remember the rationale for every change. I have a list of all the changes from default, but I've long since forgotten/lost the reason for making them.

That being said, we get more bang for our buck by spreading our data across many small databases that don't need much tuning beyond upping the memory defaults. The EC2 cloud isn't great for the uber-server, but it's halfway decent for many small servers.
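
For anyone wondering what "spreading data across many small databases" might look like in practice, here is a rough Python sketch of key-based routing. The connection strings, the shard count, and the helper names (`shard_for`, `dsn_for`) are all made up for illustration; nothing here is from the setup described above.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    # Python's built-in hash() is randomized per process for strings,
    # so a cryptographic digest is used purely for its stability.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical connection strings, one per small database.
SHARDS = [f"postgresql://db{i}.internal/app" for i in range(8)]


def dsn_for(key: str) -> str:
    """Pick the database that owns this key."""
    return SHARDS[shard_for(key, len(SHARDS))]
```

One caveat with plain modulo routing: changing the number of shards remaps most keys, so the shard count is usually chosen generously up front.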

jameskilton15y ago· 7 in thread

This comment further down, supposedly from an Amazon employee, paints a grim picture for EBS: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

jedsmith15y ago

The perspectives of disgruntled employees have been known to be worse than reality, on occasion. Not definitively saying that's the case here, just saying.

hn_throw_away15y ago

I work at Amazon - a lot of teams are like this. They're stuck managing a woefully broken product and spend all of their time propping up the beast, leaving no capacity left for meaningful fixes (in these cases, meaningful fixes are always gigantic engineering projects).

The team develops a reputation internally for glorified firefighting and has trouble recruiting. More senior engineers eventually flee (having, well, choice in the matter), leaving a team heavy with junior talent and no seasoned gurus leading the way.

The company is also growing at ludicrous speed, and hiring is difficult. When the product is in such a painful state, attrition from the team is high, and with slow hiring you are barely countering attrition (exacerbating the junior talent problem), and not even close to growing the team to be in a position to take care of the problem for good.

I suspect this is an industry-wide problem though, and is hardly unique to this place.

ceejayoz15y ago

As are often the perspectives of folks from another team, or trolls posing as employees. Definitely a grain of salt needed.

asb15y ago

Even if true, I don't think that comment is fair to the EBS team. It seems a likely reason for that behaviour is they don't have enough resource to work on and test (presumably) large changes to fix the underlying issue while also fixing the tickets which crop up. The tone of the comment seems to imply they're foolishly overlooking the obvious solution.

smhinsey15y ago

This is totally true, but at the same time, given the success and scale of AWS, it's insane that they would not have the resources they need.

watchandwait15y ago

This comment does not seem legitimate. There's nothing in it that would imply special knowledge of AWS or EBS.

gexla15y ago

Yeah, I call BS on this one. Amazon AWS is for certain use cases. It has never been a platform for all solutions. In fact, it has mostly been a platform for people to craft their own solutions. AWS is not a web hosting platform. If you want a web hosting platform, you create one the best you can from the tools available. This is the sort of response I would expect from an employee of AWS, not the one I saw in those comments. That or maybe the comment was from a customer service guy who isn't a developer.

I'm surprised Reddit ever thought AWS would be a good platform to host on. You don't bitch about it, you create the best system you can and if something doesn't work, then you need to do more work. If you don't want to put in the work, then AWS is wrong for you. You don't see Heroku bitching about AWS, rather they made the thing work for them with great engineers.

rlpb15y ago· 6 in thread

RAIDing together multiple EBS volumes feels like a massive hack to me. I can't help but wonder if this compounds the problem at Amazon's end. If EBS performance is a problem, Amazon need to fix it. For example, if some way of tying together multiple EBS volumes is a reasonable way of working around the problem, then why aren't Amazon providing "high performance" EBS volumes which do that under the hood?

If I were faced with EBS performance issues, I would see this as a big red flag, consider EBS unsuitable for the application and avoid it, rather than carrying on with such a workaround.

andrewvc15y ago

One other huge downside of raiding EBS volumes is you can't use EBS's snapshotting features as you cannot guarantee a perfect sync (you could use LVM yourself however).

Honestly, since EBS vols are supposedly not tied to a single disk, the raiding should be done on Amazon's end. That it isn't is telling.

saurik15y ago

You have to snapshot at the system level anyway if you want a consistent snapshot: otherwise the filesystem (or your database) could have been reordering and delaying writes that end up not being part of the "consistent snapshot". This is simply not a RAID-specific issue, nor is it a problem with EBS (as it is generally easy to use LVM, xfs, and/or PostgreSQL to handle that part of the job).

WALoeIII15y ago

xfs_freeze

In fact there is a handy package called ec2-consistent-snapshot (https://launchpad.net/ec2-consistent-snapshot) that will manage this for you!
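
For reference, the freeze, snapshot, unfreeze sequence can be sketched in a few lines of Python. This is only the general shape, not how ec2-consistent-snapshot is implemented. `take_snapshot` is a hypothetical callable standing in for whatever kicks off the EBS snapshot(s), and the `run` parameter is there only so the sequence can be exercised without a real XFS mount.

```python
import subprocess


def consistent_snapshot(mount_point, take_snapshot, run=subprocess.run):
    """Freeze an XFS filesystem, take a snapshot, and always unfreeze.

    take_snapshot is a hypothetical callable (not part of any AWS API)
    that starts the snapshot of the underlying volume(s).
    """
    # -f suspends new writes and flushes dirty data to the block device.
    run(["xfs_freeze", "-f", mount_point], check=True)
    try:
        return take_snapshot()
    finally:
        # -u resumes writes; do this even if the snapshot call fails.
        run(["xfs_freeze", "-u", mount_point], check=True)
```

If the filesystem sits on a software RAID, `take_snapshot` would presumably need to start a snapshot of every member volume before the unfreeze. A database on top may still want its own quiesce or backup mode, as mentioned above.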

bluegene15y ago

Maybe I'm missing something here: why is there even a discussion about RAID at the EBS level? When Amazon says, "Amazon EBS volumes are designed to be highly available and reliable" and we still have to talk about RAID, then the issue is on Amazon's end.

xpaulbettsx15y ago

I think most people are doing RAID-0 to get more perf out of EBS volumes

acdha15y ago

I wish I had more than one upvote for this: swimming against a trend like that never works out well.

gruseom15y ago· 5 in thread

Anybody care to comment on using EC2 with local (what Amazon calls ephemeral) storage and backup to S3? Seems to me the advantages are: it's cheaper and you avoid the performance and reliability problems with EBS. The disadvantages?

snorkel15y ago

Using EBS has other features that are hard to overlook, such as snapshots and ability to quickly move your volumes to another instance when an instance failure happens, or if you needed to change the size of an instance (which you couldn't do directly until very recently).

krakensden15y ago

All of your EC2 instances can disappear without warning and everything on the local storage is now gone forever.

gruseom15y ago

That's the "backup to S3" part.

cachemoney15y ago

EBS-RAID0 is much faster for reads than local. Local is faster for writes.

riledhel15y ago

this seems to contradict several comments here. "citation needed".

absconditus15y ago· 5 in thread

How is it that Amazon.com is so reliable if there are so many problems with their "cloud" products? Do they not use the same software to run their site?

gpapilion15y ago

If you understand the limitations of the various products you can build a VERY reliable service. The reddit assumption of a single datacenter and single technology to store that data was an engineering failure. They essentially didn't have a disaster recovery plan in place.

snorkel15y ago

I'm sure reddit's engineers are as capable as any of producing a seamless disaster recovery plan, but the most common obstacle to implementing it is cost. Most web services choose the occasional risk of downtime in one data center instead of incurring the cost of being in two data centers at all times.

weavejester15y ago

I suspect it's because amazon.com has different performance requirements. For instance, I imagine the read/write balance is very different for amazon.com than for reddit.com.

danielrhodes15y ago

Amazon.com is not hosted on EC2. It's entirely separate.

rbranson15y ago

This isn't entirely true. Amazon.com uses EC2 in addition to dedicated servers.

http://searchcloudcomputing.techtarget.com/news/1516269/Amaz...

snorkel15y ago· 4 in thread

Having been at a startup that used hundreds of EC2 instances and EBS volumes I can assure you all that Amazon EBS performance is downright terrible and Amazon didn't inspire any confidence that they could solve it.

Even worse than the EBS performance is Amazon does not offer any shared storage solutions between EC2 instances. You have to cobble together your own shared storage using NFS and EBS volumes making it sucky to the Nth power.

EC2 is fine for Hadoop-style distributed work loads, and distributed data stores that can tolerate eventual consistency, that's all good. But for production database applications requiring constant and reliable performance, forget it.

watchandwait15y ago

My experience with the AWS RDS database product has been excellent.

krobertson15y ago

We looked at RDS and had a call with some of their engineers, but we basically had our EC2 + raid'd EBS set up almost the same as they did, all best practices already being done.

Since RDS really is EC2 + EBS, they couldn't provide any real assurances it performed better than our own installation.

We ended up moving off of AWS as a whole. After several discussions about how we can continue to scale, the ultimate answer was without AWS.

EC2 is great for distributed stuff, but when you need something that is heavy on I/O, for instance, it is a big problem. Scaling it ends up costing more to work around AWS's performance problems than to go elsewhere.

ceejayoz15y ago

The biggest issue I have with RDS is that I can't do a multi-master deployment to scale up writes. I've got a very write-heavy workload in my systems (roughly one write for every two reads).

btucker15y ago

Anything more you could share about it? We got scared off (at least for now) by the inability to replicate in from a self-hosted MySQL instance (for migration purposes), but would still love to hear more about your experiences.

tzs15y ago· 4 in thread

We've been looking at moving some or all of our stuff to either Amazon EC2/EBS/S3 or Rackspace cloud hosting, and it has been interesting.

Amazon seems more flexible, since you buy block storage (EBS) independent of instances. If you have an application that needs a massive amount of data, but only a little RAM and CPU, you can do it.

Rackspace, on the other hand, ties storage to instances. If you only need the RAM and CPU of the smallest instance (256 MB RAM) but need more than the 10 GB of disk space that provides, you need to go for a bigger instance, and so you'll probably end up with a bigger base price than at Amazon.

On the other hand, the storage at Rackspace is actual RAID storage directly attached to the machine your instance is on, so it is going to totally kick Amazon's butt for performance. Also, at Amazon you pay for I/O (something like $0.10 per million operations).

Looking at our existing main database and its usage, at Amazon we'd be paying more just for the I/O than we now pay for colo and bandwidth for the servers we own (not just the database servers...our whole setup!).
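
To get a feel for how per-operation pricing adds up, here is a back-of-the-envelope Python sketch. The $0.10 per million figure is the approximate one quoted above; the 1,000 IOPS workload is a made-up example, not our actual database load.

```python
SECONDS_PER_DAY = 86_400


def monthly_io_cost(avg_iops, price_per_million=0.10, days=30):
    """Estimated I/O charge in dollars for a sustained average request rate."""
    operations = avg_iops * SECONDS_PER_DAY * days
    return operations / 1_000_000 * price_per_million


# Hypothetical workload: 1,000 I/O operations per second, around the clock.
print(f"${monthly_io_cost(1_000):.2f} per 30 days")
```

At that hypothetical rate the I/O charge alone comes to roughly $259 over 30 days, before instances, storage, or bandwidth.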

The big lesson we've taken away from our investigation so far is that Amazon is different from Rackspace, and both are different from running your own servers. Each of these three has a different set of capabilities and constraints, and so a solution designed for one will probably not work well if you just try to map it isomorphically to one of the others. You don't migrate to the cloud--you re-architect and rewrite to the cloud.

delano15y ago

If you're interested to see how sites perform on EC2 and Rackspace over time:

https://www.blamestella.com/vendor/ec2

https://www.blamestella.com/vendor/rackspace

bretpiatt15y ago

You're monitoring from AWS US-East it looks like, you'll want to mention that to give people some context around the latency numbers.

icey15y ago

I use the Rackspace cloud for a few Windows servers. The experience has been mostly positive, but they disappear for a few minutes each week, which is kind of troubling.

metageek15y ago

Disappear as in they crash and reboot, or disappear as in they're unpingable?

jclouds-fan15y ago· 4 in thread

Why is reddit relying on only one cloud provider? AWS can/should do better but service providers of the size of reddit should be using multi-vendor set-ups for sure.

bhousel15y ago

They did say in their original post-mortem that spreading the load among multiple availability zones has been on their todo list for a while. It has just taken longer than they expected with their limited engineering staff.

jvanenk15y ago

It probably has something to do with the group being very small. Sure they turn a lot of traffic, but there's only so much you can do with a group of their size on what I imagine is still a limited budget.

rworth15y ago

Sounds like a case similar to safety systems at a nuclear plant: not pressing until it is REALLY PRESSING! It's the usual dilemma: investing time/money in something that most likely won't be needed versus adding that cool feature all the users will immediately see the benefit of. In a competitive environment, it isn't difficult to understand how they ended up on one vendor.

fuzzmeister15y ago

Is a multi-provider setup common? I certainly think Reddit should be on multiple availability zones within AWS, but spanning multiple providers seems hugely more difficult.

parasubvert15y ago· 2 in thread

Generally speaking this is the sort of thing that people warn about when they say "if you want to run on a cloud, you need to design your application for a cloud". Meaning, you can't presume your infrastructure is dedicated and carries similar MTBFs to (say) an enterprise hard drive, which is upwards of 1 million hours.
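
As a rough illustration of what an MTBF figure means in practice, the Python sketch below converts it to a yearly failure chance. It assumes a constant failure rate (an exponential model), which is a simplification of how real drives behave, and the fleet size of 1,000 is a made-up example.

```python
import math

HOURS_PER_YEAR = 8_760


def annual_failure_probability(mtbf_hours):
    """Chance a device fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours, fleet_size):
    """Expected number of failures per year across a fleet of such devices."""
    return fleet_size * HOURS_PER_YEAR / mtbf_hours
```

Under that assumption, an MTBF of 1 million hours works out to roughly a 0.87% chance of failure per drive per year, or about 8.76 expected failures a year across 1,000 drives.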

Amazon provides plenty of opportunities to mitigate for this, such as providing multiple availability zones. Reddit, if you read the original blog post, wasn't designed for that - it was designed for a single data centre.

OTOH, the variability of EBS performance is real, and frustrating. If you do a RAID0 stripe across 4 drives, you can expect around 100 MB/sec sustained, modulo hiccups that can bring it down by a factor of 5. On a compute cluster instance (cc1.4xlarge) it's more like up to 300 MB/sec if you go up to 8 drives, since they provision more network bandwidth and seem to be able to cordon it off better with a placement group.

khafra15y ago

> modulo hiccups that can bring it down by a factor of 5.

The comments on reddit indicated hiccups more on the order of 10x and, sometimes, 100x.

Either way, the issue is that the more drives you add to your RAID0, the more often one of those drives experiences a "hiccup," and kills the performance of the entire volume.
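
To put rough numbers on that, here is a small Python sketch. It assumes volumes hiccup independently with the same probability, and it models a stripe as running at the pace of its slowest member. Both are simplifications, and the 1% hiccup chance and the per-volume rates are made-up inputs.

```python
def stripe_hiccup_probability(p_single, n_volumes):
    """Probability that at least one of n independent volumes is degraded."""
    return 1.0 - (1.0 - p_single) ** n_volumes


def stripe_throughput(member_rates_mb_s):
    """Simplified RAID0 model: every member is held to the slowest one's pace."""
    return min(member_rates_mb_s) * len(member_rates_mb_s)
```

With a hypothetical 1% chance that any one volume is degraded at a given moment, a 4-volume stripe is affected about 3.9% of the time and an 8-volume stripe about 7.7%. In the same toy model, four volumes at 25 MB/sec give 100 MB/sec, and a single member dropping to 5 MB/sec pulls the whole stripe down to 20 MB/sec.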

parasubvert15y ago

It's not clear this was a single volume problem so much as an issue with one or more network switches in that availability zone (if you look at the AWS service health notes for that date).

Even in your own data centre, if your FC fabric goes wonky, your whole SAN is hosed.

Kilimanjaro15y ago· 2 in thread

Lesson for startups: start in the cloud, grow your business, build your own cloud.

Never trust critical parts of your business to others.

mkramlich15y ago

Good advice but I'd argue there's one tweak to make that even better: start outside the cloud (say, just some Linux VMs from Linode or whatever), then only if you get enough real customer/visitor demand to warrant easy/virtual scaling, then move to a cloud provider. Needing a cloud/elastic hosting provider is a bit of a Maserati Problem. If you get to the point where you have to build/manage your own data centers (like Google, Amazon, Orbitz), you have a Fleet-of-Maseratis Problem.

spidaman15y ago

Netflix seems to be the biggest counter case - grew their data centers and effectively gave up and moved it to AWS. I suspect the sweet spot is doing a bit of public and private cloud, adjusting how much is on one or the other based on costs, service levels and capacity requirement volatility.

floodfx15y ago· 2 in thread

I'll probably be downvoted for this but seems to me the root cause of this problem is Reddit's architectural decision to remain in a single availability zone. If it wasn't EBS it could have been some other issue related to the single AZ that could have brought the site down. Blaming EBS, particularly if you knew it to be a potential weakness in your architecture, seems like a deflection of responsibility.

snorkel15y ago

Perhaps reddit could've mitigated some downtime with some cross-zone redundancy, but the underlying frustration is that Amazon does not provide a well behaved storage solution, which is a very critical infrastructure component for most web services.

fuzzmeister15y ago

Exactly. While Amazon clearly tries to make single-zone reliability as good as possible, I think they expect customers to use a multi-AZ setup if they expect true reliability.

Zak15y ago· 2 in thread

I recently had an EBS volume lose data for no apparent reason. I'm not a heavy EC2 user at all - I was just doing some memory/cpu-heavy stuff that wouldn't fit in to RAM on my laptop and using EBS as a temporary store so I could transfer data using a cheap micro instance and only spin up the big expensive instances when everything was in place. I ended up downloading files on an m2.4xlarge because the files I had just downloaded to the EBS volume vanished.

saurik15y ago

Are you certain the data left the filesystem buffer and actually got acknowledged by EBS?
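
For what it's worth, this is roughly what forcing data out of the buffers looks like from Python on a POSIX system. `write_durably` is just an illustrative helper name. Whether the storage underneath honors the flush is a separate question that this sketch cannot answer.

```python
import os


def write_durably(path, data):
    """Write bytes and ask the kernel to push them to the block device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer
        os.fsync(f.fileno())  # flush the OS page cache for this file
    # Also sync the directory so the new file name itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A cleanly unmounted filesystem should also have flushed its buffers, so this matters most for data written shortly before a crash or a forced detach.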

Zak15y ago

No; I'm very much a beginner when it comes to EC2. I unmounted the filesystem, detached the volume, then shut down the instance.

Andys15y ago· 2 in thread

The AWS business model is to sell shared hosting on commodity hardware. Cloud is a cool buzzword but it is still sharing hardware. Cheap, commodity hardware is the magic that lets you scale up so big and so fast for a highly accessible price.

But you're still sharing the same hardware as everyone else and it's still just commodity hardware.

smhinsey15y ago

For what it's worth, it's not entirely accurate to say that you are always using shared hardware on AWS, at least for your servers. It depends on how you set up your environment.

weavejester15y ago

Sharing hardware is an implementation detail. You could potentially build a cloud infrastructure where everyone has dedicated hardware. The whole point of cloud computing is that the implementation shouldn't matter to end users.

jedsmith15y ago· 1 in thread

Never fails: a cloud provider has issues with a specific cloud product, so clearly the cloud is an illusion that will crash down on you[1]. Any discussion about any cloud provider's product is obviously a chance to soapbox about the industry as a whole.

[1]: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

magicseth15y ago

In the minds of many people, Amazon is the most well known, and respected cloud provider. Their outage is merely a reminder that the one big difference between cloud services and in-house services is, I can't control it. Of course the probability that Amazon will have better uptime than you is pretty high for most people, but you have no recourse when there is a problem.

hemancuso15y ago· 1 in thread

I've never understood how people can use EBS in production. The durability numbers they quote are bad and they wave their hands around about increased durability with snapshots, but never quantify what that means.

Hard drives are unreliable and they certainly don't fail independently of one another - but their failures are much more independent than those of EBS volumes.

With physical drives and n-parity RAID you drastically reduce the rate of data loss. This is because although failures are often correlated, it's quite unlikely to have permanent failure of 3 drives out of a pool of 7 within 24 hours. It happens, but it is very rare.

With EBS, your 7 volumes might very well be on the same underlying RAID array. So you have no greater durability by building software RAID on top of that. If anything, it potentially decreases durability.
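
A quick Python sketch of the gap between those two cases. It treats drive failures as fully independent, which overstates the benefit since real failures are correlated, and the 1% per-window failure chance is a made-up input.

```python
from math import comb


def prob_at_least_k_fail(n, k, p):
    """P(at least k of n drives fail in a window) if failures are independent."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

With a hypothetical p = 0.01 per drive per window, losing 3 or more of 7 independent drives comes out around 3.4e-5. If all seven volumes sit on one underlying array, a single array failure takes them all out at once, so the comparable figure is just 0.01.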

You could utilize snapshots to S3, but is that really a good solution? It seems that deploying onto EBS at any meaningful scale is a recipe for guaranteed data loss. RAID on physical disks isn't a great solution either, and there is no substitute for backups - but at least you can build a 9 disk RAID-Z3 array that will experience pool failure so rarely that you can more safely worry about things like memory and data bus corruption.

saurik15y ago

The increased durability based on snapshots is actually quite simple, and they explain it in various places: if one of the drives in Amazon's RAID fails, they need to bring up a new disk to replace it in the array. When they bring up new disks they typically can do this instantaneously, because they really just dynamically page fault the drive from your latest snapshot. However, all dirty data since the last snapshot will have to be copied from the other drive(s). This is a window of time during which your array is exposed to unrecoverable read errors losing data. The less dirty data you have, the smaller this window of time.
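
To make the "less dirty data, smaller exposure" point concrete, here is a Python sketch of the chance of hitting at least one unrecoverable read error while copying a given amount of data. The error rate of 1e-14 per bit is an assumed figure of the kind quoted on drive spec sheets, not anything Amazon publishes, and the model says nothing about EBS internals.

```python
import math


def ure_probability(bytes_to_copy, ure_per_bit=1e-14):
    """P(at least one unrecoverable read error) while reading this many bytes."""
    bits = bytes_to_copy * 8
    # 1 - (1 - r)^bits, computed in a numerically stable way for tiny r.
    return -math.expm1(bits * math.log1p(-ure_per_bit))
```

Under that assumed rate, copying 1 TB of dirty data carries roughly a 7.7% chance of an unrecoverable read error, while copying 10 GB carries under 0.1%.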

prakash15y ago· 1 in thread

We (Cedexis) presented our findings on - How do EC2's East, West, EU & APAC zones compare: (pdf) http://www.cloudconnectevent.com/2011/presentations/free/76-...

If you would like to know more please send me an email: prakash [at] cedexis.com

jerf15y ago

You should post that to HN, if you haven't already. Possibly wrap a blog post around it.

steve91815y ago· 1 in thread

This very moment our team is restoring Postgres volumes because the EBS volumes our primary and secondary were on both failed simultaneously.

obfuscate15y ago

Were both in the same availability zone?

danielrhodes15y ago· 1 in thread

What's the failure rate of EBS versus having direct access to physical disks? My guess is that at scale, it's probably similar.

Although you would hope that the storage components of AWS's cloud were highly reliable, I think the main benefit is not single instance reliability but being able to recover faster because of quickly available hardware.

bmurphy15y ago

I don't have solid numbers, just some experience using this. Ephemeral drives outright fail more often than EBS volumes, however, EBS volumes suffer performance degradation significantly more often than ephemeral drives. EBS volume performance is HIGHLY variable, at all times of day, no matter what load you throw at it. Ephemeral drives are very consistent most of the time.

Both types of drives CAN and DO fail, so RAID-10, fail over, and replication are a must have.

j_s15y ago· 1 in thread

Being totally new to AWS, why does everyone skip right past using ZFS?

http://blogs.sun.com/marchamilton/entry/a_brilliant_argument... "Cloud Storage Will Be Limited By Drive Reliability, Bandwidth ... The key feature of ZFS enabling data integrity is the 256-bit checksum that protects your data."

jodrellblank15y ago

ZFS will ensure that what was written to disk comes back to memory consistently, or with errors spotted. It won't ensure that the right thing was written to disk, or that the database IDs which were written leave your database relationships in a consistent state, etc.

ZFS will do nothing about this "More recently we also discovered that these disks will also frequently report that a disk transaction has been committed to hardware but are flat-out lying.", for instance, other than tell you the data you want isn't there to be read - like any filesystem would.
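
A toy Python illustration of that distinction. SHA-256 stands in here for a 256-bit block checksum; this is not ZFS's on-disk format or API, and `store` and `verify` are made-up helper names.

```python
import hashlib


def store(block: bytes):
    """Return the block together with its 256-bit checksum."""
    return block, hashlib.sha256(block).digest()


def verify(block: bytes, checksum: bytes) -> bool:
    """True if the bytes read back match the checksum taken at write time."""
    return hashlib.sha256(block).digest() == checksum


data, digest = store(b"balance=100")
corrupted = b"balance=900"            # what the disk hands back after silent corruption
assert not verify(corrupted, digest)  # the mismatch is detected on read

# The application wrote the wrong value in the first place:
wrong, wrong_digest = store(b"balance=-1")
assert verify(wrong, wrong_digest)    # the checksum is valid; the data is still wrong
```

The checksum catches bytes that changed after they were written. It has no opinion on whether those bytes were correct to begin with.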

PaulHoule15y ago· 1 in thread

I love the idea behind EBS, a SAN makes life so much easier, but I too find that EBS glitches are the largest cause of unreliability in AWS.

I'm not immediately planning to move out of AWS, but the trouble with EBS has certainly got me thinking about other options and has made me much less inclined to make an increased commitment to AWS.

drivebyacct215y ago

EBS is not a SAN, which is largely the point being made in these comments and in the other HN article on reddit's post mortem.

lurker1715y ago· 1 in thread

EMR is a mess too. The Amazon-blessed Pig is almost a year and 2 major releases behind, and the official EMR documentation seems to describe a version of EMR that doesn't even exist.

"Elastic" is AWS's claim to fame, but I am not seeing it.

Trying to resize an EMR cluster (which is half the point of having an EMR cluster instead of buying our own hardware) generates the cryptic error "Error: Cannot add instance groups to a master only job flow" that is not documented anywhere.

(Why would Amazon even implement a "master only job flow", which serves no purpose at all?)

adpowers15y ago

The master-only job flow is designed to let users play around with the instance and discover things without having to pay for a full cluster. A single-node cluster is configured very differently from a multi-node cluster, and that is why you can't expand a single-node cluster. If you had started with a two-node cluster you would have been able to expand it.

Also, if you want Pig you should complain about it vocally on the EMR forum. That is the best way to get them to listen to you.

SemanticFog15y ago

We had consistent serious problems related to EBS for a several-month streak about a year ago, and I heard almost identical stories from other EC2 users around the same time. Instances with EBS attached would suddenly become completely unreachable via the network. Sometimes we had to terminate the instances, but usually we could revive them by detaching all (or most) of the EBS volumes, then reattaching and rebooting. Amazon seems to have fixed this problem, but I wouldn't be surprised if we suffered in the future the way reddit has.

Overall, EC2 is a very impressive offering, for which I commend Amazon. At times, I've been so frustrated that I'm ready to switch, but they fix things just quickly enough that I never quite get around to it. In the end, I'm willing to accept that what they're doing is hard, there will be mistakes, and it's worth suffering to get the flexibility and cost-effectiveness that EC2 offers.


bmurphy15y ago

Having run a 200 GB, millions-of-transactions-per-day Postgres cluster on Amazon's EC2 cloud for two years now, I can attest to the fact that EBS performance and reliability SUCKS. It is our SINGLE biggest problem with EC2.

200 GB really isn't all that big of a database. It shouldn't have to be this hard.

jread15y ago

I was at the Cloud Connect conference last week. In a session on cloud performance, Adrian Cockcroft (Netflix's Cloud Architect) spoke and said they do not use EBS because of performance and reliability issues. They initially had some bad experiences with EBS and because of this decided to stick with ephemeral storage almost exclusively.

The guys from Reddit also spoke about their use of EC2. Apparently they are running entirely on m1 instances which suffer from notoriously poor EBS performance relative to m2 and cc1/cg1 instances.

obfuscate15y ago

For a data set in the mere tens to hundreds of GB (in MongoDB, if anyone's curious), is there any reason I shouldn't conclude from this that I should use instance storage only (with multi-AZ replication and backups to S3, both of which I would be doing in any case)? Moderately slower recovery in the rare event of an instance failure seems better than the constant possibility of incurable, performance-killing degradation.

(Edit: I hadn't considered the possibility of somehow killing all my instances through human error. Ouch. That probably warrants one slave on EBS per AZ.)

cpg15y ago

This seems too much of a coincidence.

We released a Dropbox-like sync product and the back-end is on EBS. Yesterday we twice saw a device fill up to 7GB, and as it got closer to full it became slower and slower and slower. We did not have any instrumentation/monitoring in place and we immediately suspected it was something on our end.

We (wrongly?) assumed reliability and (decent) performance from AWS.

natch15y ago

Isn't EBS intended for stuff like Hadoop job temporary data used during processing?

This kind of complaint reminds me of people who buy a product that does A very well, but then they trash it in reviews for not doing B. It was never advertised as doing B, but you'd never know that from the complaining.

amitraman115y ago

We used Amazon and got bad performance in the beginning too. It is bad when you pull files out of S3. By bad I mean the latency is high.

We tried GoGrid and they lost or crashed our server instance.

I've personally used Rackspace, so far so good, but I've only been doing development on it.

yuhong15y ago

On the comment itself, I have this: http://news.ycombinator.com/item?id=2339715


