I really think that one paragraph in his blog post summed everything up quite nicely. It could not ring more true:
My opinion is that the only reason the big enterprise storage vendors have gotten away with network block storage for the last decade is that they can afford to over-engineer the hell out of them and have the luxury of running enterprise workloads, which is a code phrase for “consolidated idle workloads.” When the going gets tough in enterprise storage systems, you do capacity planning and make sure your hot apps are on dedicated spindles, controllers, and network ports.
And now you have situations on a regular basis where you type "ls", your shell hangs, and not even "kill -9" is going to save you. And you go back to using FTP or some other abstraction that does not apply 40,000-hour-MTBF thinking to equipment that disappears for coffee breaks daily.
The great quote by Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
And this was an excellent and honest article about faulty programming abstractions. It's basically bashing you over the head with the "Fallacies of distributed computing". Don't silently turn local operations into remote operations. They're not the same thing and have to be treated differently at all levels.
Even Werner Vogels wrote a diatribe against "transparency", which is the same issue by another name: http://scholar.google.com/scholar?cluster=700969849916494972...
So I wonder what he thinks of this architectural choice. You have to give up something when communicating over the network. Vogels seems to have chosen consistency rather than availability in his designs. This paper was a turning point in his research. Its candor surprised me.
The file system interface does not let you relax consistency, so by default you have chosen availability. As the Joyent guys honestly remarked, this often has to be learned the hard way.
An NFS server is very simple. With NFS on its own VLAN, and some very basic QoS, there's no reason an NFS server should be the weak point in your infrastructure. Especially since it's resilient to disconnection on a flaky network.
If you're looking for 100% availability, sure, NFS is probably not the answer. If on the other hand you're running a website, and would rather trade a few bad requests for high-availability and portability, then NFS can be a great fit.
None of that has anything to do with EBS or block-storage though.
Joyent's position is that iSCSI was flaky for them because of unpredictable loads on under-performing equipment. The situation would degrade to the point that they could only attach a couple VM hosts to a pair of servers for example, and they were slicing the LUNs on the host, losing the flexibility networked block-storage provides for portability between systems.
Here's what we do:
We export an 80GB LUN for every running application from two SAN systems.
These systems are home-grown, based on Nexenta Core Platform v3. We don't use de-dupe since the DDT kills performance (and if Joyent was using it, then is local storage without it really a fair comparison?). We provide SSDs for the ZIL and L2ARC.
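For the curious, a setup like this can be sketched on Nexenta/OpenSolaris with ZFS plus COMSTAR. Everything here is illustrative: the pool name, device names, and zvol name are placeholders, not their actual config.

```shell
# Pool with mirrored data disks, an SSD slog (ZIL) and an SSD cache (L2ARC).
# Device names (c1t0d0 etc.) are placeholders for your hardware.
zpool create tank mirror c1t0d0 c1t1d0 log c1t2d0 cache c1t3d0

# One 80GB zvol per application, exported as an iSCSI LUN via COMSTAR.
zfs create -V 80G tank/app01
svcadm enable -r svc:/system/stmf:default
sbdadm create-lu /dev/zvol/rdsk/tank/app01
stmfadm add-view 600144f0...        # the GUID printed by sbdadm create-lu
itadm create-target                 # bring up the iSCSI target portal
svcadm enable -r svc:/network/iscsi/target:default
```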
These LUNs are then mirrored on the Dom0. That part is key. Most storage vendors want to create a black-box, bullet-proof "appliance". That's garbage. If it worked maybe it wouldn't be a problem, but in practice these things are never bullet-proof, and a failover in the cluster can easily mean no availability for the initiators for some short period of time. If you're working with Solaris 10, this can easily cause a connection timeout. Once that happens you must reboot the whole machine even if it's just one offline LUN.
It's a nightmare. Don't use Solaris 10.
snv_134 will reconnect eventually. Much smoother experience. So you zpool mirror your LUNs. Now you can take each SAN box offline for routine maintenance without issue. If one of them out-right fails, even with dozens of exported LUNs you're looking at a minute or two while the Dom0 compensates for the event and stops blocking IO.
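On the Dom0 side, the mirrored-LUN arrangement looks roughly like this. The discovery addresses and disk names are made up; this is a sketch of the technique, not their exact commands.

```shell
# Discover one LUN from each of the two SAN heads (addresses are examples).
iscsiadm add discovery-address 10.10.0.11
iscsiadm add discovery-address 10.10.0.12
iscsiadm modify discovery -t enable     # enable sendtargets discovery

# Mirror the two remote LUNs in a zpool. If one SAN head goes away,
# ZFS keeps running on the surviving half of the mirror.
zpool create appdata mirror \
    c4t600144F0AAAA0000d0 c4t600144F0BBBB0000d0

# Routine maintenance: offline one side, service the box, resilver after.
zpool offline appdata c4t600144F0BBBB0000d0
zpool online  appdata c4t600144F0BBBB0000d0
```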
These systems are very fast. Much faster than local storage is likely to be without throwing serious dollars at it.
These systems are very reliable. Since they can be snapshotted independently, and the underlying file-systems are themselves very reliable, the risk of data-loss is so small as to be a non-issue.
They can easily be replicated to tertiary storage, or incrementally backed up offline.
Taking the system out would require a network melt-down.
To compensate for that you spread link-aggregated connections across stacked switches. If a switch goes down, you're still operational. If a link goes down, you're still operational. The SAN interfaces are on their own VLAN, and the physical interfaces are dedicated to the Dom0. The DomU's are mapped to their own shared NIC.
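As a sketch of that network layout on an snv-era box (link names, VLAN ID, and addresses are all hypothetical):

```shell
# Aggregate two physical links, each cabled to a different stacked switch,
# so the loss of either a link or a switch leaves you operational.
dladm create-aggr -L active -l e1000g0 -l e1000g1 aggr0

# Put SAN traffic on its own VLAN riding the aggregation.
dladm create-vlan -l aggr0 -v 100 san100
ifconfig san100 plumb 10.10.0.21/24 up
```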
The Dom0, or either of its NICs, is still a single point of failure. So you make sure to have two of them. Applications mount HA-NFS shares for shared media. You don't depend on stupid gimmicks like live-migration. You just run multiple app instances and load-balance between them.
You quadruple your (thinly provisioned) storage requirements this way, but this is how you build a bullet-proof system using networked storage (both block (iSCSI) and filesystem (NFS)) for serving web-applications.
If you pin yourself to local storage you take on massive replication costs and commit yourself to very weak recovery options. Locality of your data kills you when there's a problem. You're trading effective capacity planning for panic fixes when things don't go so smoothly.
This is why it takes forever to provision anything at Rackspace Cloud, and when things go wrong, you're basically screwed.
Because instead of proper planning, they'd rather not have to concern themselves with availability of your systems/data.
It's not a walk in the park, but if you can afford to invest in your own infrastructure and skills, you can achieve results that are better in every way.
Sure, you might not be able to load a dozen high-traffic Dom0's onto these SAN systems, but that matters mostly if you're trying to squeeze margins as a hosting provider. Their problems are not ours...
When you move sqlite to NFS, for example, file locking probably won't work. There is nothing to tell you this.
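A cheap way to find out, instead of discovering it when the database corrupts: probe whether locks on the mount actually exclude each other. This is a sketch using util-linux `flock` and a hypothetical mount point; it defaults to a local directory so you can see the healthy behaviour first.

```shell
#!/bin/sh
# Probe whether advisory file locks actually exclude each other on a given
# directory. Point it at your NFS mount; it defaults to a local temp dir.
# On an NFS mount with broken locking, the second lock may be granted
# even while the first is still held.
probe_locking() {
    dir="${1:-$(mktemp -d)}"
    lock="$dir/locktest.lock"
    flock "$lock" -c 'sleep 2' &        # hold the lock for 2s in background
    sleep 0.5
    if flock -n "$lock" -c 'true'; then # try to take it from a 2nd process
        echo "BROKEN: second lock granted while first was held"
    else
        echo "ok: concurrent lock correctly refused"
    fi
    wait
}

probe_locking "$1"    # e.g. ./locktest.sh /mnt/nfs
```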
It sounds like you have experience making NFS work well, but I don't see how anything you wrote addresses this point. In fact I think you're just echoing some of the article's points about "enterprise planning". AFAICT you come from the enterprise world and are advocating overprovisioning, which is fine, but not the same context.
If NFS were implemented totally in userspace (like FTP), it would not hang the entire system when something breaks. On the other hand, it would be much slower than it is, therefore it would be unsuitable for a lot of use-cases where it is used now.
I think the old CVS quote by Tom Lord applies here:
  CVS has some strengths. It's a very stable piece of
  code, largely because nobody wants to work on it anymore.
These things will get much worse before they get better, and it's best to think of all these abstractions as a double-edged sword.
Regardless, I do agree that building your application today like it is a solved problem is the wrong way to do it.
That presumption assumes that the application is being used as the right tool to resolve the problem. And it also assumes that "the problem" is a finite and solvable item.
Yes. To make this a bit more concrete, if "the problem" is making distributed storage look and behave exactly like local storage, the CAP Theorem has something to say about its solvability.
We used to store and process all of our uploads from our Rails app on a GFS partition. GFS behaved like a normal disk most of the time, but we started having trouble processing concurrent uploads and couldn't reproduce the problem in dev.
It turned out that, for GFS to work at all, it had different locking semantics than a regular disk: every time you created a new file it had to lock the containing folder. We solved it by splitting our upload folder into 1000 sequential buckets and writing each upload to the next folder along... but it took us a long time to stop assuming it was a regular disk.
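The bucket scheme can be sketched like this. The directory layout and names (UPLOAD_ROOT, the counter file) are hypothetical; the point is just to round-robin writes across many parent directories so no one directory lock is hot.

```shell
#!/bin/sh
# Round-robin each upload into one of 1000 subdirectories so no single
# parent directory becomes a locking hotspot on a cluster filesystem.
UPLOAD_ROOT="${UPLOAD_ROOT:-$(mktemp -d)}"
COUNTER="$UPLOAD_ROOT/.counter"
mkdir -p "$UPLOAD_ROOT"
[ -f "$COUNTER" ] || echo 0 > "$COUNTER"

store_upload() {    # $1 = path of the incoming file; prints where it landed
    n=$(cat "$COUNTER")
    echo $(( n + 1 )) > "$COUNTER"   # real code must serialize this update
    bucket=$(printf '%03d' $(( n % 1000 )))
    mkdir -p "$UPLOAD_ROOT/$bucket"
    cp "$1" "$UPLOAD_ROOT/$bucket/"
    echo "$UPLOAD_ROOT/$bucket/$(basename "$1")"
}
```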
We now pay a lot more attention to the underlying stack. Even though you've outsourced hosting (cloud or managed physical servers), you still need to understand every component yourself.
Even for people who didn't use EC2, the existence of the platform caused more people to rethink their architectures to try to rely less on Important Nodes.
EBS is a step back from that philosophy and it's a point worth noting.
One of the great things this post does is enumerate some of the underlying reasons why relying on EBS will inevitably lead to more failures, and in ways that are harder and harder to diagnose.
Amazon doesn't use EBS itself, right? Isn't EBS something that AWS's customers nagged it into, against (what it considers) its better judgement?
Only our master and our slave backup server run on EBS. We aren't as write-oriented, so we can live with some of the limitations of EBS, but we've even considered moving our master MySQL and Mongo servers to ephemeral storage and just relying on our slave backup database server running on EBS (which we freeze and snapshot often). That server rarely falls far behind in relay updates.
They are typically available as /dev/sd[bcde]
In CentOS, implementing a RAID-0 block device across the 2 ephemeral disks that are present on an m1.large instance can be done via the following:
mdadm --create /dev/md0 --metadata=1.1 --level=0 --quiet --run -c 256 -n 2 /dev/sdb /dev/sdc
You'll then need to format the block device with your fs of choice. Then mount it from there.
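For completeness, that last step looks something like the following. The filesystem and mount point are just examples, and remember that everything on ephemeral disks vanishes with the instance.

```shell
# Format the striped md device and mount it. ext3 and /mnt/scratch are
# example choices, not requirements.
mkfs -t ext3 /dev/md0
mkdir -p /mnt/scratch
mount -o noatime /dev/md0 /mnt/scratch
```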
Think of a bridge between high performance disk and tape.
This may be true under Solaris. Since 2.5 Linux has had /proc/diskstats and an iostat that shows the average i/o request latency (await) for a disk, network or otherwise. For EBS it's 40ms or less on a good day. On a bad day it's 500ms or more if your i/o requests get serviced at all.
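Concretely, on Linux you can watch this yourself (the device name here stands in for whatever your attached EBS volume shows up as):

```shell
# Sample extended device stats every 5 seconds; the "await" column is the
# average time in ms an i/o request spends queued plus being serviced.
iostat -dxk 5 sdf    # sdf = example name for an attached EBS volume
```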
Edit: my point is that you can't hide unexpected/unknown events behind statistical models; we should know better, coming from CS.
Actually, it was discovered some time ago (http://openfoo.org/blog/amazon_ec2_underlying_architecture.h...) that EBS probably used Red Hat's open-source GNBD: http://sourceware.org/cluster/gnbd/
As Schopenhauer said, every man mistakes the limits of his own vision for the limits of the world, and these are people who've failed to Get It when it comes to distributed storage ever since they tried and failed to make ZFS distributed (leading to the enlistment of the Lustre crew who have also largely failed at the same task). If they can't solve a problem they're arrogant enough to believe nobody can, so they position DAS and SAN as the only possible alternatives.
Disclaimers: I'm the project lead for CloudFS, which is IMO exactly the kind of distributed storage people should be using for this sort of thing. I've also had some fairly public disputes with Bryan "Jackass" Cantrill, formerly of Sun and now of Joyent, about ZFS FUD.
The SAN solutions they migrated to are not ZFS based. Unless I'm mis-remembering (I read this a couple days ago) they were only using ZFS to slice LUNs.
Point is, you're taking pot-shots at ZFS when the main thrust appears to be: "It was hard to make iSCSI reliable. Once we did, by buying expensive storage-vendor backed solutions, we found it wasn't financially compelling."
They're a hosting provider. If it takes a replicated SAN pair (which is the wrong way to go about it BTW, though admittedly it's the way the storage vendors and their "appliance" mentality would have it done) to service just a pair of VM hosts (they're still using Zones right?) then it just didn't make sense money-wise for them. If they planned capacity to provide great performance, they weren't making enough money on the services for what they were selling them for.
That's not an "iSCSI is unreliable" problem. It's not a "networked storage is broken" problem. It's not a "networked storage is slow" problem. It's not even a "ZFS didn't work out" problem.
If you go out and spend major bucks on NetApp, not only are you going to have to deal with all the black-box-appliance BS, but it's going to cost a lot of money. A LOT. And DAS is going to end up cheaper to deploy and maintain, and your margins are going to be a lot higher.
DAS is the right choice for a hosting provider who wants to maximize their profits in a competitive space.
It's not the best choice for performance, availability or flexibility for clients, though. So you have to ask yourself what kind of budget you have to work with, and what goals are important to you.
BTW, there's _budget_, and then there's NetApp/EMC budget. Just because you need/want more than DAS can give you doesn't mean you need to tie your boat to an insane Enterprise grade budget.
As for "DAS is the right choice", that's just wrong on many levels. First, people who know storage apply "DAS" to both private (e.g. SATA/SAS) and shared (e.g. FC/iSCSI) storage, so please stop misusing the term to make a distinction between the two. Second, I don't actually recommend either. I don't recommend paying enterprise margins for anything, and I don't recommend more than a modicum of private storage for cloud applications where most data ultimately needs to be shared. What I do recommend is distributed storage based on commodity hardware and open-source software. There are plenty of options to choose from, some with all of the scalability and redundancy you could get from their enterprise cousins. Just because some people had some bad experience with iSCSI or DRBD doesn't mean all cost-effective distributed storage solutions are bad and one must submit to the false choice of enterprise NAS vs. (either flavor of) DAS.
In short, open your eyes and read what people wrote instead of assuming this is the NAS vs. DAS fight you're used to.