Is Amazon's cloud service too big to fail?

(fnlondon.com)

182 pointsazureel8y ago156 comments

83 comments · 20 top-level

dalbasal8y ago· 13 in thread

This is (I was surprised) a pretty good article. Financial services are regulated and based on recent experience, they're concerned with systemic risk. Most industries do not have anyone responsible for worrying about this kind of thing.

It seems reasonable to start worrying about the fragility potentially introduced by these massive internet infrastructure companies.

peteretep8y ago

If you wanted to blow something up to make the west suffer, an AWS datacenter would probably be a pretty good target. I wonder at what point that becomes a legitimate national security concern, and the government steps in to provide protection.

earthboundkid8y ago

Yes, this goes to show that there are ~zero terrorists in the US. The US has an almost infinite supply of soft targets, but since 9/11 there have been no attacks of any consequence. There have been a few minor attacks like Boston bombers, but those have all been laughably amateurish.

The US has no terrorism problem. In 1979, the Irish killed the Queen's uncle in law, and in 1984, they blew up a hotel where Thatcher was staying. That's a serious terrorism problem! What the US has is nothing by comparison. We were unlucky on 9-11, and ever since we've been distorting our foreign policy out of unreasonable fear.

gaius8y ago

> legitimate national security concern, and the government steps in to provide protection

There are a LOT of far softer targets that go unprotected. A terrorist attack on a sewage plant for a major city would be far more devastating than knocking out a few websites.

erentz8y ago

A great soft target is the electricity grid. The lead times for manufacturing large transformers, for example, can be many months, and they usually don't have as much redundancy as you would expect. Picture a semi-coordinated attack against the grid, using dynamite to take out a number of pylons along main HV lines, which are easily accessible because they're out in rural areas. Combine that with a number of sticks thrown over fences at transformers, and you could massively impact the US economy for months.

mediascreen8y ago

Wouldn't you have to blow up at least all the datacenters in a region to make an impact?

lithos8y ago

Amazon's own data centers are so tiny (compared to most) that taking one out would almost be a waste of time. Even one of their large colocations would probably be a waste.

If you wanted that type of destruction, and be noticed, you would need a city leveler type event in a data center heavy area.

Most people wouldn't even notice. They'd have to blow up a region and at that point, you have bigger problems.

tyfon8y ago

I really hate the "too big to fail" meme and I strongly agree with Bernie in that if you are too big to fail you are too big to exist.

That should be the priority.

I don't know what Bernie's specific plans were/are but it's not unusual. At least, at first blush, this is a lot of people's reactions. Too-big-to-fail = danger = make it smaller.

Realistically, this has not been the solution implemented (in the EU & US, at least). In the EU, it is even more crucial as the "solutions" to this problem are applied to state finances as well as financial institutions.

In terms of policies, there are two competing approaches: (1) Reduce the size of "too-big-to-fail" institutions. (2) Regulate them more heavily (or some other strategy) so that they will not fail. In the EU, this is being applied to states, not just financial institutions. Rules that (supposedly) reduce catastrophic risk.

Almost all serious policy proposals are in the no. 2 category. Tighten regulation, reduce the risk of failure. Tighter regulation leads to stronger incumbents and larger average company size, so by doing 2, you are probably doing the opposite of 1.

As I said, I don't know what Bernie's proposal is or how mature it is as a policy (as opposed to a politician's statement). It would be notable if a left wing politician proposed loosening bank regulations, though definitely not impossible or unreasonable.

TrickyRick8y ago

Agreed, it seems like the software approach as well. You wouldn't want a class or a piece of code to be "too big" to fail, you'd refactor it into smaller pieces which can be overviewed more easily.

awkwarddaturtle8y ago

> I really hate the "too big to fail" meme and I strongly agree with Bernie in that if you are too big to fail you are too big to exist.

That seems like the most reasonable response. And yet, since the great recession, our policy has been "make 'too big to fail' even bigger".

The problem is that the banks have become too powerful for anyone to challenge. A Teddy Roosevelt type of political leader can't exist today.

658278y ago

Why is every arrangement of characters now a "meme". That word has moved beyond devoid of meaning, at this point it's like a black hole of nothingness of a word.

pferde8y ago

> Financial services are regulated and based on recent experience, they're concerned with systemic risk. Most industries do not have anyone responsible for worrying about this kind of thing.

I'd say that most industries do not have anyone responsible for worrying about it high enough in the management chain.

smegel8y ago· 12 in thread

Is it possible for AWS to have a multi-region outage - as in is there anything connecting them that could bring them all (or several) down at once?

(Apart from the result of a botched patching or update to the core software stack that was done worldwide at the same time and hopefully never happens).

askvictor8y ago

A cascading electrical grid failure? I don't know if there are any interconnects between the regions with the DC's, but if there were that might be a concern. Though at that stage, presumably most of the US is without power, hence not so much need for AWS.

I think each DC has at least two power sources and probably a backup generator. I think that's why cloud providers have been so reluctant to open in Africa, diversified power is apparently a problem.

smegel8y ago

Well I guess a nationwide power outage will have bigger implications than Netflix going down...

TrickyRick8y ago

Just bringing down us-east was enough to cause quite a bit of trouble recently: https://aws.amazon.com/message/41926/

mrep8y ago

That would go against a core principle at AWS, which is to have every region completely isolated.

Also, deployments are designed to be exponential and no region should ever have a cross region dependency.

dijit8y ago

Unless you work at Amazon, you can't know that.

It looks very separated on the outside, but I've worked in so many companies that have appeared incredibly competent externally but have "snowflake" servers which keep things ticking over. Given Bezos's treatment of workers, I have absolutely no confidence that everything is as cleanly engineered as they claim.

dantiberian8y ago

My memory may be wrong, but I thought several regions were affected by the recent S3 outage? Also, I suspect that if us-east-1 went down completely that that would have a debilitating effect on the others.

Well given that you can manage services across all regions from a single web interface, I assume someone compromising this web interface would be able to control and bring stuff down across all regions.

jrimbault8y ago

Do you think there's a global world admin web console ?

blazespin8y ago

Yes, there are ways to bring down all of their arch at once, but you'd have to get through a lot of barriers to do it.

A major solar flare and coronal mass ejection? It wouldn't just be Amazon that was affected, though.

HatchedLake7218y ago

No. Hence them rolling out new features region by region.

barsonme8y ago· 11 in thread

Even at a smaller scale it is a little nerve-wracking to be so reliant on one provider. If AWS tanks there's a fair amount of code that'd need to be changed just to switch over to Azure or GCE. Failover with, e.g., email providers is easy enough, but the entire cloud stack (for lack of better terms) is a completely different ballgame.

PaulKeeble8y ago

It is one of the issues with choosing a cloud provider and adopting its stack. They are hoping that, once you've bought into their way of doing things, switching to a competitor that offers a similar service more cheaply will be too costly. Lock-in used to be considered bad, but something changed with cloud providers, and ops/developers don't seem to care as much anymore.

wiz21c8y ago

Maybe because pricing by, say, Amazon is published on their website and is therefore the same for everyone? Whereas before, when you were with one supplier, they could set a specific price for you and leverage their position to make you pay more? Dunno...

unkown-unknowns8y ago

There are some open source implementations of parts of the APIs of cloud providers that might help someone a bit when trying to migrate. For example, Minio [1] [2] implements the AWS S3 v4 API.

[1]: https://news.ycombinator.com/item?id=12392081

[2]: https://minio.io/
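
A minimal sketch of how an application might keep that migration path open: code against a small object-store interface rather than one vendor's SDK. Everything here (`ObjectStore`, `InMemoryStore`, `archive_report`) is a hypothetical name made up for illustration, and the in-memory backend merely stands in for whichever S3-compatible service would sit behind the interface in practice.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface the application depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend for illustration; a real deployment would wrap
    an S3-compatible client behind the same two methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: ObjectStore, name: str, body: str) -> str:
    """Application code only sees ObjectStore, never a vendor SDK."""
    key = f"reports/{name}.txt"
    store.put(key, body.encode("utf-8"))
    return key
```

Swapping providers then means writing one new class with `put` and `get`, assuming the application really does restrict itself to that narrow surface.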

I warn other developers at my company about this. When new projects spin up they're often very excited about using new Amazon services and will make any excuse to choose an AWS product over a stable open source solution. If I were a manager, I'd be very worried over the vendor lock-in.

I don't understand the preference for AWS over open source in many cases. Their services are "reliable", but they often have minute restrictions that will eventually bite you. You also end up having to pay for something you could get for free. Why use SNS/SQS when there are free pubsub/message buses out there? Most of the other devs justify this with the argument of not having to maintain the software themselves. "But RabbitMQ might crash! We don't have to worry about that with AWS!"

Anyway, I typically minimize the AWS services I use (S3, EC2, ECS) so I don't dread the day AWS blows up or, more likely, some VP or exec says we're moving to GCP/Azure because we got a better deal.

kevan8y ago

>Why use SNS/SQS when there are free pubsub/message buses out there?

Free is never really free. There's always a tradeoff in engineering time and money when you choose to run your own stack instead of paying to use a stable, well-established service. Oftentimes running your own will be cheaper overall, but you have to do that cost-benefit comparison for yourself.

You're also forgetting that if you set up something on your own you also have all the hardware concerns as well. You need to procure hosts, provision them properly, deploy them, monitor them, scale them, fix them. That infrastructure cost doesn't go to zero but it is significantly reduced using a cloud provider.

> "But RabbitMQ might crash! We don't have to worry about that with AWS!"

I can confirm that not only can RabbitMQ get into an unusable state, it will do so extremely rapidly and with little warning unless you sit an engineer or two on it to monitor and manage the incoming/dead letter rates.

tylersmith8y ago

AWS provides a lot of features that are exclusive to their platform and can't be drop-in replaced on other providers like Azure or GCE. ELB, EFS, S3, ASGs, etc. They'd need to be replaced at the application level for other platforms. That could be a huge commitment for a decent sized system.

I don't know about ELB, EFS and ASG but:

- S3 has a public protocol and many 3rd party providers support it (OpenIO, Scality, Ceph, Minio, etc),

- EFS could be replaced with something like DRBD or GlusterFS, or DigitalOcean's block storage or Google Cloud's networked disks.

- ELB could be replaced easily with similar services from other providers [1] if you use Kubernetes (I don't know if all have a LoadBalancer type though)

I would be more concerned about firewall/vpc rules, because I have no idea how those could be migrated without risk of forgetting some. Lock-in seems not that high in the end though and even less so if you use an open source container orchestration stack because they abstract most of these things away.

[1] https://kubernetes.io/docs/tasks/access-application-cluster/...

awkwarddaturtle8y ago

It's amazing how the promise of "decentralized" internet has turned into centralized datacenters.

P2P networks, each computer being a "data store" on the internet, no one entity can control data, etc to modern day centralized cloud where a couple of players control so much.

There has been a cultural shift. In the early 2000s, the idea of storing your data somewhere else would have been weird. But now, people don't care about keeping their data on apple/google/etcs data centers.

I think it has to do with the fact that computer/internet illiterate people are now the majority whereas in the 90s/early 2000s, it was generally the computer literate on the internet.

kirykl8y ago

I was pretty befuddled when my company IT switched from self hosted storage to commercial cloud accounts for incredibly sensitive info.

I think the reasoning was cloud accounts are easier for the masses than mapping a drive and accessing over VPN

AmIFirstToThink8y ago· 5 in thread

If your architecture means your system goes down if AWS is down, then the question becomes can you replace AWS with something better that you can build, have means to build, have time to build, can keep running, can get enough momentum in terms of sheer size of customer base to fund the upkeep of the platform?

If you can't build/run a better AWS replacement then it's a mute point, isn't it?

Then the question turns into: if you can't build a better AWS, can you architect your application to handle AWS failures? AWS itself lets you handle many kinds of failures at AZ/DC level. Are you using that? For global AWS outages, can you have a skeleton, survival-critical system running on GCP or Azure?

Have you thought about outages that would be out of your control and out of AWS's control e.g. malware, DDoS, DNS, ISP, Windows/Android/iOS/Chrome/Edge zero day? How are you going to handle outages due to those issues?

If you are prepared to handle outages (communication, self-preservation, degraded mode, offline mode) then can a serious AWS outage be managed just like those outages?
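
As a rough sketch of the "skeleton system plus degraded mode" idea above: try each backend in priority order and fall back to a degraded response when all of them fail. The function name `call_with_failover` and the callables passed to it are hypothetical, and real failover also needs health checks, timeouts and data replication that this does not show.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    backends: Sequence[Callable[[], T]],
    degraded: Callable[[], T],
) -> T:
    """Return the first successful backend result, else the degraded one.

    `backends` is ordered by preference, e.g. primary cloud first and a
    skeleton deployment on a second provider after it (an assumption
    made for this example).
    """
    for backend in backends:
        try:
            return backend()
        except Exception:
            # This backend is unavailable; move on to the next one.
            continue
    # Nothing answered: serve a reduced, offline-capable response.
    return degraded()
```

The point of the shape is that an AWS outage takes the same code path as any other outage, which is what the last question above is getting at.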

savoytruffle8y ago

irrelevant points are "moot", not "mute"

darkr8y ago

I think you mean "moo".

It's like a cow's opinion, you know, it just doesn't matter. It's "moo".

c228y ago

Moot points aren't really irrelevant, on the contrary, they're perhaps the most relevant as non moot points are already settled.

AmIFirstToThink8y ago

TIL

Thanks. :)

pavel_lishin8y ago

I think most people probably don't need an entire AWS replacement, though. If you're running an e-commerce platform, you could run it on a VPS. It would be harder - you'd have to do your own server management, figure out your own deploy strategy, run your own load balancers, do your own security, backups, etc. - but that's not "rebuild AWS from scratch" harder.

cm21878y ago· 4 in thread

What would be great is the equivalent of the ACME protocol for cloud service providers. That will take a while and shouldn't happen until the offering matures and stabilises. But in an ideal world you wouldn't tie your application to a specific cloud provider. You should be able to lift and shift to another provider.

Which I think is a merit of using VMs as opposed to individual services.

gaius8y ago

> But in an ideal world you wouldn't tie your application to a specific cloud provider.

You can do that easily if you just treat clouds merely as hosted hypervisors and think entirely in terms of VMDKs. But this doesn't make commercial sense to do at least in the short term - you need to utilise the layered services you are paying for anyway or you might as well just run your own DC.

icebraining8y ago

It still makes sense for its elastic properties (from which EC2 got its name). You can't rent half a DC for an hour, but you can spawn generic instances from VMDKs on different providers with a fairly small abstraction layer.

ACME protocol?

cm21878y ago

Developed by Let's Encrypt; it helps solve the too-big-to-fail problem with CAs. When CAs adopt it (which looks like it may happen), you will have a common protocol to create and renew certificates across CAs.

cjsuk8y ago· 4 in thread

This does worry me. If there is a shortage of resources suddenly or a DC fire that takes out a region, then what?

We have contingency against this via our own infrastructure but I worry about organisations who don't have any.

> Amazon EC2 is hosted in multiple locations world-wide. These locations are composed of regions and Availability Zones. Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Resources aren't replicated across regions unless you do so specifically.

Source: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-reg...

kondro8y ago

One region isn't going to be affected by fire. And AWS have dozens of regions. They're even managed as separate units by separate people. You'll notice there's never been a large, multi-region outage of AWS.

cjsuk8y ago

Yet.

Some of the traditional apps we host are vulnerable to hypervisor failure, be that rack, DC or region.

>This does worry me. If there is a shortage of resources suddenly or a DC fire that takes out a region, then what?

Then some businesses will be out for a few hours / days.

No big deal.

From WWII to 9/11 to Katrina (and whatever regional stuff we have), we have been through much worse than that in modern history.

martyvis8y ago· 3 in thread

It took me three reads of the first couple of paragraphs to realise that "snowball" and "snowmobile" were actually hardware products that you can touch. Tech news publishers need to do a jargon check and use appropriate punctuation, formatting or something to call out terms that 90% of readers would not have come across.

cdolan8y ago

Maybe it's because I saw your comment before reading, but I had no problem understanding the first few paragraphs.

The author states that a "snowball" is a grey suitcase with 50tb of HDD space inside, and a "snowmobile" is a massive 18 wheeler with what I would assume is petabytes of storage.

It's probably because it's 5 in the morning here :-) But looking at Amazon's own references to the appliances, they always capitalise the name. I guess what I can only assume was intentional obscuring of what are probably trademarks made it read poorly to me.

pavel_lishin8y ago

Really? It's explicitly stated in the very first paragraph, in the second sentence:

> Not the lumps of mush and ice that children chuck at each other, but Amazon’s portable information storage devices, big grey suitcases that hold huge amounts of data.

Capitalizing it might have helped, though.

blazespin8y ago· 3 in thread

The solution is pretty simple: AWS/Azure need to provide on-premise versions of their cloud. You'd probably get stuck with a particular version, but better than nothing.

That's pretty much what Azure Stack is:

https://azure.microsoft.com/en-gb/overview/azure-stack/

There might well be a commercial niche for providing Azure Stack hosting in non-Microsoft data centers.

I think there is a massive market for 100% cloud-compatible local deployments. In my personal experience every .Net shop I've seen would love to be incorporating more Azure goodness locally, but can't as they're cloud specific techs which bump into the realities of deployment and maintenance.

Personally, I think MS crapped the bed a little by taking Azure Stack off of commodity hardware and onto a combined hardware/software solution. Being able to deploy Azure-compatible solutions piece-meal locally would be a massive boon to governments, healthcare operations, and anyone working on a more thorough migration to the cloud.

Most of the EU, for example, has privacy regulation that makes cloud hosting impossible in some situations. Having a 'local Azure' would make it highly reasonable to have all apps architected around Azure's components and technology. Without the local deployment though you're kinda stuck with each foot in a different canoe... Hybrid infrastructures are highly favorable to DevOps and multi-party development scenarios.

nepotism20188y ago

Openstack is still going strong

galkk8y ago· 2 in thread

When I was working as a contractor for one of the big banks, whose dev was concentrated in Canary Wharf, they weren't able to successfully complete disaster recovery testing on their primary database cluster for 2 years in a row. I just don't remember whether it was department-wide or bank-wide.

Basically, every 6 months DR testing was failing and it was accepted as harsh reality. After seeing how they work inside, I don't think that moving their infrastructure to AWS/Azure/Google is the worst that could happen.

disc: Currently working at Amazon, but not at AWS.

Kenji8y ago

Why did they not redo the DR testing until it worked? Normally you iterate tests and bugfixes over and over until it works. Otherwise, what's the point of the test? Being confident that your stuff does not work at all?

galkk8y ago

It was bank-wide activity with defined schedule etc.

fovc8y ago· 2 in thread

I think about this problem every now and then for my own business, but not sure what the right answer is. Supporting multiple clouds requires more involved management of some pieces of infrastructure (e.g., DNS + healthchecks, DB replication), which introduces another point of failure.

How do people who need to have more nines of availability manage this issue with cloud providers? (EC2 and RDS promise 3.5 nines per AZ, but I imagine outages are somewhat correlated across zones)

dastbe8y ago

For people who need more 9s of availability on a single cloud provider, you have to start going multi-region. AWS takes region isolation/independence very seriously, and along with geographic independence that gives you effectively two entirely independent clouds which just so happen to have the exact same APIs. Some of the (really great) Netflix blog posts[0] have talked about multi-region services.

If you do go multi-cloud, I would be wary of picking regions that are located very close to each other. While you'll obviously get independent code and (likely) independent deployments, you're still susceptible to issues correlated with the physical location.

[0] https://medium.com/netflix-techblog/global-cloud-active-acti...

Very, very few businesses should be architecting to ensure higher than 99.95% availability, IMO. (Less than 4.5 hours of downtime per year.)

Users are patient enough to give you a pass if you're down that amount (especially if you're down that amount while 1/3rd of the internet is also down).

Our largest e-commerce retail site does over $1BB/yr in fairly high-margin sales and still targets "only" 99.95% availability (generally it exceeds that with actual results, but we don't target higher than that). It's a hybrid of on-prem and cloud services backing that, migrating towards the cloud, but will never be 100% cloud as we own and run factories with on-prem equipment.

(I know you asked "how" and I answered "whether", but I thought it relevant.)
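
For anyone who wants to check the arithmetic in this subthread, here is a small sketch that converts an availability target into a yearly downtime budget and estimates combined availability across redundant zones. The second function assumes zone failures are fully independent, which is exactly the assumption questioned above (outages may be correlated), so treat its result as an optimistic upper bound. Function names are made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(per_zone: float, zones: int) -> float:
    """Availability of `zones` redundant zones, ASSUMING independent
    failures: the service is down only when every zone is down."""
    return 1.0 - (1.0 - per_zone) ** zones


if __name__ == "__main__":
    # 99.95% leaves roughly 4.38 hours of downtime per year.
    print(round(downtime_hours_per_year(0.9995), 2))
    # Two independent zones at 99.95% each.
    print(combined_availability(0.9995, 2))
```

Correlated failures push the real number below what `combined_availability` reports, which is one reason the multi-region advice above stresses physical separation.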

sharemywin8y ago· 2 in thread

Hasn't anyone heard of disaster recovery plans? I used to work at a medium-sized insurance company, and every year we had a project to update our disaster recovery plans, including for our main in-house datacenter going down. If it was a critical system, you'd better have a plan to get it back up in about 4 hours. And those were business-critical; we didn't have any life-critical systems.

YawningAngel8y ago

What's the disaster plan for "DynamoDB doesn't exist any more"? There is literally nothing else like it in the world. I don't know of an idiot proof queue system that can handle the scales SQS can take either.

darkr8y ago

Cassandra?

Rabbit?

jondubois8y ago· 1 in thread

That's why I think containerization and orchestration will be useful; open source orchestrators can standardize the infrastructure and make switching seamless. That way the infrastructure remains a commodity.

lukeholder8y ago

Except you can't containerize the huge amounts of data you are storing can you?

zeep8y ago· 1 in thread

If Amazon's cloud service were to disappear today, it would be chaos for a week or two, but most people should recover (as long as they have backups).

pavel_lishin8y ago

I'd wager most people's database backups live in AWS as well.

Plus, some people have huge, huge datasets. It could easily take weeks to migrate to, say, GCE, or to your own hosted servers. In the latter case, it would also necessitate a pretty large up-front investment.

jpalomaki8y ago

This goes beyond having a plan B for hosting your own stuff somewhere else. Think about all the 3rd party services you are depending on. Then think about how many dependencies those services have. How many trace back to Amazon on some level?

The connections that could cause problems may not be obvious. For example, a network provider running into trouble because a ticketing or monitoring system that depends on Amazon does not work. Or a hardware supplier not being able to ship spare parts for your on-premise SAN because its logistics company runs into trouble due to issues at Amazon.

forkLding8y ago

Personally, as a dev, I find AWS's service falls somewhere between PayPal (shit, not sure why they're popular) and Stripe (damn, that was fast and easy), having used them both.

Their support is all right, although you often have to pay for it. But the AWS docs are atrocious and remind me of university textbooks written by professors who like creating pseudo-scientific-sounding jargon. Mixed with their huge array of features, that makes AWS quite uncomfortable to use, even for people with intermediate AWS experience (the kind who have built some apps with AWS before).

I can see that there could be more specialized services like Firebase (which is built on Google Cloud) that should be built on AWS for the users. Firebase is a breeze to use and very responsive and I've used it to build real-time chat apps in a couple days.

acd8y ago

Cloud services are concentrated by nature, built with the same cloned DNA. Of course that is a systemic risk, with so much IT concentrated in fewer physical locations running on the same code.

Think Cloned bananas vs fingers disease but computers. http://www.bbc.com/news/uk-england-35131751

nogbit8y ago

Yes and no. By design it's not big, it just seems big. With relative RPO and RTO anyone can failover to other regions. And if you aren't leveraging multiple AZ's within a single region you need to rethink how you are using AWS.

The very nature of AWS requires Amazon to build in capabilities to handle failover. But, as they say at Amazon, "everything fails, always".

jriot8y ago

Nothing is too big to fail. Society needs to be able to adapt and maintain a level of patience during transition times i.e., be patient when Amazon's cloud fails to a new tool.

For articles where the headline is a question, the answer is always "no".

No.

https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...
