Things You Should Know About AWS

(highscalability.com)

130 points · jpmc · 12y ago · 50 comments

vosper12y ago· 6 in thread

This is great and all, but I'd love to see a community wiki / discussion area for AWS tips, tricks, and gotchas. Something like Quora crossed with Wikipedia, just for AWS.

Is anyone aware of such a thing? If not, any advice for starting one?

lifeformed12y ago

Sounds like something ideal for a StackExchange site.

vosper12y ago

My problem with StackExchange sites is that they're not that good as a reference. There are lots of great questions and answers (and lots of repetition thereof) but as a place to collate community knowledge about a topic they fall short.

Something like a wiki might be better, I suppose.

kylelibra12y ago

Surprised it doesn't exist already.

dups12y ago

This would be great. I actually was searching for something like this only a week or so ago.

In my current role I have been doing a lot with Elastic Beanstalk. Many cool features are non-obvious and/or not clearly documented, especially when it comes to extending its functionality with various hacks and recipes.

By now I have a fair accumulation of notes on this stuff, and would happily pour them into a wiki.

miles93212y ago

http://aws.amazon.com/forums

sgs137012y ago

+1 this. I haven't seen any type of self-serving moderation here, and using this forum you can quickly see who knows what they're talking about, and if they are affiliated with Amazon or not.

jwilliams12y ago· 5 in thread

"9) Use Virtual Private Cloud (VPC) from the start"

This is now a no-brainer. New registrations in certain zones will kick you into a basic VPC from the get-go.

The only inbound should be via an ELB (HTTP/HTTPS) and a non-DNS-resolvable SSH bastion/NAT host (m1.small is more than enough). Your bastion is the only host that is on the public Internet.

Setting up a bastion is reasonably straightforward. There is an AMI that will do it all for you. Just make sure you've got "src/dst check" turned off in the EC2 panel for that server. Just that tip can save you hours of hair-tearing.

Outbound is via the bastion, which you can lock down to certain protocols via VPC security groups. I limit to HTTPS (no HTTP!).
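A minimal sketch of the allow-list described above, written as plain Python data rather than real security-group API calls. The `Rule` type, the role names (`elb`, `bastion`, `app`) and the `violations` helper are all hypothetical and only illustrate the shape of the policy:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str        # hypothetical role name: "elb", "bastion" or "app"
    direction: str   # "in" = from the Internet, "out" = to the Internet
    port: int


# Policy from the comment: web traffic enters via the ELB, SSH via the
# bastion, and the only Internet-bound traffic is HTTPS through the bastion.
ALLOWED = {
    Rule("elb", "in", 80),
    Rule("elb", "in", 443),
    Rule("bastion", "in", 22),
    Rule("bastion", "out", 443),
}


def violations(rules):
    """Return the rules that fall outside the allow-list, in input order."""
    return [r for r in rules if r not in ALLOWED]


proposed = [
    Rule("elb", "in", 443),
    Rule("app", "in", 22),       # app host reachable directly: not allowed
    Rule("bastion", "out", 80),  # plain HTTP outbound: not allowed
]
for bad in violations(proposed):
    print("rejected:", bad)
```

Keeping the policy as a small explicit set makes it easy to review, which is most of the point of funnelling everything through two entry points.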

The bastion should use SSH keys only (no username/password). Put fail2ban in place on the bastion. You can also add a firewall rule that backs off after multiple failed SSH attempts. This all but nukes brute-force attacks. Also, make sure you patch regularly.

I go as far as to have separate keys for the bastion and then the hosts, but sensible policies should apply here (e.g. passphrase on your keys please).

Keep the bastion superuser login to a very select group. You can quickly remove employees' access by taking them out of the bastion. If you're in panic mode, you can turn off the bastion and isolate your network until you can regroup (similarly with the ELB for web-based attacks).

This is a pretty sound foundation for a secure setup.

jrgnsd12y ago

Can you point to a blog post / article that describes this in greater detail?

jwilliams12y ago

Funnily enough, working on one :-) It's long overdue though...

joevandyk12y ago

Why not use a vpn on the bastion?

jwilliams12y ago

I guess you could, but I don't see it being that different from SSH. One mild plus of SSH is you don't have addressing problems (e.g. your Wi-Fi network colliding with your AWS one).

bcoates12y ago

Isn't a VPN a much bigger hole than ad-hoc port-to-port SSH tunnels?

ideaoverload12y ago· 5 in thread

The description of the noisy neighbor problem (#6) lacks some depth.

The AWS noisy neighbor problem is very often misunderstood. CPU steal time under Linux does NOT mean that somebody is stealing your CPU. It simply means that you wanted to use the CPU and the hypervisor gave it to another instance. This may happen because you have exceeded your quota, or because the scheduling algorithm selected another pending instance at that moment and will give the CPU back to you a bit later. In both cases your instance ends up with its fair share.
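For anyone who wants to measure this rather than guess, here is a minimal sketch that computes the steal percentage from two samples of the aggregate `cpu` line in `/proc/stat`. The function name and the sample values are made up for illustration; on a real Linux box you would read the line twice, a few seconds apart:

```python
def cpu_steal_percent(before: str, after: str) -> float:
    """Percentage of elapsed CPU time reported as steal between two samples.

    Each argument is the aggregate "cpu ..." line from /proc/stat. The
    counters are, in order: user nice system idle iowait irq softirq steal
    (guest fields, when present, are already included in user/nice).
    """
    def counters(line):
        fields = line.split()
        if fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(v) for v in fields[1:9]]

    a, b = counters(before), counters(after)
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


# Made-up sample values, one interval apart.
t0 = "cpu 1000 0 500 8000 100 0 0 400 0 0"
t1 = "cpu 1100 0 550 8700 110 0 0 440 0 0"
print(f"steal: {cpu_steal_percent(t0, t1):.1f}%")
```

A non-zero value here only says the hypervisor ran something else while you were runnable; per the explanation above, it does not by itself prove you got less than your share.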

Great detailed explanations of steal time: https://support.cloud.engineyard.com/entries/22806937-Explan... and: http://www.stackdriver.com/understanding-cpu-steal-experimen.... The latter article is mentioned by OP but seems not to have been fully read/understood.

Why does killing and restarting an instance help? It likely moves the instance to a different hardware node with less active neighbors. When your neighbor is not active, your instance can use your neighbor's idle CPU cycles! You sort of become the noisy one. Still, the hypervisor will prevent this once the neighbor starts to fully utilize its CPU quota, and you are back to square one.

Amazon does not oversubscribe CPU according to their CTO: http://itknowledgeexchange.techtarget.com/cloud-computing/am...

Amazon specifically states that t1.micro instances do not guarantee CPU performance: "Micro instances are a very low-cost instance option, providing a small amount of CPU resources. Micro instances may opportunistically increase CPU capacity in short bursts when additional cycles are available. They are well suited for lower throughput applications and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance."

While CPU sharing is pretty well documented, the noisy neighbor problem still exists for network and disk resources shared by multiple instances on the same hardware node. The only way to detect these problems is to track throughput and loss rate for the network and I/O stats for disk.

You are guaranteed to avoid noisy neighbors CPU problem by using AWS dedicated instances: http://aws.amazon.com/dedicated-instances/

I work for an APM (Application Performance Management) vendor; I have no business praising AWS.

[Edit:spelling and clarity]

[Edit2: changed CPU scheduling description]

miles93212y ago

"You are guaranteed to avoid noisy neighbors CPU problem by using AWS dedicated instances" This is incorrect. You simply ensure that the only ec2 instances that can be a noisy neighbor are other instances of yours. Since most users investing in dedicated instances tend to put their instances to work, this sometimes has the net effect of reducing performance.

AYBABTME12y ago

There was recently [a paper][1] published in ACM Transactions on Computer Systems where the argument was that if your instance is scheduled out by the hypervisor during TCP traffic, the latency in ACKing packets would reduce the network throughput of instances, because of the slow start mechanism.

Given that distributed systems rely heavily on network latency and capacity, I found it very interesting to see the effect of a busy CPU propagate to slower network IO.

[1]: http://friends.cs.purdue.edu/pubs/SC10.pdf
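A rough back-of-the-envelope sketch of why delayed ACKs hurt: a sender with a fixed window can have at most one window in flight per round trip, so throughput is bounded by window / RTT. The window size and delays below are made-up illustration values, not figures from the paper, and the sketch ignores how slow start grows the window over time:

```python
def max_throughput_bytes_per_s(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput for a window-limited TCP sender."""
    return window_bytes / rtt_s


WINDOW = 64 * 1024   # assumed 64 KiB window
BASE_RTT = 0.001     # assumed 1 ms round trip inside a datacenter
SCHED_DELAY = 0.010  # assumed 10 ms wait for the hypervisor scheduler

fast = max_throughput_bytes_per_s(WINDOW, BASE_RTT)
slow = max_throughput_bytes_per_s(WINDOW, BASE_RTT + SCHED_DELAY)
print(f"no scheduling delay:   {fast / 1e6:.1f} MB/s")
print(f"with scheduling delay: {slow / 1e6:.1f} MB/s")
```

Under these assumptions the scheduling delay dominates the round trip, so the bound drops by roughly the same factor that the RTT grows.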

JulianWasTaken12y ago

Why does this happen even when the instance isn't using all of its CPU?

vacri12y ago

If you're sharing a CPU with a neighbour and you're using 3% and the neighbour is at 100%, the hypervisor needs to somehow squash that 103% into the 100% it has available - which means you lose a little bit of the scheduled time slots that otherwise would have been yours. If your neighbour drops down to 50%, there's now enough CPU time to handle both of you, so there's no 'steal'.
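The arithmetic can be sketched with a toy model. Assumption: the hypervisor scales every instance's demand down proportionally when total demand exceeds the core's capacity. Real schedulers (credits, caps, weights) are more involved, so treat the numbers as illustration only; `share_cpu` is a hypothetical helper:

```python
def share_cpu(demands, capacity=100.0):
    """Return (granted, steal) per instance under proportional scaling."""
    total = sum(demands)
    scale = min(1.0, capacity / total) if total else 1.0
    granted = [d * scale for d in demands]
    steal = [d - g for d, g in zip(demands, granted)]
    return granted, steal


for neighbour in (100.0, 50.0):
    granted, steal = share_cpu([3.0, neighbour])
    print(f"neighbour at {neighbour:.0f}%: you get {granted[0]:.2f}%, "
          f"steal {steal[0]:.2f}%")
```

In this model the small instance only loses a sliver of its 3% while the neighbour is pegged, and nothing at all once total demand fits within the core.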

ideaoverload12y ago

>Why does this happen even when the instance isn't using all of its CPU?

Probably the hypervisor does not assign the CPU every time it is requested, but it still manages to assign as much as needed because in the end there is some idle time left.

falcolas12y ago· 3 in thread

> Stripe your RDS disks for better performance

This is a fun hack to perform, but it opens you up to another problem: latency on any of the striped EBS volumes will lag out the entire striped array.

Attempts to mitigate this problem (including setting up RAID 10) work in the short term, but it really is easier to just purchase a guaranteed IOPS volume if you want to run a database on EC2.

DrJ12y ago

RDS is the hosted SQL database solution.

When RDS stripes your disks (I have seen it start when I jumped from 100GB to 300GB) you benefit from faster IOPS (I have seen 700 IOPS improve to 2500 IOPS).

PIOPS (Provisioned IOPS) is the way to go in the long run though.

falcolas12y ago

> you benefit from faster IOPS

Yes, you do.

However, if a single EBS volume that backs the RDS instance lags out (as EBS volumes in AWS are wont to do), you lose the IOPS.

The key point is that as the number of EBS volumes you depend on goes up, the chances of this happening go up correspondingly.

Striping EBS volumes works, but I would not depend on it for a production environment.
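To put rough numbers on that: if each volume is independently degraded with probability p at a given moment, the chance that at least one of n striped volumes is degraded is 1 - (1 - p)^n. Independence and the 1% figure are assumptions made up for illustration, not measured EBS behaviour:

```python
def p_any_volume_lagging(p_single: float, volumes: int) -> float:
    """Chance that at least one of `volumes` independent volumes is lagging."""
    return 1.0 - (1.0 - p_single) ** volumes


P_SINGLE = 0.01  # made-up: each volume is degraded 1% of the time
for n in (1, 2, 4, 8):
    print(f"{n} volume(s): {p_any_volume_lagging(P_SINGLE, n):.2%}")
```

For small p the result grows almost linearly with the number of volumes, which matches the point above: every extra volume in the stripe is another way for the whole array to stall.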

rpedela12y ago

Yeah, that is one reason we just use instance storage with database replication. Instance storage is generally faster and cheaper than EBS, especially if you have a "very high I/O" instance.

leftnode12y ago· 3 in thread

What is the 'aws' command in example #2? Is that a Ruby/Python/other script released by Amazon or a 3rd party?

jeffbarr12y ago

That's the AWS CLI (Command Line Interface), available at http://aws.amazon.com/cli/ .

oulipo12y ago

Is there a huge difference with s3cmd?

leftnode12y ago

Awesome, thanks for that!

anan0s12y ago· 3 in thread

Actually, I was wondering if anyone has experience with, or benchmarks for, Amazon's high-performance instances.

Has anyone used them already?

vosper12y ago

I am using a number of high-performance instances - cc2.8xlarge machines for EMR jobs, and cr1.8xlarge machines for analytics databases. From my testing:

- Running EMR jobs on cc2.8xlarge machines as spot instances is a great way to get a LOT of compute power very cheaply. Because our jobs are periodic, we run both Core and Task nodes as spot instances and simply retry the job if our machines get terminated. I did a lot of benchmarking and found that a small number of cc2.8xlarge machines outperforms, and is cheaper than, a large number of lesser instances (and I tried most of the lesser machines). In us-west-2 it's very uncommon to lose our instances, unlike us-east-1, which has major price fluctuations (this is true for all types of spot instance).

- The cr1.8xlarge has fantastic performance relative to the rest of the AWS machines. It's also very expensive compared to the cost of hardware or a similar solution on another cloud provider. Since we're fully integrated with AWS and don't want to run our own hardware, we're sucking up the cost for now, but it's definitely a sore point in our budget. The cr1.8xlarge is also all-round a better machine than the hi1.4xlarge, which has a lot of disk but is pitiful in terms of CPU.
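The "simply retry the job" approach from the first point can be sketched as below. `InstancesTerminated`, `run_with_retries` and `flaky_job` are hypothetical names for illustration; how you actually detect a lost cluster and resubmit the job depends on your tooling:

```python
class InstancesTerminated(Exception):
    """Hypothetical signal that spot capacity was reclaimed mid-job."""


def run_with_retries(job, max_attempts=3):
    """Call job() until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). The job must be safe to re-run from
    scratch, which is why this only suits periodic, restartable work.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except InstancesTerminated:
            if attempt == max_attempts:
                raise


# Stand-in job that loses its instances twice before finishing.
failures_left = [2]


def flaky_job():
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise InstancesTerminated()
    return "done"


print(run_with_retries(flaky_job))
```

Capping the attempts matters: if spot prices spike for hours, you want the job to fail loudly rather than loop forever.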

mdellabitta12y ago

One thing to note with EMR: you still pay 25% of the on-demand price as overhead to use EMR. If you're bringing up and turning off clusters all the time, it's probably worth it, but you might want to look into using Whirr instead.
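A quick sketch of what that overhead does to the per-instance bill. The 25% figure is taken from the comment above, and both prices are made-up placeholders rather than real AWS pricing:

```python
def emr_hourly_cost(instance_price, on_demand_price, emr_fee_fraction=0.25):
    """Per-instance hourly cost: what you pay for the instance plus the EMR fee.

    emr_fee_fraction is the 25% figure from the comment, not a quoted price.
    """
    return instance_price + emr_fee_fraction * on_demand_price


ON_DEMAND = 2.40  # made-up on-demand price, $/hour
SPOT = 0.30       # made-up spot price, $/hour

print(f"on-demand + EMR: ${emr_hourly_cost(ON_DEMAND, ON_DEMAND):.2f}/h")
print(f"spot + EMR:      ${emr_hourly_cost(SPOT, ON_DEMAND):.2f}/h")
```

With numbers like these the fixed EMR fee ends up being most of the spot bill, which is presumably why running Hadoop yourself starts to look attractive for long-lived clusters.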

anan0s12y ago

great stuff!

thanks for sharing.

ebaxt12y ago· 1 in thread

An important gotcha regarding CloudFormation: there is no way to recover from a failed rollback (except maybe to contact Amazon support). So it's basically only safe for initial setup of resources.

ojiikun12y ago

Failed rollbacks are definitely an exception, not a norm, and something support should know about!

darkarmani12y ago· 1 in thread

Does anyone understand the point they are making in #9 about VPC?

Are they suggesting using HAProxy in your public subnet and ELBs in your private subnets? Is this their reference to avoiding using a NAT box (actually PAT)? Don't you still need an HAProxy box in each of your AZs?

falcolas12y ago

I'm not sure, but I read it as ELB over your HAProxy instances (for high availability), and use HAProxy to do the intelligent load balancing (which would solve the "sticky" argument put forth by the author).

talonx12y ago

These are more "tricks on optimizing AWS usage" than things to know "about" AWS.

oakwhiz12y ago

>Use ZFS and RAIDZ with EBS

That's a really cool idea, and I'm curious to see what kind of performance losses will occur during normal usage.

ademarre12y ago

I don't hear enough people talking about vendor lock-in with AWS, or any other cloud provider for that matter. Too many of us are building our company's infrastructure on AWS with little regard for the cost of switching.

I try to stick with Amazon's IaaS offerings, only employing PaaS products when they offer considerable advantages over anything I could roll economically on bare infrastructure.

Ask yourself, "in terms of time or money, what would it cost me if this cloud product were discontinued without notice?"

AWS might lose a patent battle, get shut down by the government, or get crippled by hackers or a long-term service interruption. If that happened, how long would you be down while you struggle to get up and running at another provider? What if Amazon raises their prices to the point where you want to change cloud providers? Would the cost of switching be prohibitive?

victor_haydin12y ago

One more nice post about this topic: http://www.elekslabs.com/2013/11/aws-10-things-youre-probabl...

simonebrunozzi12y ago

It looks like this post was partly inspired by a recent presentation that I made: http://www.slideshare.net/simone.brunozzi/5-thingsyoudontkno...
