This is a fun hack to perform, but it opens you up to another problem: latency on any of the striped EBS volumes will lag out the entire striped array.
Attempts to mitigate this problem (including setting up RAID 10) work in the short term, but it really is easier to just purchase a guaranteed-IOPS volume if you want to run a database on EC2.
When RDS stripes your disks (I have seen it start when I jumped from 100GB to 300GB), you benefit from faster IOPS (I have seen 700 IOPS improve to 2500 IOPS).
PIOPS (Provisioned IOPS) is the way to go in the long run though.
Yes, you do.
However, if a single EBS volume that backs the RDS instance lags out (as EBS volumes in AWS are wont to do), you lose the IOPS.
The key point is that as the number of EBS volumes you depend on goes up, the chance of this happening goes up correspondingly.
Striping EBS volumes works, but I would not depend on it for a production environment.
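To put a rough number on the "more volumes, more risk" point: if you assume each volume independently has some probability p of being degraded at any given moment, the chance that at least one volume in an n-volume stripe is degraded is 1 - (1 - p)^n. Both the independence assumption and the value of p below are made up for illustration; they are not measured AWS figures.

```python
def p_stripe_degraded(p_single, n_volumes):
    """Chance that at least one of n independent volumes is degraded."""
    return 1.0 - (1.0 - p_single) ** n_volumes


if __name__ == "__main__":
    # p_single = 0.01 is an illustrative guess, not an AWS statistic.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} volumes: {p_stripe_degraded(0.01, n):.1%}")
```

Since one slow volume stalls the whole stripe, this is effectively the chance that the array as a whole is lagging.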
Is anyone aware of such a thing? If not, any advice for starting one?
Something like a wiki might be better, I suppose.
In my current role I have been doing a lot with Elastic Beanstalk. Many cool features are non-obvious and/or not clearly documented, especially when it comes to extending its functionality with various hacks and recipes.
By now I have a fair accumulation of notes on this stuff, and I would happily pour them into a wiki.
This is now a no-brainer. New registrations in certain zones will kick you into a basic VPC from the get-go.
The only inbound should be via an ELB (HTTP/HTTPS) and a non-DNS-resolvable SSH bastion/NAT host (m1.small is more than enough). Your bastion is the only host that is on the public Internet.
Setting up a bastion is reasonably straightforward. There is an AMI that will do it all for you. Just make sure you've got "src/dst check" turned off in the EC2 panel for that server. Just that tip can save you hours of hair-tearing.
Outbound is via the bastion, which you can lock down to certain protocols via VPC security groups. I limit to HTTPS (no HTTP!).
The bastion should use SSH keys only (no username/password). Put fail2ban in place on the bastion. You can also add a firewall rule that backs off after multiple failed SSH attempts. This all but nukes brute-force attacks. Also, make sure you patch regularly.
I go as far as to have separate keys for the bastion and then the hosts, but sensible policies should apply here (e.g. passphrase on your keys please).
Keep the bastion superuser login to a very select group. You can quickly remove employees' access by taking them out of the bastion. If you're in panic mode, you can turn off the bastion and isolate your network until you can regroup (similarly with the ELB for web-based attacks).
This is a pretty sound foundation for a secure setup.
That's a really cool idea, and I'm curious to see what kind of performance losses will occur during normal usage.
I try to stick with Amazon's IaaS offerings, only employing PaaS products when they offer considerable advantages over anything I could roll economically on bare infrastructure.
Ask yourself, "in terms of time or money, what would it cost me if this cloud product were discontinued without notice?"
AWS might lose a patent battle, get shut down by the government, or get crippled by hackers or a long-term service interruption. If that happened, how long would you be down while you struggle to get up and running at another provider? What if Amazon raises their prices to the point where you want to change cloud providers? Would the cost of switching be prohibitive?
Has anyone already used this?
- Running EMR jobs on cc2.8xlarge machines as spot instances is a great way to get a LOT of compute power very cheaply. Because our jobs are periodic, we run both Core and Task nodes as spots and simply retry the job if our machines get terminated. I did a lot of benchmarking and found that a small number of cc2.8xlarge machines outperforms, and is cheaper than, a large number of lesser instances (and I tried most of the lesser machines). In us-west-2 it's very uncommon to lose our instances, unlike us-east-1, which has major price fluctuations (this is true for all types of spot instance).
- The cr1.8xlarge has fantastic performance relative to the rest of the AWS machines. It's also very expensive compared to the cost of hardware or a similar solution on another cloud provider. Since we're fully integrated with AWS and don't want to run our own hardware, we're sucking up the cost for now, but it's definitely a sore point in our budget. The cr1.8xlarge is also an all-round better machine than the hi.4xlarge, which has a lot of disk but is pitiful in terms of CPU.
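The "simply retry the job if our machines get terminated" approach from the first bullet could look roughly like this. It is only a sketch: `SpotInterrupted` and `run_job` are hypothetical placeholders for however your own tooling submits a job and detects that the spot instances went away; nothing here is an EMR API.

```python
import time


class SpotInterrupted(Exception):
    """Hypothetical signal that the cluster lost its spot instances."""


def run_with_retries(run_job, max_attempts=3, wait_seconds=0.0):
    """Call run_job() until it succeeds or max_attempts is used up.

    run_job is a placeholder for whatever submits the job and blocks until
    it finishes; it is assumed to raise SpotInterrupted on instance loss.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except SpotInterrupted:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(wait_seconds)  # give spot prices a chance to settle
```

This only makes sense for jobs that are safe to rerun from scratch, which is usually true of periodic batch jobs.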
thanks for sharing.
The AWS noisy-neighbor problem is very often misunderstood. CPU steal time under Linux does NOT mean that somebody is stealing your CPU. It simply means that you wanted to use the CPU and the hypervisor gave it to another instance. This may happen because you have exceeded your quota, or because the scheduling algorithm selected another pending instance at that moment and will give the CPU back to you a bit later. In both cases your instance ends up with its fair share.
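If you want to see how much steal time your own instance is accumulating, here is a minimal sketch that computes it from two samples of the aggregate `cpu` line in `/proc/stat`. It assumes a Linux kernel recent enough to report the steal counter (the eighth number on that line); the function names are mine, not from any library.

```python
def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time between two samples that was reported as steal."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    # Counter order: user nice system idle iowait irq softirq steal [guest ...]
    # Guest time is already counted inside user/nice, so sum only the first eight.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()
```

To use it, call `read_cpu_line()` twice a few seconds apart and pass both results to `steal_percent`. A persistently high value means the hypervisor is often running something else when you want the CPU, which is consistent with either of the two causes described above.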
Great detailed explanations of steal time: https://support.cloud.engineyard.com/entries/22806937-Explan... and: http://www.stackdriver.com/understanding-cpu-steal-experimen.... The latter article is mentioned by OP but seems not to have been fully read or understood.
Why does killing and restarting an instance help? It likely moves the instance to a different hardware node with less active neighbors. When your neighbor is not active, your instance can use your neighbor's idle CPU cycles! You sort of become the noisy one. Still, the hypervisor will prevent this once the neighbor starts to fully utilize its CPU quota, and you are back to square one.
Amazon does not oversubscribe CPU according to their CTO: http://itknowledgeexchange.techtarget.com/cloud-computing/am...
Amazon specifically states that t1.micro instances do not guarantee CPU performance: "Micro instances are a very low-cost instance option, providing a small amount of CPU resources. Micro instances may opportunistically increase CPU capacity in short bursts when additional cycles are available. They are well suited for lower throughput applications and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance."
While CPU sharing is pretty well documented, the noisy-neighbor problem still exists for network and disk resources shared by multiple instances on the same hardware node. The only way to detect these problems is to track throughput and loss rate for the network and I/O stats for disk.
You are guaranteed to avoid the noisy-neighbor CPU problem by using AWS dedicated instances: http://aws.amazon.com/dedicated-instances/
I work for an APM (Application Performance Management) vendor; I have no business praising AWS.
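For the network side, a rough sketch of that kind of tracking on Linux is to diff two snapshots of `/proc/net/dev`. Treating interface-level receive drops as a stand-in for "loss rate" is my simplification (real packet loss is better measured end to end), and the function names are made up for the example.

```python
def parse_net_dev(text):
    """Map interface name -> selected counters from /proc/net/dev contents."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, _, rest = line.partition(":")
        cols = rest.split()
        if len(cols) < 16:
            continue  # not a counter line
        nums = [int(c) for c in cols[:16]]
        stats[name.strip()] = {
            "rx_bytes": nums[0], "rx_packets": nums[1], "rx_drop": nums[3],
            "tx_bytes": nums[8], "tx_packets": nums[9], "tx_drop": nums[11],
        }
    return stats


def rx_summary(before, after, iface, seconds):
    """Receive throughput and drop rate for iface between two snapshots."""
    a, b = parse_net_dev(before)[iface], parse_net_dev(after)[iface]
    packets = b["rx_packets"] - a["rx_packets"]
    drops = b["rx_drop"] - a["rx_drop"]
    seen = packets + drops  # assumes dropped packets are not in rx_packets
    return {
        "rx_bytes_per_s": (b["rx_bytes"] - a["rx_bytes"]) / seconds,
        "rx_drop_rate": drops / seen if seen else 0.0,
    }
```

Read the file twice with a known interval in between and feed both snapshots in. The numbers only become useful once you graph them over time and compare against a baseline from a known-quiet period.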
[Edit:spelling and clarity]
[Edit2: changed CPU scheduling description]
Given that distributed networks rely heavily on network latency and capacity, I found it very interesting to see the effect of a busy CPU propagate to slower network I/O.
Probably the hypervisor does not assign CPU every time it is requested, but it still manages to assign as much as needed, because in the end there is some idle time left.
Are they suggesting using HAProxy in your public subnet and ELBs in your private subnets? Is this their reference to avoiding using a NAT box (actually PAT)? Don't you still need a HAProxy box in each of your AZs?