“About 40 minutes into Hour 6 your instance no longer has any credits remaining. At this point, you’re now limited to the baseline performance of the instance, in this case, 10% of the vCPU.
At this point your application grinds to a halt.”
Agreed. As terrible as that is, it’s actually worse than that in my experience, because even while you have positive CPU credits, you can’t spend them down smoothly to achieve high sustained CPU - the scheduler gives you something more like a sine wave of performance. Which is mystifying to say the least and leads me to my final point about why T2 is so awful: it violates the principle of least astonishment. Why would anyone expect that a CPU is limited to a small fraction of its output? T2 should not be categorized in the “general computing” AWS instance family. It should be in a family all its own called “burstable” that explains the major difference between it and the general computing family.
I mean, their entire purpose and value proposition is explicitly based on not needing to support sustained high traffic. This is not something that needs to be learned the hard way. You don't even need to RTFM, you just need to read the first sentence of the first Google result snippet:
> T2 instances are low-cost, General Purpose instance type that provide a baseline level of CPU performance with the ability to burst above the baseline.
In fact, they are even explicitly labeled as "burstable performance instances" as you suggest in the first five words of the first paragraph of the first Google result for "aws t2": https://aws.amazon.com/ec2/instance-types/#burst
Totally agree that it should not be "general computing" categorized though. That category in general is too vague to be useful really. They should clearly demarcate between production service hosts and development-oriented instance types.
It's perfectly safe to run a production site on a t1.micro, as long as you follow good principles regarding monitoring and autoscaling.
You'd be a fool to run a production website on any instance type without those failsafes, but with them in place, a t1/t2.micro is no worse than any other. And in fact, since they boot so quickly, they are really nice with auto-scaling.
I definitely agree, for high traffic applications a compute or memory optimized instance is a better choice due to the consistent performance, especially when price isn't as much of a concern.
In situations where cost is more of a concern and the application has "bursty" requirements for performance, I still believe the T2 instance has its place.
The guys over at Cloudonaut have an excellent article on the importance of T2 burst credits to maintain application performance.
https://cloudonaut.io/burst-credits-of-t2-ec2-instances-need...
EDIT: The new T2 Unlimited option mentioned in the above article is also worth considering.
- executing recurring scheduled events with known estimate for cpu usage (cron jobs)
- potentially for non-overly busy build servers
I wouldn't use them for anything that needs production reliability though.
Only if you don't want to enable t2 unlimited -> https://docs.aws.amazon.com/en_us/cli/latest/reference/ec2/m...
But really seems like it would be simple enough to rig something up that would:
1. Trigger a CloudWatch alarm on insufficient T2 credits
2. Send a notification to an SNS topic when triggered
3. Have a Lambda function subscribed to the topic that enables T2 Unlimited
4. Have an email or Slack alert subscribed too, to be like "hey something's up, we enabled t2 unlimited for now"
Though CW alarms for this are limited to running every 5 min, so could take a bit to notify, or you might need to set your threshold a bit high at first. I usually prefer to configure a CW event rule for this kind of stuff, but I don't see any support for that out of the box.
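For what it's worth, here is a rough Python sketch of steps 1 to 3 above, not something I've run in production. Assumptions on my part: the alarm watches the `CPUCreditBalance` metric in the `AWS/EC2` namespace, the threshold of 20 credits and the alarm name are made up, and the SNS message is the usual CloudWatch alarm JSON with the instance ID under `Trigger.Dimensions`. The `ec2` parameter on the handler exists only so it can be exercised with a stub.

```python
import json


def credit_alarm_kwargs(instance_id, topic_arn, threshold=20.0):
    """Build keyword arguments for CloudWatch put_metric_alarm (steps 1 and 2).

    The threshold and alarm name are illustrative assumptions.
    """
    return {
        "AlarmName": f"low-cpu-credits-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,  # 5-minute period, matching the limit mentioned above
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic to notify
    }


def instance_ids_from_sns(event):
    """Pull instance IDs out of an SNS-wrapped CloudWatch alarm message.

    Assumes the alarm JSON lists dimensions as {"name": ..., "value": ...}.
    """
    ids = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def handler(event, context, ec2=None):
    """Lambda entry point (step 3): switch the alarming instances to unlimited."""
    if ec2 is None:
        import boto3  # imported lazily so the helpers above work without it
        ec2 = boto3.client("ec2")
    ids = instance_ids_from_sns(event)
    if ids:
        ec2.modify_instance_credit_specification(
            InstanceCreditSpecifications=[
                {"InstanceId": i, "CpuCredits": "unlimited"} for i in ids
            ]
        )
    return ids
```

You would pass `credit_alarm_kwargs(...)` to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)` once per instance, and the Lambda's role would need permission to modify the credit specification.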
At least they explain the math pretty decently for predicting the usage: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-insta...
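The math can be sketched in a few lines of Python. This is my own simplification, assuming one credit equals one vCPU at 100% for one minute, so a 10% baseline on one vCPU works out to 6 credits earned per hour. The function names are made up for illustration.

```python
def net_credits_per_hour(utilization, vcpus, earn_rate):
    """Credits gained (positive) or lost (negative) per hour.

    utilization: average CPU as a fraction (0.7 means 70%)
    vcpus:       number of vCPUs on the instance
    earn_rate:   credits earned per hour for the instance size
    """
    spend = utilization * vcpus * 60  # one credit per vCPU-minute at 100%
    return earn_rate - spend


def hours_until_empty(balance, utilization, vcpus, earn_rate):
    """Hours until the credit balance reaches zero, or None if it never does."""
    net = net_credits_per_hour(utilization, vcpus, earn_rate)
    if net >= 0:
        return None  # at or below baseline: the balance does not shrink
    return balance / -net
```

For example, `hours_until_empty(144, 0.7, 1, 6.0)` says an instance with 1 vCPU, 6 credits earned per hour and 144 credits banked (an assumed starting balance) lasts about 4 hours at a sustained 70%, since it burns 42 credits an hour and earns back only 6.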
https://github.com/terraform-providers/terraform-provider-aw...
Good to see it's becoming more common knowledge though.
I was sure to add a note to the article after a few people here mentioned it.
Thanks for sharing your thoughts :).
> The examples above are contrived but it does illustrate the danger of not properly architecting your AWS system and monitoring it once it’s in place (especially if you’re using T2 instances).
> I often see this issue with a website that receives increases in the baseline traffic over long periods of time with occasional spikes. If the instance type is not changed and sized appropriately, this eventually leads to all the accrued CPU credits being used paired with downtime and a significant amount of head scratching.
This is so true. Proper architecture and planning up front will save all kinds of headaches. Anyone serious about using EC2 needs to understand and use autoscaling groups with ELBs to manage scaling. If production availability is important, and you don't want to be bugged to fix something at 3AM, ASGs are a nice way to mitigate for potential AZ issues (you can do that with an ASG with max size of 1 even!). The biggest problem I always saw was that people would not take the time to test what happens when an ASG instance is terminated. Also, you can't be ssh'ing in to manually change things and expect those changes to stick around.
Proper monitoring and alerts is of course the other side to this.
Ex:
- Say your ASG is set to 4 max instances, min 2
- You get a sustained increase in traffic and scale up to 4 instances
- Sustained 70% across 4 instances consumes more credits than are earned
- As instances run out of CPU credits, they'll be pegged at 100% and stop responding correctly to TCP health checks.
- Autoscaling will see that it's passed the threshold for too many failures and will stop routing traffic there, and terminate the instance and add another
- Other instances fail at about the same time for the same reason (though hopefully this is staggered) and also get terminated and a new instance is created.
- each time an instance gets pegged, the elb traffic gets routed to one of the remaining ones, causing the remaining instances to consume their remaining cpu credits faster
Now, it's unlikely that the timings would work out so that all 4 instances are concurrently unavailable, and launching new instances is relatively quick. However it's still possible that each time an instance runs out of credits that it could be serving a request and thus send an error to the user. It also would create a lot of monitoring noise.
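To make the "remaining instances burn credits faster" part concrete, here is a toy Python simulation of the scenario above. It is a deliberately crude sketch: it assumes single-vCPU instances, an even load split by the ELB, an earn rate of 6 credits per hour, and no replacement instances at all, so it only shows the order and timing of instances running dry.

```python
def simulate_cascade(balances, total_load, earn_rate=6.0):
    """Return [(hour, instance_index), ...] for each instance that runs dry.

    balances:   starting credit balance per instance
    total_load: total CPU demand in vCPUs (4 instances at 70% is 2.8)
    earn_rate:  credits earned per hour per instance (assumed value)
    """
    remaining = dict(enumerate(balances))
    clock = 0.0
    events = []
    while remaining:
        # Load is split evenly; one vCPU cannot exceed 100%.
        util = min(1.0, total_load / len(remaining))
        net = earn_rate - util * 60.0  # credits per hour, negative when draining
        if net >= 0:
            break  # survivors are at or below baseline, nothing else runs dry
        idx = min(remaining, key=remaining.get)  # next instance to hit zero
        dt = remaining[idx] / -net
        clock += dt
        for i in remaining:
            remaining[i] += net * dt  # everyone drains at the same rate
        del remaining[idx]
        events.append((clock, idx))
    return events
```

With staggered balances such as `simulate_cascade([10, 20, 30, 40], 2.8)`, the gap between consecutive failures shrinks, because each failure pushes the survivors to a higher utilization.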
The other thing I could see being problematic is that depending on your ASG termination policy, when you scale back in, it's not guaranteed to terminate the instance with the fewest remaining CPU credits. So you may scale down with traffic at closer to 40% CPU, and run out of credits anyhow because the instance with the most credits was terminated on scale-in.
Edit: And I should add that the scale-in scenario can be especially problematic if a cooldown period is configured (default is 5 min). So if it just scaled back in to 2 instances, and those two were close to max and get pegged, it will not scale back up for 5 minutes. It will still attempt to replace them for health check failures, but once you get to a scenario where you only have 2 up, your likelihood of a full application outage goes up quite a bit (even if it's only for 2-3 minutes).
So I would say that even if using an ASG, special knowledge is required for effective use of T2 instances.
It really comes down to your use cases. If you have an infrequently used web app, or have a site that needs to run some small periodic tasks, then T2 instances are fantastic for this sort of thing, and they are super cheap too.
I normally run Sinatra/Padrino based web apps, and keep a close eye on my CPU credits via AWS's monitoring tools, and I have never gotten even close to using up the allocated credits.
I have nothing to do with any of these services
However, most of my projects are reliant on the entire AWS suite, and nearly all my projects use RDS for database storage and SES for handling emails etc., not to mention DynamoDB, S3, VPC, Elastic Beanstalk, CodeDeploy, CodePipeline etc. etc. which makes it easier to stay with the one provider and choose their cheapest instances to do the heavy lifting.
I just added a note about T2 Unlimited and will see about writing a follow up post explaining the new T2 Unlimited concepts.
EDIT: Added the note and gave a shout out, thanks again :).
Is it exempted?
From https://aws.amazon.com/lightsail/faq/
> Lightsail uses burstable performance instances that provide a baseline level of CPU performance with the additional ability to burst above the baseline.
I've added a note about T2 unlimited to my article, I will likely be writing up another article in the near future explaining T2 Unlimited concepts.
Thanks for the note!
[1] e.g., Linux on-demand for t2.xlarge comes to $135.49/month, whereas m5.xlarge is $140.16/month. I seem to remember last time I was doing reserved instance purchasing, the RI cost for m4 instances was actually less than equivalent t2 instances (I don't see it now, so maybe my memory is wrong).
Didn't get a chance to get to the bottom of this yet. Did anybody else notice something similar?
I do turn on t2 unlimited mode in case some horrible bug pegs the CPU.