AWS EC2 T2 Instances Demystified: Don’t Learn the Hard Way

(roberttisdale.com)

147 points · rtisdale · 8y ago · 38 comments

timjulien · 8y ago

As I myself learned the hard way, T2 instances are awful for production websites with sustained high traffic. The author points out the biggest issue:

“About 40 minutes into Hour 6 your instance no longer has any credits remaining. At this point, you’re now limited to the baseline performance of the instance, in this case, 10% of the vCPU.

At this point your application grinds to a halt.”

Agreed. As terrible as that is, it’s actually worse than that in my experience, because even while you have a positive CPU credit balance, you can’t spend it down smoothly to achieve high sustained CPU: the scheduler gives you something more like a sine wave of performance. That is mystifying to say the least, and it leads me to my final point about why T2 is so awful: it violates the principle of least astonishment. Why would anyone expect a CPU to be limited to a small fraction of its output? T2 should not be categorized in the “general computing” AWS instance family. It should be in a family all its own called “burstable” that explains the major difference between it and the general computing family.
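For anyone who wants to sanity-check a timeline like the one quoted above, here is a rough sketch of the credit arithmetic. It assumes the published t2.micro figures (6 credits earned per hour, which is where the 10% baseline comes from, and one credit being one vCPU at 100% for one minute). The function name and the starting balance in the example are made up for illustration and are not taken from the article.

```python
def hours_until_throttled(balance, utilization, credits_per_hour=6.0, vcpus=1):
    """Estimate hours of sustained load before the credit balance reaches zero.

    balance:          starting CPU credit balance (hypothetical input)
    utilization:      sustained CPU utilization per vCPU, from 0.0 to 1.0
    credits_per_hour: credits earned per hour (6 is the t2.micro figure)
    vcpus:            number of vCPUs on the instance

    Returns None when the load is at or below baseline, meaning the
    balance never drains.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    # One credit is one vCPU-minute at 100%, so an hour at full load costs 60.
    spent_per_hour = utilization * 60.0 * vcpus
    net_burn = spent_per_hour - credits_per_hour
    if net_burn <= 0:
        return None
    return balance / net_burn


# Hypothetical example: 30 credits in the bank, 50% sustained load.
print(hours_until_throttled(30, 0.5))  # 1.25
```

This ignores the uneven scheduling described above, so treat the result as an upper bound on how long the burst lasts rather than a prediction.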

not_kurt_godel · 8y ago

> T2 are awful for production websites with sustained high traffic

I mean, their entire purpose and value proposition is explicitly based on not needing to support sustained high traffic. This is not something that needs to be learned the hard way. You don't even need to RTFM, you just need to read the first sentence of the first Google result snippet:

> T2 instances are low-cost, General Purpose instance type that provide a baseline level of CPU performance with the ability to burst above the baseline.

In fact, they are even explicitly labeled as "burstable performance instances" as you suggest in the first five words of the first paragraph of the first Google result for "aws t2": https://aws.amazon.com/ec2/instance-types/#burst

sudosteph · 8y ago

Back when I worked at AWS and t1s were still the big thing, we had a policy of telling customers to never run a production site on a t1.micro. I would say the same about t2s for the most part. AWS needs to do a better job of putting big red warnings all over the docs or console or something. Even though the details really are all there in the docs, it's very easy for people to get lulled into a false sense of security if they're following along with a blog post or just not going over the docs in depth. But really, reading the service docs in depth does pay off.

Totally agree that it should not be "general computing" categorized though. That category in general is too vague to be useful really. They should clearly demarcate between production service hosts and development-oriented instance types.

jedberg · 8y ago

> we had a policy of telling customers to never run a production site on a t1.micro

It's perfectly safe to run a production site on a t1.micro, as long as you follow good principles regarding monitoring and autoscaling.

You'd be a fool to run a production website on any instance type without those failsafes, but with them in place, a t1/t2.micro is no worse than any other. And in fact, since they boot so quickly, they are really nice with auto-scaling.

rtisdale (OP) · 8y ago

Howdy, I'm the author :)

I definitely agree: for high-traffic applications, compute- or memory-optimized instances are better choices due to their consistent performance, especially when price isn't as much of a concern.

In situations where cost is more of a concern and the application has "bursty" performance requirements, I still believe the T2 instance has its place.

The guys over at Cloudonaut have an excellent article on the importance of T2 burst credits to maintain application performance.

https://cloudonaut.io/burst-credits-of-t2-ec2-instances-need...

EDIT: The new T2 Unlimited option mentioned in the above article is also worth considering.

_pmf_ · 8y ago

What would be the purpose (for the layman)? Handling spiky event ingress?

sudosteph · 8y ago

They're still good for:

- non-production dev servers

- executing recurring scheduled events with a known estimate for CPU usage (cron jobs)

- potentially for build servers that aren't overly busy

I wouldn't use them for anything that needs production reliability though.

Fomite · 8y ago

I have a very simple web app that, when it is being used (which is rarely), is fairly computationally demanding (in a relative sense). That kind of load suits the T2 instance really well in my experience.

sudosteph · 8y ago

> You will likely find it is difficult or impossible to access the server to take any measures to solve the issue until the CPU credits have accrued.

Only if you don't want to enable t2 unlimited -> https://docs.aws.amazon.com/en_us/cli/latest/reference/ec2/m...

But it really seems like it would be simple enough to rig something up that would:

1. Trigger a CloudWatch alarm on insufficient T2 credits

2. Send a notification to an SNS topic when triggered

3. Have a Lambda function subscribed to the topic enable T2 Unlimited

4. Have a subscribed email or Slack alert say something like "hey something's up, we enabled t2 unlimited for now"

Though CW alarms for this are limited to running every 5 min, so could take a bit to notify, or you might need to set your threshold a bit high at first. I usually prefer to configure a CW event rule for this kind of stuff, but I don't see any support for that out of the box.

At least they explain the math pretty decently for predicting the usage: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-insta...
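The alarm, SNS, and Lambda chain described in the numbered steps above could be sketched roughly as follows. This is an illustration, not a tested deployment: the function names are hypothetical, the handler assumes the alarm has an `InstanceId` dimension, and the shape of the SNS message (`Trigger.Dimensions` entries with `name` and `value` keys) is an assumption that should be checked against a real notification. The only AWS call used is boto3's `modify_instance_credit_specification`.

```python
import json


def instance_ids_from_alarm_event(event):
    """Collect InstanceId dimensions from CloudWatch alarm messages sent via SNS.

    Assumed message shape: {"Trigger": {"Dimensions": [{"name": ..., "value": ...}]}}
    """
    ids = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def handler(event, context, ec2_client=None):
    """Lambda entry point: switch the alarming instance(s) to unlimited mode.

    ec2_client is injectable so the logic can be exercised without AWS access.
    """
    instance_ids = instance_ids_from_alarm_event(event)
    if not instance_ids:
        return {"enabled": []}
    if ec2_client is None:
        import boto3  # imported lazily; bundled with the Lambda Python runtime
        ec2_client = boto3.client("ec2")
    ec2_client.modify_instance_credit_specification(
        InstanceCreditSpecifications=[
            {"InstanceId": iid, "CpuCredits": "unlimited"} for iid in instance_ids
        ]
    )
    return {"enabled": instance_ids}
```

The email or Slack notification from step 4 would just be another subscription on the same SNS topic, so it does not need any code here. The Lambda's role would also need permission to modify the instance credit specification.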

jweir · 8y ago

Terraform just released support for the unlimited flag via the 'credit_specification' field

https://github.com/terraform-providers/terraform-provider-aw...

sudosteph · 8y ago

And CloudFormation users can specify it here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

Good to see it's becoming more common knowledge though.

rtisdale (OP) · 8y ago

Great points on the T2 Unlimited and an interesting solution for enabling T2 Unlimited and alerting.

I was sure to add a note to the article after a few people here mentioned it.

Thanks for sharing your thoughts :).

sudosteph · 8y ago

Sorry, I already commented but did want to recognize an important point OP made.

> The examples above are contrived but it does illustrate the danger of not properly architecting your AWS system and monitoring it once it’s in place (especially if you’re using T2 instances).

> I often see this issue with a website that receives increases in the baseline traffic over long periods of time with occasional spikes. If the instance type is not changed and sized appropriately, this eventually leads to all the accrued CPU credits being used paired with downtime and a significant amount of head scratching.

This is so true. Proper architecture and planning up front will save all kinds of headaches. Anyone serious about using EC2 needs to understand and use autoscaling groups with ELBs to manage scaling. If production availability is important, and you don't want to be bugged to fix something at 3AM, ASGs are a nice way to mitigate potential AZ issues (you can do that with an ASG with a max size of 1 even!). The biggest problem I always saw was that people would not take the time to test what happens when an ASG instance is terminated. Also, you can't be ssh'ing in to manually change things and expect those changes to stick around.

Proper monitoring and alerting are of course the other side of this.

sudhirj · 8y ago

If I remember correctly, just setting autoscaling based on CPU usage already works fine and does the right thing with T2 instances as well: if you don't have enough credits to drive your load, your CPU is going to be stuck at 100%, irrespective of credit balance or instance size. So the common-sense method of scaling up at 70% / 80% CPU capacity should already work fine for any instance, no special knowledge required.

sudosteph · 8y ago

If your load is still consistently high enough, it is possible to be caught in an ASG cycling situation with t2s that can affect your application.

Ex:

- Say your ASG is set to 4 max instances, min 2

- You get a sustained increase in traffic and scale up to 4 instances

- Sustained 70% across 4 instances consumes more credits than are earned

- As instances run out of CPU credits, they'll be pegged at 100% and stop responding correctly to TCP health checks.

- Autoscaling will see that it's passed the threshold for too many failures and will stop routing traffic there, and terminate the instance and add another

- Other instances fail at about the same time for the same reason (though hopefully this is staggered) and also get terminated and a new instance is created.

- Each time an instance gets pegged, the ELB traffic gets routed to one of the remaining ones, causing the remaining instances to consume their remaining CPU credits faster

Now, it's unlikely that the timings would work out so that all 4 instances are concurrently unavailable, and launching new instances is relatively quick. However it's still possible that each time an instance runs out of credits that it could be serving a request and thus send an error to the user. It also would create a lot of monitoring noise.

The other thing I could see being problematic is that, depending on your ASG termination policy, when you scale back in, it's not guaranteed to terminate the instance with the fewest remaining CPU credits. So you may scale down with traffic at closer to 40% CPU, and run out of credits anyhow because the instance with the most credits was terminated on scale-in.

Edit: And I should add that the scale-in scenario can be especially problematic if a cooldown period is configured (default is 5 min). So if it just scaled back in to 2 instances, and those two were close to max and get pegged, it will not scale back up for 5 minutes. It will still attempt to replace them for health check failures, but once you get to a scenario where you only have 2 up, your likelihood of a full application outage goes up quite a bit (even if it's only for 2-3 minutes).

So I would say that even if using an ASG, special knowledge is required for effective use of T2 instances.
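To make the scale-in concern concrete: as far as I know, none of the stock ASG termination policies look at credit balance, so picking the instance with the fewest credits is something you would have to do yourself. A minimal sketch of that selection step follows. The helper name and the sample balances are hypothetical; in practice the balances would come from the `CPUCreditBalance` CloudWatch metric, and the chosen instances would then be terminated explicitly rather than left to the policy.

```python
def pick_scale_in_victims(credit_balances, count):
    """Return the `count` instance IDs with the lowest CPU credit balance.

    credit_balances: dict mapping instance ID -> current credit balance
    count:           how many instances the scale-in should remove
    Ties are broken by instance ID so the result is deterministic.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    ranked = sorted(credit_balances, key=lambda iid: (credit_balances[iid], iid))
    return ranked[:count]


# Hypothetical fleet of four instances scaling in to two.
balances = {"i-a": 120.0, "i-b": 4.5, "i-c": 80.0, "i-d": 12.0}
print(pick_scale_in_victims(balances, 2))  # ['i-b', 'i-d']
```

Removing the nearly drained instances first leaves the survivors with the most headroom, which is the opposite of the failure case described above.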

cyberferret · 8y ago

The post raises some good points, but in reality, I've been using T2 instances on my projects for a long time with good results.

It really comes down to your use cases. If you have an infrequently used web app, or have a site that needs to run some small periodic tasks, then T2 instances are fantastic for this sort of thing, and they are super cheap too.

I normally run Sinatra/Padrino based web apps, and keep a close eye on my CPU credits via AWS's monitoring tools, and I have never gotten even close to using up the allocated credits.

2RTZZSro · 8y ago

If you are not looking for an exotic instance type or instance type flexibility, Vultr provides superior servers at every comparable EC2 price point. Additionally, Vultr provides an extensible external hard drive, so-called "block storage", for a very low rate. Digital Ocean is great as well. Keep your eyes peeled for deals on https://lowendbox.com/

I have nothing to do with any of these services

cyberferret · 8y ago

I do use Digital Ocean for some projects, and have been looking at Vultr recently as well.

However, most of my projects are reliant on the entire AWS suite, and nearly all my projects use RDS for database storage and SES for handling emails etc., not to mention DynamoDB, S3, VPC, Elastic Beanstalk, CodeDeploy, CodePipeline etc. etc. which makes it easier to stay with the one provider and choose their cheapest instances to do the heavy lifting.

rosege · 8y ago

No mention of the newish T2 unlimited setting either

rtisdale (OP) · 8y ago

Author here, thanks for the suggestion!

I just added a note about T2 Unlimited and will see about writing a follow up post explaining the new T2 Unlimited concepts.

EDIT: Added the note and gave a shout out, thanks again :).

rosege · 8y ago

Great thanks - sent your article to some colleagues who need to know this stuff :-)

01walid · 8y ago

I'm wondering what the position of Amazon Lightsail is in all of this?

Is it exempted?

laCour · 8y ago

Lightsail instances are just T2 instances, and they are not exempt from CPU credits.

From https://aws.amazon.com/lightsail/faq/

> Lightsail uses burstable performance instances that provide a baseline level of CPU performance with the additional ability to burst above the baseline.

appdrag · 8y ago

I'm really happy with T2 instances, and T2 Unlimited has solved the issue described in this article!

rtisdale (OP) · 8y ago

Howdy appdrag.

I've added a note about T2 Unlimited to my article, and I will likely be writing up another article in the near future explaining T2 Unlimited concepts.

Thanks for the note!

antoncohen · 8y ago

Why use t2 instances of t2.large or larger? The price difference to go to m5 instances is negligible[1], and with m5 you don't have to worry about CPU credits and you get better networking.

[1] e.g., Linux on-demand for t2.xlarge comes to $135.49/month, whereas m5.xlarge is $140.16/month. I seem to remember that last time I was doing reserved instance purchasing, the RI cost for m4 instances was actually less than for equivalent t2 instances (I don't see it now, so maybe my memory is wrong).
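To put a number on "negligible", here is the arithmetic on the two monthly prices quoted in the footnote. The prices are the ones from the comment (they will have changed since); the helper name is just for illustration.

```python
T2_XLARGE_MONTHLY = 135.49  # USD/month, as quoted in the footnote
M5_XLARGE_MONTHLY = 140.16  # USD/month, as quoted in the footnote


def premium(cheaper, pricier):
    """Return (absolute difference in USD, difference as a percentage of the cheaper price)."""
    diff = pricier - cheaper
    return round(diff, 2), round(100.0 * diff / cheaper, 1)


print(premium(T2_XLARGE_MONTHLY, M5_XLARGE_MONTHLY))  # (4.67, 3.4)
```

So at those prices the m5.xlarge costs about $4.67 more per month, roughly 3.4%.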

benmmurphy · 8y ago

you get the same problem with gp2 EBS storage when you have less than the maximum baseline performance. if you have sustained peak IO then the disk will eventually have a big performance drop. i guess people are less likely to hit this wall than the T2 CPU wall. it is worth checking how long you can run at peak load for your application before you get throttled and how your application behaves when it gets throttled.
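A rough way to check how long a gp2 volume can run at peak before it hits that wall is sketched below. It assumes the gp2 figures as I recall them from the EBS documentation (baseline of 3 IOPS per GiB with a floor of 100, bursting to 3,000 IOPS, and a full bucket of 5.4 million I/O credits); verify them against the current docs before relying on the result. The function names are made up for illustration.

```python
GP2_BURST_IOPS = 3000           # assumed burst ceiling for gp2
GP2_BUCKET_CREDITS = 5_400_000  # assumed size of a full I/O credit bucket


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume: 3 per GiB, never below 100."""
    return max(100, 3 * size_gib)


def gp2_burst_seconds(size_gib, credits=GP2_BUCKET_CREDITS):
    """Seconds a volume can sustain burst IOPS before dropping to baseline.

    Returns None when the baseline already meets or exceeds the burst rate,
    in which case there is no performance drop to run into.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    # Credits drain at the burst rate and refill at the baseline rate.
    return credits / (GP2_BURST_IOPS - baseline)


# A 100 GiB volume starting with a full bucket.
print(gp2_burst_seconds(100))  # 2000.0
```

Under these assumptions a 100 GiB volume gets about 33 minutes at full burst, and a volume of 1,000 GiB or more never throttles, which matches the "less than the maximum baseline performance" condition above.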

pritambarhate · 8y ago

Has anyone noticed that even after enabling T2 Unlimited it takes a few minutes for the application to become responsive again? I have an application with a spiky load. I have turned on T2 Unlimited for this server. But still, I noticed that the application was becoming unresponsive for a few minutes when there was a continuous load.

Didn't get a chance to get to the bottom of this yet. Did anybody else notice something similar?

inopinatus · 8y ago

I’ve found them really great for a memory-bound Rails app that needs very little CPU. As ever, choosing the right instance type is a matter of mechanical sympathy.

I do turn on t2 unlimited mode in case some horrible bug pegs the CPU.

budhajeewa · 8y ago

Any idea whether comparable compute instances in GCP behave the same way?
