For the love of god, stop using CPU limits on Kubernetes (updated)

(home.robusta.dev)

47 points · ciceryadam · 3y ago · 27 comments

24 comments · 7 top-level

ekimekim · 3y ago · 4 in thread

The argument against this is consistency. Without a limit set, you are only guaranteed up to your request's worth of CPU, but you will often be allowed to have more. This can create a false sense of security, as your application is working fine (even though it occasionally exceeds its request). Until one day, when a neighbor happens to get thirsty, and your application suddenly breaks. Limits front-load the brokenness so that it happens immediately instead of randomly.

stingraycharles · 3y ago

It’s a trade-off that needs to be considered on a case-by-case basis. This is one of those cases where “one rule fits all” doesn’t work.

zufallsheld · 3y ago

> Until one day, when a neighbor happens to get thirsty, and your application suddenly breaks. Limits front-load the brokenness so that it happens immediately instead of randomly.

Can you expand on this? Why should the application break? It still has its guaranteed cpu requests. If it breaks with the defined requests, shouldn't it always break?

sascha_sl · 3y ago

I've had this happen in the real world.

In this case it was a GraphQL implementation that gradually grew in size, complexity and scope. The team maintaining it never adjusted the initial resource requirements in about 2 years because it had never been a problem - until a different service (much larger by total CPU allocated) started consuming all the bursting it could get and the GraphQL service just stopped processing requests. Before any autoscaling could kick in, it went into CrashLoopBackOff with failing liveness probes.

You can and should catch such cases early with monitoring, but our platform team was extremely tiny, especially for how many developers it served.

kobalsky · 3y ago

Not OP, but it would break because your CPU request wasn't high enough from the start, and the problem was hidden because you had CPU to spare on the node. Once you don't have CPU to spare, the app breaks.

EDIT: the vertical pod autoscaler helps with this as it will adjust requirements on pods to make sure they don't over/under allocate.
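
To put numbers on this, here is a minimal sketch. It assumes a simplified model in which CPU requests act as proportional weights under contention and no container is given more than it demands. The pod names and millicore figures are hypothetical, and this is an illustration of the sharing behaviour, not the kernel scheduler's actual algorithm.

```python
def allocate_cpu(requests, demands, capacity):
    """Split `capacity` millicores by request weight, never exceeding demand.

    Simplified water-filling model: spare CPU left by satisfied containers
    is redistributed to the rest in proportion to their requests.
    Assumes every request is a positive number.
    """
    alloc = {name: 0.0 for name in requests}
    active = {name for name in requests if demands[name] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(requests[n] for n in active)
        satisfied = set()
        handed_out = 0.0
        for n in active:
            fair = remaining * requests[n] / total_weight
            give = min(fair, demands[n] - alloc[n])
            alloc[n] += give
            handed_out += give
            if alloc[n] >= demands[n] - 1e-9:
                satisfied.add(n)
        remaining -= handed_out
        if not satisfied:
            break  # everyone still wants more: the node is fully contended
        active -= satisfied
    return alloc


# Hypothetical 4-core node: the app requests 500m but really needs 1500m.
requests = {"graphql": 500, "neighbor": 3500}

quiet = allocate_cpu(requests, {"graphql": 1500, "neighbor": 1000}, 4000)
busy = allocate_cpu(requests, {"graphql": 1500, "neighbor": 4000}, 4000)

print(quiet["graphql"], busy["graphql"])
```

In this model the app gets everything it needs while the neighbor is quiet, and is squeezed back to its request once the neighbor wants its full share, which is the point at which an undersized request becomes visible.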

rahen · 3y ago · 3 in thread

We use CPU limits at work for the simple reason we can't autoscale deployments without having them set. An HPA will deploy a new pod each time the CPU limit has been reached for more than 30 seconds.

The whole point is to scale out, not up.

sgarland · 3y ago

CPU utilization isn't a great metric to scale on. If the app is running at 90% CPU and all requests are being processed well within your SLIs, it's fine.

KEDA [0] lets you scale on any number of things - basically if you scrape it with Prometheus, you can scale on it.

[0] https://keda.sh

dilyevsky · 3y ago

That is incorrect - the HPA works off the CPU request (and usage), not the limit.
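
For reference, a minimal sketch of the scaling rule as the Kubernetes HPA documentation describes it: utilization is usage as a percentage of the request, and the desired replica count is ceil(currentReplicas * currentUtilization / targetUtilization). The function names and numbers below are hypothetical, every pod is assumed to have the same request, and the controller's tolerance band and stabilization behaviour are left out.

```python
import math


def cpu_utilization_percent(usage_millicores, request_millicores):
    """Average CPU usage across pods as a percentage of the per-pod request."""
    average_usage = sum(usage_millicores) / len(usage_millicores)
    return 100.0 * average_usage / request_millicores


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Documented HPA rule: ceil(current * currentMetric / desiredMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Hypothetical deployment: 4 pods, each requesting 500m, averaging 450m of usage.
usage = [450, 450, 450, 450]
utilization = cpu_utilization_percent(usage, request_millicores=500)
replicas = desired_replicas(4, utilization, target_utilization=60)

print(utilization, replicas)
```

Note that no limit appears anywhere in the calculation; only usage and request do.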

rahen · 3y ago

When you do a "kubectl get hpa", the 0% and 100% rules are set according to the CPU request and limit. The current usage will be set to "<unknown>" if either is missing, and the deployment won't autoscale.

skyde · 3y ago · 3 in thread

I don’t agree with the article's recommendation for memory: « Always set your memory requests equal to your limits »

You can layer high priority service and low priority service better if you use some buffer.

dilyevsky · 3y ago

The problem with that approach (overcommit) in k8s is that when the OOM killer kicks in, your containers die immediately, so you can easily corrupt internal state if you're not careful. It’s also easy to accidentally kill the wrong process group, because the OOM killer calculates priority based on memory usage, among other things, so outsized processes can be killed before tiny lower-priority ones.

clhodapp · 3y ago

My current model for memory on k8s tends to agree with the article.

Would it be possible for you to explain "you can layer high priority service and low priority service better if you use some buffer" further?

skyde · 3y ago

I mean, think of low-priority services as services that are not latency-sensitive (background jobs).

You want those low-priority cgroups to use the extra memory if it’s available (not used by a high-priority cgroup).

If a high-priority cgroup needs the memory, it steals it from the low-priority cgroup, down to the minimum guaranteed to that low-priority cgroup.

The resulting memory contention in the low-priority cgroup turns into IO pressure from page faults. The IO limit on the low-priority cgroup then caps how much IO it can generate by throttling it.

Meanwhile, the high-priority cgroup uses its guaranteed CPU, guaranteed IO and guaranteed memory with no hiccups.

To learn more, check « fbtax2 memory controller configuration » at https://facebookmicrosites.github.io/cgroup2/docs/memory-con...

988747 · 3y ago · 2 in thread

The reason to never use CPU limits is different from the ones stated in the article. In short: the Linux kernel SUCKS. More specifically, the "Completely Fair Scheduler" (CFS) sucks at enforcing those limits. Setting any limit at all causes CFS to waste like half of the CPU cycles on enforcing it, and only the other half is available for any useful work.

dilyevsky · 3y ago

The double accounting was fixed a while ago. The real issue is poor defaults (a 100ms period is too long), and folks generally aren’t aware that a limit actually sets a CFS quota.
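
A minimal sketch of why the period length matters, assuming the usual mapping of a CPU limit to a CFS quota (quota = limit in cores x period) and a container whose threads all stay busy. The helper names and figures are hypothetical, and the stall is a simple worst-case estimate, not a measurement.

```python
def cfs_quota_us(limit_millicores, period_us=100_000):
    """CPU time (microseconds) the container may consume per CFS period."""
    return limit_millicores * period_us // 1000


def worst_case_stall_us(limit_millicores, busy_threads, period_us=100_000):
    """Time per period spent throttled if `busy_threads` run flat out.

    The threads burn the shared quota in parallel, then the whole container
    waits until the next period starts.
    """
    quota = cfs_quota_us(limit_millicores, period_us)
    runtime_until_throttled = quota / busy_threads
    return max(0.0, period_us - runtime_until_throttled)


# Hypothetical container: 500m limit, 4 busy threads.
default_stall = worst_case_stall_us(500, busy_threads=4)                    # 100ms period
shorter_stall = worst_case_stall_us(500, busy_threads=4, period_us=10_000)  # 10ms period

print(cfs_quota_us(500), default_stall, shorter_stall)
```

Under these assumptions the container is throttled for the same fraction of each period either way, but with the 100ms default a single request can sit frozen for most of a tenth of a second, while a shorter period bounds each stall to a few milliseconds.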

oogali · 3y ago

Can’t you use a different scheduler on your nodes?

lazyant · 3y ago · 2 in thread

Looks to me like the author hasn't run different workloads in different production clusters of any complexity. The advice is fine for a small, predictable cluster but too simplistic for any real, complex cluster.

sgarland · 3y ago

We run hundreds of nodes (autoscaling cluster) across multiple clusters, with wildly varying workloads, and almost none of the apps have CPU limits set. No issues.

dilyevsky · 3y ago

Same here. We have eliminated all limits entirely. Same scale (maybe slightly larger). Diverse workloads. In principle we could bring CFS quotas back with something like a 1ms period, but we’d probably look into cpuset pinning first.

birdyrooster · 3y ago · 2 in thread

I don't think this is a remotely compelling argument to never use limits.

dhsysusbsjsi · 3y ago

The summary should be: use a request that satisfies the long-term minimum CPU required at any given time. Then, if you’re happy to allow it to burst, don’t set a limit. If you want consistency and constant runtime, set a limit. If you find that your container only works when it uses the “free” burst time, and breaks when it goes back to the requested CPU, then your request is too low.

skyde · 3y ago

Good summary!

iknownothow · 3y ago · 1 in thread

This advice comes from tunnel vision and makes perfect sense if you know that you have exactly two pods running at any given time. But if you have exactly two pods, then why bother using k8s? IIRC one of the major selling points of K8s was on-demand scaling, or autoscaling horizontally. Which means the number of pods you have in the cluster is dynamic.

In the context of pods dynamically spinning up and spinning down, it's bad when a pod replica can't be allocated in the cluster "predictably", but there is nothing worse than when a new pod (new deployment) fails because "Marcus the pod" drank all the water and now I have to call DevOps and wait god knows how long before they spin up a new node to guarantee a spot for the new pod.

Bin-packing is already an NP-hard problem. If you remove limits from CPU, then you're adding probabilities into the mix. So, for the love of god, always use limits unless you have a very specific use case.

sgarland · 3y ago

> I have to call DevOps and wait god knows how long before they spin up a new node to guarantee a spot for the new pod.

Cluster autoscaler. No need to call anyone.
