How We Found 7 TiB of Memory Just Sitting Around

(render.com)

207 points · anurag · 7mo ago · 74 comments

27 comments · 6 top-level

Aeolun · 7mo ago · 10 in thread

I read this and I have to wonder, did anyone ever think it was reasonable that a cluster that apparently needed only 120gb of memory was consuming 1.2TB just for logging (or whatever vector does)

devjab · 7mo ago

We're a much smaller scale company and the cost we lose on these things is insignificant compared to what's in this story. Yesterday I was improving the process for creating databases in our Azure and I stumbled upon a subscription which was running 7 MSSQL servers for 12 databases. These weren't elastic and they were each paying a license that we don't have to pay because we qualify for the base cost through our contract with our Microsoft partner. This company has some of the tightest control over their cloud infrastructure out of any organisation I've worked with.

This is anecdotal, but if my experiences aren't unique then there is a real lack of reasonableness in DevOps.

ffsm8 · 7mo ago

Isn't that mostly down to the fact the vast majority of devs explicitly don't want to do anything wrt Ops?

DevOps has - ever since its originally well-meaning inception (by Netflix iirc?) - been implemented across our industry as an effective cost cutting measure, forcing devs that didn't see it as their job to also handle it.

Which consequently means they're not interfacing with it whatsoever. They do as little as they can get away with, which inevitably means things are being done with borderline malicious compliance... Or just complete incompetence.

I'm not even sure I'd blame these devs in particular. The devs just saw it as a quick bonus generator for the MBA in charge of this rebranding while offloading more responsibilities onto their shoulders.

DevOps made total sense in the work culture where this concept was conceived - Netflix was well known at that point to only ever employ senior devs. However, in the context of the average 9-5 dev, who often knows a lot less than even some enthusiastic Jrs... Let's just say that it's incredibly dicey whether it's successful in practice.

mustyoshi · 7mo ago

I politely disagree. I spent maybe 8 hours over a week rightsizing a handful of heavy deployments from a previous team and reduced their peak resource usage by implementing better scaling policies. Before the new scaling policy the service would scale out and new pods would remain idle and ultimately get terminated without ever responding to a request quite frequently.

The service dashboards already existed, all I had to do was a bit of load testing and read the graphs.

It's not too much extra work to make sure you're scaling efficiently.

FroshKiller · 7mo ago

The first time my director asked me if I'd ever heard of DevOps, I said, "Sure, doing two jobs for one paycheck." I'm a software developer, buddy. I write the programs. Leave me out of running them.

bstack · 7mo ago

Author here: You’d be surprised what you don’t notice given enough nodes and slow enough resource growth over time! Out of the total resource usage in these clusters even at the high water mark for this daemonset it was still a small overall portion of the total.

Aeolun · 7mo ago

I’m not sure if that makes it better or worse.

fock · 7mo ago

how large are the clusters then?

formerly_proven · 7mo ago

It probably doesn't help that the first line of treatment for any error is to blindly increase memory request/limit and claim it's fixed (preferably without looking at the logs once).

fock · 7mo ago

we have on-prem with heavy spikes (our batch workload can utilize the 20TB of memory in the cluster easily) and we just don't care much and add 10% every year to the hardware requested. Compared to employing people or paying other vendors (relational databases with many TB-sized tables...) this is just irrelevant.

Sadly devs are incentivized by that and going towards the cloud might be a fun story. Given the environment I hope they scrap the effort sooner rather than later, buy some Oxide systems for the people who need to iterate faster than the usual process of getting a VM and replace/reuse the 10% of the company occupied with the cloud (mind you: no real workload runs there yet...) to actually improve local processes...

g-mork · 7mo ago

Somewhat unrelated, but you just tied wasteful software design to high IT salaries, and also suggested a reason why Russian programmers might seem, on the whole, to be far more effective than we are.

I wonder if msft simply cut dev salaries by 50% in the 90s, would it have had any measurable effect on windows quality by today

shanemhansen · 7mo ago · 5 in thread

The unreasonable effectiveness of profiling and digging deep strikes again.

hinkley · 7mo ago

The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.

Negitivefrags · 7mo ago

I think the biggest tool is higher expectations. Most programmers really haven't come to grips with the idea that computers are fast.

If you see a database query that takes 1 hour to run, and only touches a few gb of data, you should be thinking "Well nvme bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
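The back-of-envelope estimate in that paragraph can be sketched as a few lines of arithmetic. The 4 GB data size and 4 GB/s sequential read bandwidth below are assumed figures chosen only for illustration; the comment says just "a few gb" and "multiple gigabytes per second".

```python
GB = 10**9  # decimal gigabyte, in bytes


def scan_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read data_bytes once at the given bandwidth."""
    return data_bytes / bandwidth_bytes_per_s


# Assumed figures, not taken from the thread:
data = 4 * GB        # "a few gb" touched by the query
bandwidth = 4 * GB   # "multiple gigabytes per second" of NVMe read bandwidth

ideal = scan_seconds(data, bandwidth)  # hardware-bound estimate, in seconds
observed = 3600.0                      # the 1-hour query from the comment
slowdown = observed / ideal            # how far the query is from that bound

print(f"ideal: {ideal:.1f}s, observed: {observed:.0f}s, gap: {slowdown:.0f}x")
```

This is only a lower bound on a single sequential pass. Random access, joins, and CPU work push the real number up, but rarely by three orders of magnitude.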

The idea that anyone would accept a request to a website taking longer than 30ms (the time it takes for a game to render its entire world including both the CPU and GPU parts at 60fps) is insane, and nobody should really accept it, but we commonly do.

zahlman · 7mo ago

> The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.

> There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried.

It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.

jesse__ · 7mo ago

Broadly agree.

I'm curious, what're the profilers you know of that tried to be better? I have a little homebrew game engine with an integrated profiler that I'm always looking for ideas to make more effective.

seg_lol · 7mo ago

Unreasonable effectiveness of looking.

nitinreddy88 · 7mo ago · 3 in thread

The other way to look at it is to ask why adding the NS label causes such a large memory footprint in Kubernetes. Wouldn't fixing that (it could be a much bigger design change) benefit the whole Kube community?

bstack · 7mo ago

Author here: yeah that's a good point. tbh I was mostly unfamiliar with Vector so I took the shortest path to the goal, but that could be an interesting followup. It does seem like there's a lot of bytes per namespace!

stackskipton · 7mo ago

You mentioned in the blog article that it's doing a listwatch. A list-watch registers with the Kubernetes API to get a list of all objects AND a notification whenever anything changes in the objects you have registered for. A bunch of Vector pods saying "Hey, send me a notification when anything with namespaces changes" and poof goes your memory, keeping track of who needs to know what.
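A toy model of that list-then-watch pattern, for anyone who hasn't seen it spelled out. Everything here (`ToyApiServer`, `ToyAgent`, the `list_watch` method) is a hypothetical in-memory stand-in and not the Kubernetes client API. It only shows why both the watcher registrations and the per-agent caches grow with the number of agents.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyApiServer:
    """In-memory stand-in for an API server. NOT the real Kubernetes API."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict[str, dict]] = defaultdict(dict)
        self._watchers: Dict[str, List[Callable[[str, str, dict], None]]] = defaultdict(list)

    def list_watch(self, kind: str, on_event: Callable[[str, str, dict], None]) -> Dict[str, dict]:
        """Register a watcher for `kind` and return a snapshot of current objects."""
        self._watchers[kind].append(on_event)
        return dict(self._objects[kind])

    def apply(self, kind: str, name: str, obj: dict) -> None:
        """Create or update an object, then notify every registered watcher."""
        event_type = "MODIFIED" if name in self._objects[kind] else "ADDED"
        self._objects[kind][name] = obj
        for notify in self._watchers[kind]:
            notify(event_type, name, obj)

    def watcher_count(self, kind: str) -> int:
        return len(self._watchers[kind])


class ToyAgent:
    """Hypothetical per-node agent that caches every namespace it hears about."""

    def __init__(self, server: ToyApiServer) -> None:
        self.cache = server.list_watch("namespace", self._on_event)

    def _on_event(self, event_type: str, name: str, obj: dict) -> None:
        self.cache[name] = obj


server = ToyApiServer()
server.apply("namespace", "default", {"labels": {}})

agents = [ToyAgent(server) for _ in range(3)]  # one agent per node
server.apply("namespace", "team-a", {"labels": {"tier": "paid"}})

cached_entries = sum(len(agent.cache) for agent in agents)
print(server.watcher_count("namespace"), cached_entries)
```

In this sketch the total number of cached entries is agents times namespaces, so a daemonset on every node watching every namespace multiplies the two.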

At this point, I wonder if instead of relying on daemonsets, you just gave every namespace a vector instance that was responsible for that namespace and pods within. ElasticSearch or whatever you pipe logging data to might not be happy with all those TCP connections.

Just my SRE brain thoughts.

fells · 7mo ago

>you just gave every namespace a vector instance that was responsible for that namespace and pods within.

Vector is a daemonset, because it needs to tail the log files on each node. A single vector per namespace might not reside on the nodes that each pod is on.

hinkley · 7mo ago · 2 in thread

Keys require O(log n) space per key, or n log n for the entire data set, simply to avoid key collisions. But human friendly key spaces grow much, much faster and I don’t think many people have looked too hard at that.
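To put rough numbers on that gap, here is a small sketch comparing the minimum bits needed to tell n keys apart with the size of a human-friendly key. The metric key format below is made up for illustration and is not taken from any particular client library.

```python
def min_bits_per_key(n_keys: int) -> int:
    """Fewest bits that can give n_keys distinct identifiers: ceil(log2(n))."""
    return max(1, (n_keys - 1).bit_length())


# Hypothetical human-friendly metric keys, one per namespace.
keys = [
    f"http_requests_total{{namespace=team-{i},method=GET}}"
    for i in range(1024)
]

ideal_bits = min_bits_per_key(len(keys))
actual_bits = max(len(key.encode("utf-8")) for key in keys) * 8

print(f"{len(keys)} keys: {ideal_bits} bits needed, up to {actual_bits} bits used per key")
```

The readable keys spend most of their bytes repeating the same tag names, which is the part the Prometheus client change mentioned below removes from the keys.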

There were recent changes to the NodeJS Prometheus client that eliminate tag names from the keys used for storing the tag cardinality for metrics. The memory savings weren’t reported but the CPU savings for recording data points were over 1/3. And about twice that when applied to the aggregation logic.

Lookups are rarely O(1), even in hash tables.

I wonder if there’s a general solution for keeping names concise without triggering transposition or reading comprehension errors. And what the space complexity is of such an algorithm.

vlovich123 · 7mo ago

Why aren’t let’s just 128bit UUIDs? Those are guaranteed to be globally unique and don’t require so much space.

hinkley · 7mo ago

Why aren’t what 128bit UUIDs?

> keeping names concise without triggering transposition or reading comprehension errors.

Code that doesn’t work for developers first will soon cease to work for anyone. Plus how do you look up a uuid for a set of tags? What’s your perfect hash plan to make sure you don’t misattribute stats to the wrong place?

UUIDs are entirely opaque and difficult to tell apart consistently.

liampulles · 7mo ago · 1 in thread

I'm a little surprised that it got to the point where pods which should consume a couple MB of RAM were consuming 4GB before action was taken. But I can also kind of understand it, because the way k8s operators (apps running in k8s that manipulate k8s resources) are meant to run is essentially a loop of listing resources, comparing to spec, and making moves to try and bring the state of the cluster closer to spec. This reconciliation loop is simple to understand (and I think this benefit has led to the creation of a wide array of excellent open source and proprietary operators that can be added to clusters). But it's also a recipe for cascading explosions in resource usage.
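A minimal sketch of one pass of such a reconciliation loop, using plain dicts in place of a real cluster. The `reconcile` function and the name-to-replica-count model are assumptions for illustration, not any operator framework's API.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[Tuple[str, str]]:
    """One pass: walk the spec, compare to actual state, and fix differences.

    Both dicts map a resource name to a replica count. `actual` is mutated
    in place and the list of actions taken is returned.
    """
    actions: List[Tuple[str, str]] = []
    for name, replicas in desired.items():
        if name not in actual:
            actual[name] = replicas
            actions.append(("create", name))
        elif actual[name] != replicas:
            actual[name] = replicas
            actions.append(("update", name))
    for name in list(actual):  # copy the keys so deletion during iteration is safe
        if name not in desired:
            del actual[name]
            actions.append(("delete", name))
    return actions


desired = {"web": 3, "worker": 2}
actual = {"web": 1, "stale-job": 1}

first_pass = reconcile(desired, actual)   # converges the state
second_pass = reconcile(desired, actual)  # nothing left to do

print(first_pass)
print(second_pass)
```

Note that the second pass still walks every object even though nothing changed. Multiply that by many operators and many resources and you get the pressure on the API described here.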

These kinds of resource explosions are something I see all the time in k8s clusters. The general advice is to always try and keep pressure off the k8s API, and the consequence is that one must be very minimal and tactical with the operators one installs, and then engage in many hours of work trying to fine tune each operator to run efficiently (e.g. Grafana, whose default helm settings do not use the recommended log indexing algorithm, and which needs to be tweaked to get an appropriate set of read vs. write pods for your situation).

Again, I recognize there is a tradeoff here - the simplicity and openness of the k8s API is what has led to a flourish of new operators, which really has allowed one to run "their own cloud". But there is definitely a cost. I don't know what the solution is, and I'm curious to hear from people who have other views of it, or use other solutions to k8s which offer a different set of tradeoffs.

never_inline · 7mo ago

> are meant to run is essentially a loop of listing resources, comparing to spec, and making moves to try and bring the state of the cluster closer to spec.

Aren't they supposed to use watch/long polling?

timzaman · 7mo ago

7tib.. that's like 3 servers..
