This is anecdotal, but if my experiences aren't unique then there is a real lack of reasonableness in DevOps.
DevOps has - ever since its originally well-meaning inception (by Netflix iirc?) - been implemented across our industry as an effective cost-cutting measure, forcing devs that didn't see it as their job to also handle it.
Which consequently means they're not interfacing with it whatsoever. They do as little as they can get away with, which inevitably means things are being done with borderline malicious compliance... Or just complete incompetence.
I'm not even sure I'd blame these devs in particular. The devs just saw it as a quick bonus generator for the MBA in charge of this rebranding while offloading more responsibilities onto their shoulders.
DevOps made total sense in the work culture where this concept was conceived - Netflix was well known at that point to only ever employ senior devs. However, in the context of the average 9-5 dev, who often knows a lot less than even some enthusiastic Jrs... Let's just say that it's incredibly dicey whether it's successful in practice.
The service dashboards already existed, all I had to do was a bit of load testing and read the graphs.
It's not too much extra work to make sure you're scaling efficiently.
And they're a "Cloud Application Platform", meaning they manage deploys and infrastructure for other people. Their website says "Click, click, done.", which is cool and quick and all, but to me it's kind of crazy that an organization that should be really engineering-focused and mature doesn't immediately notice 1.2TB being used and try to figure out why, when 120GB ended up being sufficient.
It gives much more of a "We're a startup, we're learning as we're running" vibe which again, cool and all, but hardly what people should use for hosting their own stuff.
Sadly devs are incentivized by that and going towards the cloud might be a fun story. Given the environment I hope they scrap the effort sooner rather than later, buy some Oxide systems for the people who need to iterate faster than the usual process of getting a VM and replace/reuse the 10% of the company occupied with the cloud (mind you: no real workload runs there yet...) to actually improve local processes...
I wonder: if msft had simply cut dev salaries by 50% in the 90s, would it have had any measurable effect on Windows quality by today?
At this point, I wonder if instead of relying on daemonsets, you just gave every namespace a vector instance that was responsible for that namespace and pods within. ElasticSearch or whatever you pipe logging data to might not be happy with all those TCP connections.
Just my SRE brain thoughts.
Vector is a daemonset, because it needs to tail the log files on each node. A single vector per namespace might not reside on the nodes that each pod is on.
There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.
If you see a database query that takes 1 hour to run, and only touches a few GB of data, you should be thinking "Well, NVMe bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
The idea that anyone would accept a request to a website taking longer than 30ms (the time it takes for a game to render its entire world, including both the CPU and GPU parts, at 60fps) is insane, and nobody should really accept it, but we commonly do.
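The back-of-the-envelope check above can be written down as a tiny sketch. The data size and bandwidth below are hypothetical round numbers picked for illustration, not measurements; only the one-hour runtime comes from the comment.

```python
def ideal_scan_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read data_bytes once at the given bandwidth."""
    return data_bytes / bandwidth_bytes_per_s

GB = 10**9

# Hypothetical figures, chosen only to illustrate the comparison.
data = 3 * GB        # "a few GB" touched by the query (assumed)
nvme = 3 * GB        # assumed sequential read bandwidth, in bytes per second
observed = 3600.0    # the one-hour query from the comment, in seconds

ideal = ideal_scan_seconds(data, nvme)
slowdown = observed / ideal
print(f"ideal scan: {ideal:.2f} s, observed: {observed:.0f} s, ratio: {slowdown:.0f}x")
```

The ratio is not a verdict on its own, since a query can legitimately do more than one pass over the data, but a gap of three orders of magnitude is usually worth an explanation.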
https://en.wikipedia.org/wiki/Speed_of_light
Just as an example, round trip delay from where I rent to the local backbone is about 14 ms alone, and the average for a webserver is 53 ms. Just as a simple echo reply. (I picked it because I'd hoped that was in Redmond or some nearby datacenter, but it looks more likely to be in a cheaper labor area.)
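To put the speed-of-light point in numbers, here is a rough sketch of the physical floor on round-trip time. It assumes a straight path and light travelling at about two thirds of its vacuum speed in optical fiber (a common rule of thumb); the distances are hypothetical, and real routes are longer and add queuing and processing delay on top.

```python
C_VACUUM_KM_S = 299_792.458   # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3          # rule-of-thumb slowdown in optical fiber (assumption)

def min_rtt_ms(distance_km: float, factor: float = FIBER_FACTOR) -> float:
    """Lowest possible round-trip time over a straight path of distance_km."""
    one_way_s = distance_km / (C_VACUUM_KM_S * factor)
    return 2 * one_way_s * 1000

# Hypothetical distances for illustration.
for km in (100, 1000, 4000):
    print(f"{km:>5} km: at least {min_rtt_ms(km):.1f} ms round trip")
```

Under these assumptions a 30 ms round trip cannot span much more than about 3,000 km each way, before the server has spent any time on the request.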
However it's only the bloated ECMAScript (javascript) trash web of today that makes a website take longer than ~1 second to load on a modern PC. Plain old HTML, images on a reasonable diet, and some script elements only for interactive things can scream.
mtr -bzw microsoft.com
6. AS7922 be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1) 0.0% 10 12.9 13.9 11.5 18.7 2.6
7. AS7922 be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2) 0.0% 10 11.8 13.3 10.6 17.2 2.4
8. AS7922 2001:559:0:80::101e 0.0% 10 15.2 20.7 10.7 60.0 17.3
9. AS8075 ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a) 0.0% 10 41.1 23.7 14.8 41.9 10.4
10. AS8075 be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e) 0.0% 10 53.1 53.1 50.2 57.4 2.1
11. AS8075 2603:1060:0:10::f536 0.0% 10 82.1 55.7 50.5 82.1 9.7
12. AS8075 2603:1060:0:10::f3b1 0.0% 10 54.4 96.6 50.4 147.4 32.5
13. AS8075 2603:1060:0:10::f51a 0.0% 10 49.7 55.3 49.7 78.4 8.3
14. AS8075 2a01:111:201:f200::d9d 0.0% 10 52.7 53.2 50.2 58.1 2.7
15. AS8075 2a01:111:2000:6::4a51 0.0% 10 49.4 51.6 49.4 54.1 1.7
20. AS8075 2603:1030:b:3::152 0.0% 10 50.7 53.4 49.2 60.7 4.2
The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.
> There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried.
It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.
All of which have gotten perhaps an order of magnitude worse in the time since I started on this theory.
I'm curious, what're the profilers you know of that tried to be better? I have a little homebrew game engine with an integrated profiler that I'm always looking for ideas to make more effective.
The common element between attempts is new visualizations. And like drawing a projection of an object in a mechanical engineering drawing, there is no one projection that contains the entire description of the problem. You need to present several and let the brain synthesize the data missing in each individual projection into an accurate model.
These kinds of resource explosions are something I see all the time in k8s clusters. The general advice is to always try to keep pressure off the k8s API, and the consequence is that one must be very minimal and tactical with the operators one installs, and then spend many hours fine-tuning each operator to run efficiently (e.g. Grafana, whose default Helm settings do not use the recommended log indexing algorithm, and which needs to be tweaked to get an appropriate set of read vs. write pods for your situation).
Again, I recognize there is a tradeoff here - the simplicity and openness of the k8s API is what has led to a flourishing of new operators, which really has allowed one to run "their own cloud". But there is definitely a cost. I don't know what the solution is, and I'm curious to hear from people who have other views of it, or use alternatives to k8s which offer a different set of tradeoffs.
Aren't they supposed to use watch/long polling?
There were recent changes to the NodeJS Prometheus client that eliminate tag names from the keys used for storing the tag cardinality for metrics. The memory savings weren’t reported but the CPU savings for recording data points were over 1/3. And about twice that when applied to the aggregation logic.
Lookups are rarely O(1), even in hash tables.
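As a rough illustration of why dropping names from the keys helps, here is a toy sketch. It is not the Prometheus client's actual code; `key_with_names`, `key_values_only`, and `ToyCounter` are hypothetical names invented for this example. Building and hashing a string key costs time proportional to its length, so a shorter key means less work on every recorded data point.

```python
from collections import defaultdict

def key_with_names(labels: dict[str, str]) -> str:
    """Key that repeats every label name alongside its value."""
    return ",".join(f"{name}:{labels[name]}" for name in sorted(labels))

def key_values_only(label_names: tuple[str, ...], labels: dict[str, str]) -> str:
    """Key built from values alone, in a fixed name order known to the metric.

    Assumes values never contain the separator; otherwise two different
    label sets could collide.
    """
    return ",".join(labels[name] for name in label_names)

class ToyCounter:
    """Toy counter keyed by label values; names are stored once per metric."""

    def __init__(self, label_names):
        self.label_names = tuple(sorted(label_names))
        self.values = defaultdict(float)

    def inc(self, labels, amount=1.0):
        self.values[key_values_only(self.label_names, labels)] += amount

labels = {"method": "GET", "route": "/users/:id", "status": "200"}
long_key = key_with_names(labels)
short_key = key_values_only(("method", "route", "status"), labels)
print(len(long_key), len(short_key))
```

Because the name order is fixed per metric, the values-only key still identifies the label set unambiguously, as long as the separator assumption in the comment holds.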
I wonder if there’s a general solution for keeping names concise without triggering transposition or reading comprehension errors. And what the space complexity is of such an algorithm.
> keeping names concise without triggering transposition or reading comprehension errors.
Code that doesn’t work for developers first will soon cease to work for anyone. Plus how do you look up a uuid for a set of tags? What’s your perfect hash plan to make sure you don’t misattribute stats to the wrong place?
UUIDs are entirely opaque and difficult to tell apart consistently.