It's happened to me a few times.
This service is using 80% CPU, that seems a bit high... but is it always this high? Looks like it spiked within the last hour. But wait, it does that every Monday at 9 am, so probably a red herring.
This cache has a hit ratio of 60%... is that good? A bit low? Actually it's suspiciously high compared to last week - looks like a lot of people aren't getting a personalised feed.
Metrics are incredibly cheap to keep around relative to the value you get from a good operational dashboard, despite what Datadog, Amazon, or Grafana Cloud would have you believe. Hosted metrics are just the most egregiously overpriced data you can buy since 20-cent text messages.
A good start is to set up VictoriaMetrics with some collectors and set retention to 14 days.
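As a rough sketch of how little setup this takes (assuming Docker; the `-retentionPeriod` flag is from the VictoriaMetrics docs, the volume and port are the usual defaults):

```shell
# Single-node VictoriaMetrics with 14-day retention.
# Point your collectors (vmagent, node_exporter scrapes, etc.) at port 8428.
docker run -d --name victoria-metrics \
  -v vm-data:/victoria-metrics-data \
  -p 8428:8428 \
  victoriametrics/victoria-metrics \
  -retentionPeriod=14d
```

One binary, one volume, one flag for retention; that's the whole cost of keeping two weeks of metrics on your own box.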
Also, as your storage hits 97%+, you'll probably start seeing effects in your business metrics, and then you can look into it.
Real-time, high-precision metrics aren't necessary. But when you say that you don't need metrics and then say that you can poll metrics periodically, you're contradicting yourself.
Unless you want trends over time, either for capacity planning (needing to order more storage in the bare-metal case, or forecasting costs ahead) or to correlate with other events (storage consumption has been growing twice as fast since deployment X; did we change something there?).
You don't need to have 1s granularity metrics on storage consumption, but having none is just stupid levels of fake "optimisation" that will cost you more in the long run.
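To make the capacity-planning point concrete, here's a minimal sketch of what even coarse, once-a-day storage samples buy you: a linear fit that projects when the disk fills up. The function name and sample data are made up for illustration.

```python
def days_until_full(samples, capacity):
    """Project days until capacity is reached.

    samples: list of (day_index, bytes_used), assumed roughly linear.
    Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    xs = [day for day, _ in samples]
    ys = [used for _, used in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope: growth per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return None  # no projected fill date
    return (capacity - ys[-1]) / slope

# One sample per day is plenty; 1s granularity buys nothing here.
usage = [(day, 500 + 10 * day) for day in range(14)]  # GB, +10 GB/day
print(days_until_full(usage, capacity=1000))  # → 37.0 days until 1 TB is full
```

With no metrics at all, the first signal you get is the 97%-full incident; with fourteen cheap data points, you get a month of warning.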