Do monitoring tools still miss early signals before incidents?

3 points · gabdiax · 3mo ago · 5 comments

I'm curious how teams detect early signals of infrastructure problems before they turn into incidents.

In some environments I've seen cases where monitoring alerts arrive only after the system is already degrading.

Examples:
- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing

Tools like Datadog, Zabbix, Prometheus etc. are great for alerts, but they still feel mostly reactive.

How do you deal with this in your infrastructure?

Do you rely more on:
- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?

I'm trying to understand what actually works in real-world environments.

5 comments

zippyman55 · 3mo ago

My team was responsible for the system administration of a large-scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process controls for job aborts, and reported the results as mean time to failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs. What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else’s bad code. It was easy to show that a job’s failure rate was increasing and link it to a recent upgrade or change. But I felt I was often just shining the flashlight at an issue and trying to encourage a responsible party to take ownership.
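To make the idea of control limits on job aborts concrete, here is a minimal sketch of an individuals (XmR) chart in Python. It is not the commenter's actual system (which was built in R); the function names, the weekly abort counts, and the choice of an XmR chart are all assumptions made for illustration.

```python
from statistics import mean


def xmr_limits(baseline):
    """Return (center, lower, upper) limits for an individuals (XmR) chart.

    `baseline` is a list of abort counts per period (hypothetical data)
    taken from a stretch of time considered normal.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    center = mean(baseline)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    # 2.66 is the usual XmR constant (3 / 1.128) applied to the mean moving range.
    spread = 2.66 * mean(moving_ranges)
    # Counts cannot be negative, so the lower limit is clipped at zero.
    return center, max(0.0, center - spread), center + spread


def signals(baseline, recent):
    """Return indexes of `recent` observations that fall outside the limits."""
    _, lower, upper = xmr_limits(baseline)
    return [i for i, x in enumerate(recent) if x > upper or x < lower]


if __name__ == "__main__":
    # Hypothetical weekly abort counts for one program.
    normal_weeks = [2, 3, 1, 2, 4, 2, 3, 2, 1, 3]
    after_upgrade = [3, 2, 7, 8]
    print(xmr_limits(normal_weeks))
    print(signals(normal_weeks, after_upgrade))
```

Run per program, a check like this would give one list of out-of-limit periods for each job, which could then be lined up against upgrade or change dates.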

gabdiax (OP) · 3mo ago

That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.

In your experience, were there usually early signals in metrics before job failures increased? For example, patterns such as latency changes, resource saturation, or network anomalies.

I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.

zippyman55 · 3mo ago

For the mean time to failure, I based it on a section of Mastering Statistical Process Control by Tim Stapenhurst, specifically the section on using SPC to measure earthquakes, etc. The system worked pretty well and ran for years. Using R, I built a free system to monitor all the job schedule information for our HPC systems. I’d present the most egregious information in the form of a daily Pareto chart, and I’d attempt to shame the code owners when they appeared at the top of it. But mostly, I just did not want people falling back on their go-to excuse of blaming the system administrators when it was really their recent code update.

There were other SPC charts where one could drill down and look at job run times, or which nodes the jobs ran on, etc. But working the culture to get people to be responsible for their applications was a little out of my wheelhouse, and always a challenge. For those few people who really embraced their application ownership and wanted to make sure things ran well, that was always nice. It was good to be able to say something like, “your job used to crash 3 times a year and now it seems to be crashing 6 times a year.” At least we would have a good point from which to discuss potential causes. I know some of the developers got sucked into tools like Splunk, but to me that was always cost prohibitive for our budget and our volume of data.

Answering your question about “early signals in metrics before job failures increased”: the mean time to failure SPC chart would show a job failure signature, and if there were problem nodes or problems with a software update, that would become apparent and allow further investigation. The other SPC charts, like job run time, would show things like increased job run time, etc. But that was pretty basic stuff (and lots of tools can do it), such as a user generating a daily tar file that grew over time and eventually filled up a file system.

But getting people to take action always seemed so hard.
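The growing tar file case is the kind of slow trend a simple extrapolation can catch before a usage threshold fires. Below is a rough Python sketch that fits a least-squares line to daily usage samples and estimates the days left until the file system is full. The function name, the sample values, and the assumption of roughly linear growth are all hypothetical; this is not something described in the thread.

```python
def days_until_full(used_gb, capacity_gb):
    """Estimate days from the last sample until usage reaches capacity.

    `used_gb` holds one usage sample per day, oldest first (hypothetical
    data). Returns None when the fitted trend is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    # Ordinary least-squares fit of usage against day index 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = sum(used_gb) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(used_gb))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index at which the fitted line crosses capacity.
    day_full = (capacity_gb - intercept) / slope
    # Express the result relative to the most recent sample.
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    # Hypothetical file system: 200 GB capacity, growing about 10 GB per day.
    print(days_until_full([100, 110, 120, 130, 140], capacity_gb=200))
```

Alerting on the estimated days remaining, rather than on a fixed percentage used, gives more lead time on fast-growing file systems and less noise on full but stable ones.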

gabdiax (OP) · 3mo ago

One thing I'm particularly curious about is whether teams see early signals in metrics or logs before incidents actually happen.

For example:
- unusual latency patterns
- slow resource saturation
- network anomalies

Do people actively monitor these patterns or mostly rely on threshold alerts?
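As one illustration of monitoring a pattern rather than a fixed threshold, the sketch below applies an EWMA control chart (a standard SPC technique for small, gradual shifts) to latency samples. The function name, the smoothing factor of 0.2, the 3-sigma width, and the latency values are assumptions chosen for the example, not settings anyone in the thread reported using.

```python
from math import sqrt
from statistics import mean, stdev


def first_ewma_signal(baseline, recent, lam=0.2, width=3.0):
    """Return the index of the first `recent` sample at which the EWMA
    leaves its control limits, or None if it never does.

    `baseline` is a list of samples from a period considered normal.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    z = mu  # the EWMA starts at the baseline mean
    for t, x in enumerate(recent, start=1):
        z = lam * x + (1 - lam) * z
        # Time-varying limit: narrower for the first few samples.
        limit = width * sigma * sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        if abs(z - mu) > limit:
            return t - 1
    return None


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    normal = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
    drifting = [101, 102, 103, 104, 105, 106, 107, 108]
    static_threshold_ms = 120  # a typical fixed alert level, also hypothetical
    print(first_ewma_signal(normal, drifting))
    print(any(x > static_threshold_ms for x in drifting))
```

In this made-up series no sample comes near the fixed 120 ms threshold, yet the smoothed value drifts outside its limits within a few samples, which is the gap between threshold alerts and trend monitoring that the question is asking about.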
