I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).
The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.
No luck. After a week I had nothing to show for.
So I turned to Netdata. A one liner on each server and we had super sexy and fast dashboard for each server. No birds eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).
No luck. By the end of week 2 I still had nothing, but a bunch of servers shutting down during peak hours.
Week 3 I said fuck it I'll do the stupidest thing and write my own stack. A bunch of shell scripts, deployed via ansible, capturing any metric I could think of, managed by systemd, posting to a $5/month server running a single nodejs service that would do in memory (only) averages, medians etc, and trigger alerts (email, sms, Slack maybe soon) when things get yellow or red.
By week 4 we had monitoring for all servers and for any metric we really needed.
Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.
I really wanted to use something else, but I just couldn't :(