I work in IT and I'm a geek, so I tried a few monitoring systems and wrote two myself.
Then I realized that I have self-sustaining, 24/7 monitoring agents: wife and children.
I gave up trying to have the right stack and now just wait for them to yell.
Seriously: it works great, and it made me wonder WHY I was trying to monitor at all. Turns out it's more about the fun and the discovery of tools than a real need at home.
When they were young they were definitely not self-sustaining.
As teenagers they now live on food (either provided when it meets their standards, or the one they cook themselves), water and wi-fi.
I've not found it too hard to stay within the limits of the free tier. The 10 dashboards limit is the main one that actually constrains me, but I just put more stuff on each dashboard and live with the scrolling. The free retention is not great but it's good enough for my purposes.
Also, 14 days of retention is not useful at home: I want temperature and power stats from last winter, not from the last two weeks.
Even the "first paid" tier offers only 13 months of retention.
I just used the VictoriaMetrics all-in-one binary for home stuff, plus Grafana for visualisation.
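A minimal sketch of that setup, assuming the single-node victoria-metrics binary with its built-in Prometheus-compatible scraper (the target address is made up; flag syntax can differ between versions):

```yaml
# scrape.yml, passed to victoria-metrics via -promscrape.config
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["192.168.1.10:9100"]  # node_exporter on a home server (example)
```

Start it with something like `./victoria-metrics-prod -retentionPeriod=2y -promscrape.config=scrape.yml`, then point a Prometheus data source in Grafana at `http://<host>:8428`.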
If the poster hosted those services on a single-node k3s or something, the kube-prometheus-stack Helm chart can deploy a lot of those tools easily.
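The chart is driven by a values file; a hedged sketch of the shape (the key names are from the chart's values, but the retention figure is just an illustrative example), installed with something like `helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml`:

```yaml
# values.yaml for kube-prometheus-stack (sketch)
grafana:
  enabled: true            # bundled Grafana with pre-built dashboards
prometheus:
  prometheusSpec:
    retention: 90d         # keep metrics longer than the chart default
```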
This. It can be fun to learn, although I've done that already, got the t-shirt (literally, from a conference).
But I did similar stuff for work so I already had the skills.
...and also for one of my side projects, OSRBeyond.
It's easy to get overwhelmed by all the moving pieces, but it's also a lot of _fun_ to set up.
Exactly my thoughts! Isn't there something (open source and as good as Prometheus+Grafana) that doesn't have as many moving parts as the stack used by OP? I can imagine there are many use cases for that: from side projects (homelabs) to small startups that don't have huge distributed systems, but still need monitoring (without relying on third-parties).
Ideally, my setup would be:
- install an agent on each server I'm interested in gathering metrics from. In this regard, Prometheus works just fine
- one service to handle logs/metrics/traces ingestion and that allows you to search and visualize your stuff in nice dashboards. Grafana works, but it doesn't support logs and traces out of the box (you need Loki for that)
So, basically 2 pieces of software (if they can be installed by just dropping a binary, even better)
Vector[1] would work as the agent, being able to collect both logs and metrics. But the issue would then be storing them. I assume the Elastic Stack might be able to do both, but it's just too heavy to deal with in a small setup.
A couple of months ago I took a brief look at that when setting up logging for my own homelab (https://pv.wtf/posts/logging-and-the-homelab), mostly looking at memory usage to fit it on my Synology. Quickwit[2] and Log-Store[3] both come with built-in web interfaces that reduce the need for Grafana, but neither of them does metrics.
[1] https://vector.dev
[2] https://quickwit.io/
[3] https://log-store.com/
Alternatives could be other general purpose databases.
Telegraf has some log parsing/extraction functionality, but for something more generic, promtail + Loki would be better.
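A minimal promtail sketch for tailing plain log files into Loki (the paths and the Loki URL here are assumptions to adjust):

```yaml
# promtail-config.yaml (sketch)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml   # where promtail remembers read offsets
clients:
  - url: http://localhost:3100/loki/api/v1/push   # Loki push endpoint
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log   # glob of files to tail
```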
Grafana doesn't support anything out of the box by that logic. Before you get any viz in Grafana you have to add a data source, c'mon.
Good luck! It's a lot.
It supports Prometheus querying and a few other formats for ingestion, so any knowledge about "how to get data into Prometheus" applies pretty much 1:1, and their own vmagent is pretty advanced. Not related to the company in any way, just a happy user.
I'd love your feedback on how this process could be easier for me, some resources on learning the Grafana query languages, and general comments.
Thanks for taking the time to read + engage!
* ZFS pool errors. Motivator: one of my HDDs failed and it took me a few days to notice. The pool (raidz1) kept chugging along of course.
* HDD and SSD SMART errors
* High HDD and SSD temperatures
* ZFS pool utilization
* High CPU temperature. Motivator: one of my case fans failed and it took a while for me to notice.
* High GPU temperatures. Motivator: I have two GPUs in my tower, one of which I don't really monitor (used for transcoding).
* High (sustained) CPU usage. I track this at the server level, rather than for individual VMs.
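The temperature-style alerts above can be expressed as Prometheus alerting rules; a hedged sketch assuming node_exporter's hwmon collector is exposing `node_hwmon_temp_celsius` (the chip regex and the 85°C threshold are arbitrary examples):

```yaml
# alert-rules.yml (sketch)
groups:
  - name: home-hardware
    rules:
      - alert: HighCPUTemperature
        expr: node_hwmon_temp_celsius{chip=~".*coretemp.*"} > 85
        for: 10m   # only fire if sustained, to ignore brief spikes
        annotations:
          summary: "CPU temperature above 85C for 10 minutes"
```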
You can run numbers manually but I think designing for it up front is really important to keep performance targets on lock. That's where Prometheus and Grafana come in. And I think looking at performance numbers is a really good way to help understand systems dynamics and helps you ask why something is hitting some threshold. On the other hand, there are so many tools and they're often fun to play with, it's easy to get carried away. There's also a pretty reasonable amount of complexity involved in setting it up, so it's also easy to just say fuck it a lot of times and respond to issues on demand instead.
[1] http://k6.io/, it's also a Grafana project.
[2] It can test not only normal REST endpoints but also browsers, thanks to headless Chrome/Chromium! So you can actually look at first-paint latency and things like that too.
Zabbix has been quite solid and has lots of templates for different servers (Linux, Windows, etc.) and triggers, and it can also monitor Docker containers (although I never tried that).
The only thing Zabbix can't do well is log file monitoring, so I am considering something like an ELK stack as an addition.
I cannot find my way around the Zabbix web interface either, and the templates, rules, and macros system confused me, deeply.
On the other hand, we have a Prometheus + Grafana stack for another system and the model makes all the sense to me. I guess there is something about time series and graph plotting that just clicks with me.
1: https://docs.timescale.com/api/latest/hyperfunctions/time_bu...
[1] https://docs.timescale.com/use-timescale/latest/time-buckets...
I collect logs with Vector on each instance and send them to a central ClickHouse, which Metabase reads from.
Used this tutorial:
https://clickhouse.com/docs/en/integrations/vector
My services usually produce around 2 GB of log data per day. From a quick read of the ClickHouse docs I believe it should not be a problem. I'm not sure how big the database is, but zip-compressed log data is around that size for an entire month.
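The Vector side of that pipeline can be sketched like this; `clickhouse` is a real Vector sink type, but the log path, endpoint host, and database/table names here are placeholders:

```yaml
# vector.yaml (sketch)
sources:
  app_logs:
    type: file
    include:
      - /var/log/myapp/*.log            # placeholder path
sinks:
  ch:
    type: clickhouse
    inputs: [app_logs]
    endpoint: http://clickhouse.internal:8123   # placeholder host
    database: logs
    table: app_logs
```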
I tried Loki around v1.0 and it didn't seem to offer much back then...
I don't like its own UI, but there's no need to use it, and it can easily gather metrics from systemd services and containers.
https://video.nstr.no/w/hjTH3Vggn2fvpTrQitMmVP
I would like to set up Grafana and more monitoring as well, on some of my other machines. But for now this is what I have :D
I've found that (pre)configuring Grafana without clicking around in it is pretty difficult.
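For what it's worth, Grafana does have file-based provisioning for at least data sources and dashboards; a sketch of a data-source file dropped into `/etc/grafana/provisioning/datasources/` (the Prometheus URL is a placeholder):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090   # placeholder
    access: proxy
    isDefault: true
```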
- monitoring sql databases with basic sql queries
- monitoring host cpu, ram and disk usage
- monitoring docker containers
- and being able to monitor all of this through ssh tunnels because not all my services are on the internet
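The ssh-tunnel part can live in `~/.ssh/config` so the monitoring side just talks to localhost; a sketch where the host name and ports are made up:

```
# ~/.ssh/config (sketch)
Host media-box
    HostName media-box.example.net
    LocalForward 19090 localhost:9090   # remote exporter exposed on local port 19090
```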
For my use case, a home media server, Netdata turned out to be way simpler to set up, and, most importantly, way less of a hassle/dink-around. It's a basic plug-and-play operation with auto-discovery. While the dashboard isn't nearly as beautiful or configurable, it gets the job done and provides pretty much everything I need or want. It offers a quick overview, historical metrics (over a year of data) to analyze trends or spot potential issues, and push/email notifications if something goes awry.
If you decide to go down this route, there are two major items:
1. You'll need to configure the dbengine[1] database to save and store historical metric data. However, I found the dbengine configuration documentation to be a bit confusing, so I'll spare you the trouble - just use this Jupyter Notebook[2]. If needed, adjust the input, run it, scroll down, and you'll see a summary of the number of days, the maximum dbengine size, and the yaml config, which you can copy, paste, and voila.
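The yaml the notebook emits ends up in `netdata.conf`; a rough sketch of the shape (the sizes here are arbitrary examples, not the notebook's output, and option names have shifted between Netdata versions, so prefer what the notebook gives you):

```
[db]
    mode = dbengine
    storage tiers = 3
    dbengine multihost disk space MB = 4096   # example tier-0 on-disk budget
```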
2. If you're hoarding data, you'll probably want to set up smartmontools/smartd[3] in a separate Docker container for better disk monitoring metrics. However, I think you can enable hddtemp[4] with Netdata through the config if you don't want or need the extra hassle. You can have Netdata query this smartd container, but with a handful of disks it ends up timing out frequently, so I found it's best to simply set up smartd/smartd.conf to log the smartd data independently. Then all you need to do is tell Netdata where to find the smartd_log[5], and Netdata handles the rest.
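A `smartd.conf` sketch for the "log it out independently" approach; the device names are examples, and `-A` is smartd's directive for writing attribute CSVs, which is what Netdata's smartd_log collector reads:

```
# /etc/smartd.conf (sketch)
/dev/sda -a -d sat -A /var/log/smartd/   # write attribute CSVs here
/dev/sdb -a -d sat -A /var/log/smartd/
```

Then point Netdata's smartd_log collector at `/var/log/smartd/`.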
Boom, home media server metrics with historical data, done. It still takes a bit of time to set up, but way less than Grafana. Anywho, hopefully, this saves you from wasting as much time as I did. And if you're looking for a smartd reference, shoot me a reply, and I'll tidy up and share my Docker config/scripts and notes.
[1] https://learn.netdata.cloud/docs/typical-netdata-agent-confi... [2] https://colab.research.google.com/github/andrewm4894/netdata... [3] https://www.smartmontools.org/wiki [4] https://github.com/vitlav/hddtemp [5] https://learn.netdata.cloud/docs/data-collection/storage,-mo...
I don't know if it has the same features or not, but it looks like you can set it up yourself.
Aligning metric endpoints for fine-tuning.
Add tracing to it in a few more clicks