Interesting, I have a very small team that maintains a lot of infrastructure including Prometheus. Setting up the monitoring for 5 datacenters, thousands of hosts and hundreds of services was maybe a ~2 weeks effort. We then organically add and remove things but it is maybe a ~4h/week effort.
It give us a lot of value. We were using other solutions before (sensu/nagios) and it is night and day.
All the plugins that we use are listed in the official Prometheus website: https://prometheus.io/docs/instrumenting/exporters/ we haven't looked outside this page.
--
The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^
Yeah this kind of visibility is really lacking on-prems. Storage is something that is hard, at scale, to see what is using what.
--
Do you know how many metrics you are ingesting in prometheus? storage size? and how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, was forced to cut down some metrics and stick to the absolute minimum tags.
We have around 500 GB of disk dedicated to prometheus server datacenter with a retention policy of 1 month. The VMs have around 16 GB of RAM.
I would say 95% of the hosts only have node exporter. Then 5% of the hosts will have redis/elasticsearch/postgres/mysql/etc exporters.
Also probably around 5% of the hosts have some sort of custom metrics. We have been leveraging the file exporter for writing custom service metrics.
Custom service metrics goes through a merge request process where we see the best way to structure them to avoid big cardinality. We also use recorded rules to pre-compute things that would be expensive to query (think CPU usage across a datacenter)
1 TB of RAM usage sounds insane. It looks like we can horizontally scale the prometheus servers and use some documented features for that.
For our use case it has been a very smooth ride so far!