I really want to like it, it’s just so _easy_, publish a little webpage with your metrics and Prometheus takes care of the rest. Lovely.
But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives (influxdb has 1s and even Zabbix has 5s).
Not to mention the lost writes (missing data points) which have no logged explanation.
All of this, however, was in my homelab, which, while unconstrained in resources, lacks a lot of the fit and finish of a prod system.
I also take pause with the architecture; it’s not meant to scale. It’s written on the tin so it’s not like I’m picking fault, but when you’re building a dashboard that sucks in data from 25 different Prometheus data sources, it becomes difficult to run functions like SUM(), because the keys may be out of sync causing some really ugly and inaccurate representations of data.
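To make the alignment problem concrete, here's a minimal sketch (illustrative values, not real Prometheus data): two sources scraped on offset schedules produce timestamps that never line up, so a naive per-timestamp sum emits partial totals. One workaround is to align every series to a common grid with last-value carry-forward and only emit a point once every source has reported.

```python
# Two sources scraped on offset schedules: their timestamps never match.
series_a = {0: 10, 15: 12, 30: 11}   # source A, scraped at t=0, 15, 30
series_b = {5: 20, 20: 22, 35: 21}   # source B, scraped at t=5, 20, 35

def align_sum(series_list, grid):
    """Sum series on a common grid using last-value carry-forward.

    A point is only emitted when every source has reported at least
    once, so we never produce a partial (and misleading) total.
    """
    total = {}
    for t in grid:
        vals = []
        for s in series_list:
            past = [ts for ts in s if ts <= t]
            if past:
                vals.append(s[max(past)])
        if len(vals) == len(series_list):
            total[t] = sum(vals)
    return total

aligned = align_sum([series_a, series_b], grid=range(0, 40, 5))
# No point at t=0 (source B hasn't reported yet); t=5 onward sums both.
```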
Everything about the design (polling, single database) tells me that it was designed primarily to sit alongside something small. It could never handle the tens of millions of data points per second that I ingest(ed) at my (now previous) job.
But it has a lot of hype, and maybe I’m holding it wrong.
Prometheus is designed to be "functionally sharded". You shouldn't be running one "mega prometheus". Often it's something like 1 Prometheus per-team, depending on the amount of metrics each produces.
You can use federation at lower resolution, or one of the distributed setups (Thanos/Cortex) if you want to avoid the extra storage or lower resolution that federation entails.
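For reference, federation is just another scrape job pointed at a downstream Prometheus's `/federate` endpoint. A minimal sketch (the hostname and `match[]` selector are placeholders; adjust to your own jobs):

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s        # lower resolution than the source Prometheus
    honor_labels: true          # keep the original labels from the scraped series
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-pods"}'   # only pull the series you actually need upstream
    static_configs:
      - targets:
          - 'team-a-prometheus:9090'
```

The `match[]` selector is what keeps federation sane: you pull only aggregated or high-value series upstream rather than mirroring everything.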
> But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives
Not to distract, but I think you meant resolution, not cardinality. Cardinality is the metadata like labels/dimensions. Resolution is the granularity in time.
- https://www.robustperception.io/evaluating-performance-and-c...
- https://medium.com/@valyala/evaluating-performance-and-corre...
Being able to also control which metrics are important to my team vs the wider team is a BIG bonus of this sort of decentralised system.
Biggest issue I've had was an app that was accidentally publishing several thousand metrics which caused the default scrape timeout of 15s to kick in.
(It was publishing Kafka lag per consumer group per topic, which was fine and dandy, until someone released an app that runs about 500 instances at peak, and scaled up and down frequently, and had incorporated the pod id into the consumer group names, which led to Kafka tracking many, many, many consumer groups. Given that the consumers were low value anyway, we now just exclude them from having their lag tracked.)
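One way to handle a case like that is a `metric_relabel_configs` drop rule on the scrape job, so the offending series are discarded at ingest time. A hedged sketch, assuming the common kafka_exporter metric/label names and an illustrative pod-id pattern in the group name:

```yaml
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']   # placeholder target
    metric_relabel_configs:
      # Drop lag series for ephemeral per-pod consumer groups.
      # source_labels are joined with ';' before the regex is applied.
      - source_labels: [__name__, consumergroup]
        regex: 'kafka_consumergroup_lag;.*-pod-.*'
        action: drop
```

This keeps the exporter's other metrics intact while excluding just the low-value, high-cardinality groups.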
Prometheuses.
The -ii plural is for Latin words. Prometheus is/was Greek. I guess you could use Prometheoí but it would quickly derail any conversation. :)
Really? Recently we've been playing with Chronograf with InfluxDB and most people find it a lot nicer to work with than Grafana (specifically because it makes discoverability a lot nicer).
FWIW I've had similar issues with MySQL backed Zabbix before.
But woe betide the team that has to run it as a service. Not that other metrics systems are better but Prometheus can be brutal in that space.
As a ‘squad level’ tool it’s really good. After that it gets hairy fast.
For the time being this is a "full team working full 40 hour weeks for year(s)" problem, so I'd be shocked to see it done open source.
BTW, I'm working on VictoriaMetrics - open source monitoring solution that works out of the box. See https://github.com/VictoriaMetrics/VictoriaMetrics
Coming from a monitoring system that supports push and pull with elegant auto-discovery, we're struggling to work out a sane architecture around (effectively pull-only) Prometheus.
We're still a bit stuck trying to replicate all the make-life-easy functionality we get with Zabbix sitting on a honking great PostgreSQL / Timescale database, with a bunch of proxies, and automated agent installs that auto-register.
There are places where that doesn't work well (k8s, for example), but for conventional fleet metrics it's difficult to abandon.
I expect we won't outright replace it, but rather augment it, especially in spaces where a host-centric tool like Zabbix isn't ideal.
Partly it's driven by a need to monitor things like k8s (in the form of openshift) and pub/sub systems (eg kafka), and to integrate with other data sources (eg elastic).
Possibly more compelling is the need to do more sophisticated things with our data than we can conveniently accomplish with the Zabbix data store -- it's not the DB performance or scalability (PostgreSQL and optionally TimescaleDB) so much as dealing with the schema. Mildly sophisticated wrangling of our data ranges from difficult to impossible.
There are a couple of ways around that: bespoke tooling to facilitate ad hoc interrogations of the DB, duplicating the data at ingest time into multiple datastores, or frequent ETL of the Zabbix SQL data into long-term (time series) storage. None of these are great options. Plus we're fans of Grafana, so some of our decisions are, and will be, based around maintaining or improving the end-user experience of that tool -- and while the Zabbix integration is excellent, the Prometheus integration is even better, so (on the end-user side) that's a highly compelling path.
It sounds like you're looking for exactly something like that. In fact, I've given a talk about a similar scenario at KubeCon San Diego: https://www.youtube.com/watch?v=FrcfxkbJH20
Disclosure: I work on Thanos and Thanos Receiver which implements that protocol.
The caveat is that I have no metrics when the laptop is offline but that doesn't happen very often anyway.
[1] https://victoriametrics.github.io/vmagent.html
[2] https://prometheus.io/docs/operating/integrations/#remote-en...
It depends on how high fidelity you're talking but in my experience retaining these metrics can be valuable, not only for viewing seasonal trends already mentioned in another reply but for debugging problems. It can be helpful to be able to view prior events and compare metrics at those times to a current scenario, for example as a part of a postmortem analysis. I do agree that the usefulness of old metrics falls off with time. Metrics issued from a system 3 years ago likely have little in common with the system running today.
Depends on the metric IMO. There's a ton of use you can get out of forecasting and seasonality for anomaly detection, but you need historical data for that to have any chance. Many relevant operations metrics exhibit three levels of seasonality: daily (day/night), weekly (weekday/weekend), and annual (holidays, Super Bowls, media events). Being able to forecast inbound network traffic on a switch to find problems would effectively require you to have a year of data. You _might_ be able to discard some of the data, but you'd lose some of the predictive capacity for, say, the Super Bowl.
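The simplest version of that idea is a seasonal-naive baseline: predict that the next cycle looks like the last one, then flag points that deviate far from the prediction. A minimal sketch with synthetic daily-cycle traffic (all names and values are illustrative):

```python
import numpy as np

def seasonal_naive_forecast(series, period, horizon):
    """Forecast by repeating the last full seasonal cycle."""
    last_cycle = series[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

# Two weeks of hourly traffic with a clean 24h cycle.
hours = np.arange(24 * 14)
traffic = 100 + 50 * np.sin(2 * np.pi * hours / 24)

# Forecast the next day, then compare actuals against it:
forecast = seasonal_naive_forecast(traffic, period=24, horizon=24)
residual = np.abs(traffic[-24:] - forecast)  # large residuals would flag anomalies
```

Real operational data needs the weekly and annual components layered on top (which is exactly why a year of history matters), but the mechanism is the same: forecast, subtract, alert on the residual.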
I didn’t know the age of the project, because I hadn’t heard of it. That’s why I go on to say that in actuality it has a ton of adoption and I’ve had a great experience with it.
However I’m still trying to nail down my high cardinality/highly unique metrics-like data story. What are people using?
I’ve heard a combination of Cassandra/BigTable and Spark as a potential solution?
https://medium.com/@valyala/measuring-vertical-scalability-f...
(Disclaimer: I work at Timescale)
They talk a lot about collaborative troubleshooting, and the user interface reflects that. It's actually fun (?!) to drill down from heatmaps to individual events with Honeycomb's little comparison charts lighting the way.
Currently I’m in AWS land and Athena has been mostly working for what I need but I haven’t really pushed it that hard yet.
We have a plan to split it down to fewer instances per node, but it's worked well enough so far.