Systems Monitoring with Prometheus and Grafana (opens in new tab)

(flightaware.engineering)

198 pointsjsulak5y ago104 comments

104 comments

75 comments · 11 top-level

halfmatthalfcat5y ago· 13 in thread

Prometheus and Grafana are awesome, use them personally for all my monitoring.

However I’m still trying to nail down my high cardinality/highly unique metrics-like data story. What are people using?

I’ve heard a combination of Cassandra/BigTable and Spark as a potential solution?

latchkey5y ago

I found this interesting. My plan is to move from Prom to Victoria.

https://medium.com/@valyala/measuring-vertical-scalability-f...

akulkarni5y ago

Just a heads up, this is an old comparison (over 1 year ago) that hasn't been updated since TimescaleDB now supports native compression. (Blog post references TimescaleDB 1.2.2, the product is now on 1.7.2).

sagichmal5y ago

Woof, good luck. Not a great product.

1 more reply

akulkarni5y ago

TimescaleDB is a long-term storage option for Prometheus metrics, has no problem with high-cardinality, and now natively supports PromQL (in addition to SQL) [0]

(Disclaimer: I work at Timescale)

[0] https://github.com/timescale/timescale-prometheus

chinhodado5y ago

I'm just starting to look into this and have a question. If I can export my metrics directly to TimescaleDB and it supports visualization with Grafana, is there any reason to go through Prometheus?

1 more reply

Legogris5y ago

I'd be curious to hear if anyone has done serious evaluation of high-cardinality use-cases of Victoriametrics.

chucky_z5y ago

I went from an Influx getting crushed to VM running in a container with 1/8th the resources and it works fine, 1.5m active cardinality. Could handle a lot more probably. Auto fill in Grafana breaks but oh well!

jdub5y ago

Honeycomb https://honeycomb.io/ is laser focused on this stuff. They built their own datastore (similar to Druid but schemaless) so they could create the experience they were aiming for.

They talk a lot about collaborative troubleshooting, and the user interface reflects that. It's actually fun (?!) to drill down from heatmaps to individual events with Honeycomb's little comparison charts lighting the way.

Nihilartikel5y ago

I've used druid.io in the past and it had worked well, but it's a lot of trouble to set up and tune.. Haven't tried it, but clickhouse looks good and has approximate aggregations for high cardinality dimensions.

jpgvm5y ago

Druid truly is still king in this space. The setup has become less onerous over time. It handles arbitrarily high cardinality and dimensionality with ease and its support for sketching algorithms leaves other similar systems (especially Prometheus) in the dust.

jrott5y ago

Spark has worked decently for me if you need to be cloud agnostic.

Currently I’m in AWS land and Athena has been mostly working for what I need but I haven’t really pushed it that hard yet.

sagichmal5y ago

Just curious what your numbers are? Unique metrics, cardinality per metric, ingest rate, expected query ranges?

base6985y ago

I have an instance that scrapes about 30K targets for 15 million metrics and it works better than you'd expect. The biggest performance issue we have is rendering the targets page.

We have a plan to split it down to less instances per node but it's worked well enough so far.

dijit5y ago· 12 in thread

Grafana truly is best in class, but I have strong reservations about Prometheus.

I really want to like it, it’s just so _easy_, publish a little webpage with your metrics and Prometheus takes care of the rest. Lovely.

But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives (influxdb has 1s and even Zabbix has 5s).

Not to mention the lost writes (missing data points) which have no logged explanation.

All of this, however, was in my homelab, which, while unconstrained in resources lacks a lot of the fit and finish of a prod system.

I also take pause with the architecture; it’s not meant to scale. It’s written on the tin so it’s not like I’m picking fault, but when you’re building a dashboard that sucks in data from 25 different Prometheus data sources, it becomes difficult to run functions like SUM(), because the keys may be out of sync causing some really ugly and inaccurate representations of data.

Everything about the design (polling, single database) tells me that it was designed primarily to sit alongside something small. It could never handle the tens of millions of data points per second that I ingest(ed) at my (now previous) job.

But it has a lot of hype, and maybe I’m holding it wrong.

ecnahc5155y ago

> Erything about the design (polling, single database) tells me that it was designed primarily to sit alongside something small.

Prometheus is designed to be "functionally sharded". You shouldn't be running one "mega prometheus". Often it's something like 1 Prometheus per-team, depending on the amount of metrics each produces.

You can use federation at lower resolutions or a one of the distributed setups (Thanos/Cortex) if you want to avoid extra storage or lower resolution that federation entails.

> But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives

Not to distract, but I think you meant resolution, not cardinality. Cardinality is the metadata like labels/dimensions. Resolution is the granularity in the time.

AlphaSite5y ago

Doesn’t it also have cardinality issues?

2 more replies

Legogris5y ago

Victoriametrics[0] is API-compatible with Prometheus but also a horizontally scalable, distributed and persisting timeseries database (cf influxdb). Together with vmagent it essentially becomes a HA drop-in replacement (almost) for Prometheus.

[0]: https://github.com/VictoriaMetrics/VictoriaMetrics

dewey5y ago

These might be a good read if you are considering it:

- https://www.robustperception.io/evaluating-performance-and-c...

- https://medium.com/@valyala/evaluating-performance-and-corre...

ashtonkem5y ago

My understanding is that Prometheus is designed for you to deploy multiple instances within your company, rather than deploying a limited number of instances for the company or division. So I would reasonably run a Prometheus instance by myself or with my neighboring teams rather than depending on a centralized instance run by $OPS.

scaryclam5y ago

This is how we use it and it works well. Other teams are also free to use whatever else they want and if we need an "overview" it's pretty easy to upstream certain metrics elsewhere (say, a centralised system run by ops) to collate together.

Being able to also control which metrics are important to my team vs the wider team is a BIG bonus of this sort of decentralised system.

1 more reply

EdwardDiego5y ago

We've used Thanos to aggregate multiple Prometheus (Promethii?) across our clusters to enable us to scale, each Prometheus deals only with a subset of scrape targets.

Biggest issue I've had was an app that was accidentally publishing several thousand metrics which caused the default scrape timeout of 15s to kick in.

(It was publishing Kafka lag per consumer group per topic, which was fine and dandy, until someone released an app that runs about 500 instances at peak, and scaled up and down frequently, and had incorporated the pod id into the consumer group names, which led to Kafka tracking many, many, many consumer groups. Given that the consumers were low value anyway, we now just exclude them from having their lag tracked.)

fnord1235y ago

>Promethii

Prometheuses.

ii is for latin words. Prometheus is/was Greek. I guess you could use Prometheoí but it would quickly derail any conversation. :)

1 more reply

skohan5y ago

> Grafana truly is best in class

Really? Recently we've been playing with Chronograf with InfluxDB and most people find it a lot nicer to work with than Grafana (specifically because it makes discoverability a lot nicer)

_xrjp5y ago

For our modest cloud infra, InfluxData TICK (InfluxDB, Kapacitor, Chronograf and Telegraf) stack has fitted exactly with our needs. We really like its folding building-blocks, interoperability and yeah... easy discoverability and configuration. But also its very convenient InfluxQL which lets us customize reports with ease on InfluxDB.

VectorLock5y ago

Recent Grafana's Explore interface is much nicer.

jtl9995y ago

> Not to mention the lost writes (missing data points) which have no logged explanation.

FWIW I've had similar issues with MySQL backed Zabbix before.

bacondude35y ago· 12 in thread

What's a good alternative to Prometheus when pulling stats is impractical? Say I want to monitor a personal laptop like I would a server. It will change networks and IP addresses, so pulling would be impractical to configure, whereas the laptop could easily(?) push its stats to a remote server.

MetalMatze5y ago

Prometheus supports writing (replicating) data to a remote endpoint on a per scrape basis with a protocol called remote-write. You can pretty easily set that up on any Prometheus instance. There are quite some implementations to receive those remote-write requests: https://prometheus.io/docs/operating/integrations/#remote-en...

You're probably exactly looking for something like that. In fact, I've given a talk about a similar scenario at the KubeCon San Diego: https://www.youtube.com/watch?v=FrcfxkbJH20

Disclosure: I work on Thanos and Thanos Receiver which implements that protocol.

bacondude35y ago

Oh, I had no idea about that. Thanks!

cassianoleal5y ago

I've been using ZeroTier [0] so whenever my laptop is online, it's also accessible by Prometheus so it can be scraped. I run ZeroTier in a container on my router, and on each "road warrior" that needs LAN access (laptop, phone...).

The caveat is that I have no metrics when the laptop is offline but that doesn't happen very often anyway.

[0] https://www.zerotier.com/

bacondude35y ago

That looks like a really cool solution. Thanks for the suggestion!

valyala5y ago

You can run lightweight vmagent [1] on your laptop. The vmagent collects all the metrics from your laptop and then sends them to remote storage when connection to the remote storage is available. It can send data to any remote storage system that support Prometheus remote_write protocol [2].

[1] https://victoriametrics.github.io/vmagent.html

[2] https://prometheus.io/docs/operating/integrations/#remote-en...

bacondude35y ago

That looks like a really promising solution. Thanks!

bvrmn5y ago

You can use push approach. Like graphite. Grafana has excellent support of graphite queries. Also graphite has low-footprint implementation https://github.com/baverman/hisser

molecule5y ago

Telegraf + InfluxDB?

abhishekjha5y ago

This is my setup on all my raspberrypis. I have not be been able to figure out how to monitor a cluster though. I saw that grafana free tier doesn’t allow a cluster of servers getting monitored. I have telegraf + influxdb + grafana installed on all my servers.

1 more reply

valyala5y ago

vmagent + victoriametrics :)

cortesoft5y ago

Use a push gateway?

ncrmro5y ago

If you use wireguard you can tell Prometheus to scrape your ip on the wireguard network

site-packages15y ago· 7 in thread

What do you all do with the collected metrics over time? Do you store everything forever, drop everything after a couple weeks, or something on between? I've heard of people thinning out old data a bit (?) and storing it long term rather than storing everything. What's the usual thing people do?

sagichmal5y ago

High fidelity operational metrics have a useful half life measured in days or weeks. Read patterns for longer term use cases are also categorically different. Best architecture is to have a separate system for long term stuff, which treats Prometheus as a data source. Then Prometheus can drop after 14-28d.

dxhdr5y ago

> High fidelity operational metrics have a useful half life measured in days or weeks.

It depends on how high fidelity you're talking but in my experience retaining these metrics can be valuable, not only for viewing seasonal trends already mentioned in another reply but for debugging problems. It can be helpful to be able to view prior events and compare metrics at those times to a current scenario, for example as a part of a postmortem analysis. I do agree that the usefulness of old metrics falls off with time. Metrics issued from a system 3 years ago likely have little in common with the system running today.

jldugger5y ago

> High fidelity operational metrics have a useful half life measured in days or weeks.

Depends on the metric IMO. There's a ton of use you can get out of forecasting and seasonality for anomaly detection, but you need data going back for that to have any chance. Many relevant operations metrics exhibit three levels of seasonality: daily (day/night) weekly (weekday/weekend) and annual (holidays, superbowls, media events). Being able to forecast network traffic inbound on a switch to find problems would require you to have 1y of data, effectively. You _might_ be able to discard some of the data but you'd lose some of the predictive capacity for say, the Super Bowl.

1 more reply

codeduck5y ago

7 day retention in Prometheus, pushing to something like VictoriaMetrics for downsampling and long term storage. Prometheus is great for collection but rubbish for managing large data sets

Legogris5y ago

One thing you can do is configure compressions, so essentially less recent data has lower time resolution and/or less cardinality. So some dimensions are dropped and you only have e.g. 1h resolution for data older than some threshold.

bjakubski5y ago

Depends on your needs really. Some metrics we do (for now) keep indefinitely. We're using Thanos to ship data to bucket in object storage Some metrics we do keep for two weeks only.

heliodor5y ago

Then that's business data not monitoring data. Whole different use case and tools. One loses its value over time, the other doesn't.

Jedd5y ago· 6 in thread

We've got a somewhat similar landscape, on a pretty sizeable network - big investment in Zabbix and looking to move, perhaps slowly and perhaps only in part, towards Prometheus.

Coming from a monitoring system that supports push and pull with elegant auto-discovery, we're struggling to work out a sane architecture around (effectively pull-only) Prometheus.

EdwardDiego5y ago

There's a push gateway: https://github.com/prometheus/pushgateway

Jedd5y ago

Yeah, I think we've looked at that. It provides push for the last mile, and I suppose you could wrangle some auto-discovery using that tooling, but you're still doing pull from Prometheus to that/those server(s).

We're still a bit stuck trying to replicate all the make-life-easy functionality we get with Zabbix sitting on a honking great PostgreSQL / Timescale database, with a bunch of proxies, and automated agent installs that auto-register.

There's places that doesn't work well (k8s, f.e.) but for conventional fleet metrics it's difficult to abandon.

1 more reply

goranerik5y ago

What are the benefits or main reason for replacing zabbix with prometheus, especially since zabbix introduced prometheus checks?

Jedd5y ago

Good and valid question.

I expect we won't outright replace, but rather augment, especially in spaces where a host-centric tool like Zabbix isn't ideal.

Partly it's driven by a need to monitor things like k8s (in the form of openshift) and pub/sub systems (eg kafka), and to integrate with other data sources (eg elastic).

Possibly more compelling is the need to do more sophisticated things with our data than we can conveniently accomplish with the Zabbix data store -- it's not the DB performance or scalability (PostgreSQL and optionally TimescaleDB) so much as dealing with the schema. Mildly sophisticated wrangling of our data ranges from difficult to impossible.

There's a couple of ways around that - bespoke tooling to facilitate ad hoc interrogations into the DB, duplicate the data at ingest time into multiple datastores, frequent ETL of the Zabbix SQL data into long term (time series) storage. None of these are great options. Plus we're fans of Grafana, so some of our decisions are, and will be, based around maintaining or improving end-user experience of that tool -- and while the Zabbix integration is excellent, the Prometheus integration is even better, so (on the end-user side) that's a highly compelling path.

valyala5y ago

Take a look at VictoriaMetrics. It supports both pull and push models. It is inspired by Prometheus and it supports PromQL-inspired query language - MetricsQL [0].

[0] https://victoriametrics.github.io/MetricsQL.html

Jedd5y ago

Yup - it's on our radar for evaluation.

cmckn5y ago· 6 in thread

Prometheus is great. I first heard about it at KubeCon last fall, and kind of shrugged it off as one of those fledgling "cloud native" projects that I probably didn't need or didn't have time to learn. There's actually a lot of adoption, you can find great exporters and grafana dashboards for almost any OSS you're running today. I started collecting metrics from Zookeeper and HBase in about an hour, having never had access to that telemetry before. From the existence of Cortex[1], it seems that Prometheus doesn't scale incredibly well, but I don't think many users will hit these limits.

[1] https://cortexmetrics.io/

edoceo5y ago

My Prometheus system is a $10/mo Linode. It collects from 27 other hosts, and at least 100 services distributed across those hosts - doesn't even break a sweat. All the exporters run through a wireguard VPN. Prometheus is great for a small/medium SaaS type environment.

abhishekjha5y ago

What do you use as a frontend? As far as I could tell grafana free tier doesn’t allow monitoring cluster of servers.

2 more replies

sagichmal5y ago

Prometheus "scales" really well, but it does so via segmentation and federation, rather than increasing the size of an e.g. cluster. Some use cases don't fit to that model, so projects like Cortex and Thanos exist.

nielsole5y ago

not vertically at least. the memory usage for indexing has room for improvement. If I read the pprofs correctly, every scrape interval and every remote write allocates huge amounts of memory which is only cleaned up on garbage collection. You can easily need >64 gb ram for tenthousands of time series, otherwise you oom.

4 more replies

jeffbee5y ago

Interesting to say that Prometheus is "fledgling". The project is almost 7 years old and the Google thing on which it is based is ~15 years old.

cmckn5y ago

> I first heard about it at KubeCon last fall, and kind of shrugged it off as one of those fledgling...

I didn’t know the age of the project, because I hadn’t heard of it. That’s why I go on to say that in actuality it has a ton of adoption and I’ve had a great experience with it.

kasey_junk5y ago· 4 in thread

I have a love/hate relationship with Prometheus. If I had no budget for metrics its likely the thing I would reach for, but I’m dying for someone to open source a ‘next level’ metrics system (something like Monarch or Circonus but free).

But woe betide the team that has to run it as a service. Not that other metrics systems are better but Prometheus can be brutal in that space.

As a ‘squad level’ tool its really good. After that it gets hairy fast.

RhodesianHunter5y ago

>I’m dying for someone to open source a ‘next level’ metrics system

For the time being this is a "full team working full 40 hour weeks for year(s)" problem, so I'd be shocked to see it done open source.

valyala5y ago

Could you give more context on Monarch or Circonus features that are missing in Prometheus?

BTW, I'm working on VictoriaMetrics - open source monitoring solution that works out of the box. See https://github.com/VictoriaMetrics/VictoriaMetrics

kasey_junk5y ago

1) histograms as the basic primitive. 2) bidirectional transport 3) runtime configurable filtering at source, collection and sync. 4) provenance as part of the transport.

roskilli5y ago

M3 is meant to be an open source, central, horizontally scalable metrics store - but your mileage may vary. Either way, check it out: https://m3db.io

ablekh5y ago· 2 in thread

Interesting post. However, I believe that most content, and especially broad technical one like this, absolutely needs a balanced amount of relevant visual elements (e.g., images, diagrams). If you want it to be readable, that is.

icelancer5y ago

Yeah, odd post that talks a lot about Grafana and visualizations then uses absolutely none.

ablekh5y ago

Yes, it's quite ironic. Hopefully, the authors will eventually improve the post, because core content is valuable.

apihealth5y ago· 1 in thread

I've been looking into Prometheus + Grafana for other reasons. I have some 3rd party APIs connected through API gateway, which I need to health check and I couldn't find other open source alternatives. Gonna move the whole setup to cloud at some point but I'm not sure if this is the right thing to do. Does anyone have other articles/ open source tools which can be helpful to me? This article goes much deeper into how the setup can be used but I'm looking for more simpler use cases of the same setup, for the task I need to do.

valyala5y ago

Probably vmagent could be useful for your case? See https://victoriametrics.github.io/vmagent.html#use-cases

osn93637395y ago· 1 in thread

Does anyone have anything good or bad to share about using Grafana as a front end for metrics logged in AWS cloudwatch? I know it has a plug in and I'm fed up with how bad the cloudwatch dashboards are so wondering if I should check it out.

uaas5y ago

Well, I’d say give it a try. We are using CloudWatch as a Grafana datasource, because this way you can concentrate more of your monitoring to one place, which is useful during troubleshooting. With Grafana 7.x, you can even check and correlate your CloudWatch logs inside Grafana, deeplinked to the AWS Console. After this major version, you can even wire Jaeger into Grafana, so you have a one stop solution for tracing as well (and logging, if you utilize Loki too).

djmetzle5y ago

I thought this article, while a little dry, was very illuminating. It sounds Hyperfeed is running at the very least "Medium Data" (we all thing our Data is Big!). And i think it is fascinating to hear of a case where Prometheus is plainly a bad fit for it's intend purpose. It sounds like cardinality explosion around their ML models was a really bad fit for Prometheus. Its great to hear about deployments "in-situ", and people appreciating where it works well, and where it doesn't.

j / k navigate · click thread line to collapse

104 comments

75 comments · 11 top-level

halfmatthalfcat5y ago· 13 in thread

Prometheus and Grafana are awesome, use them personally for all my monitoring.

However I’m still trying to nail down my high cardinality/highly unique metrics-like data story. What are people using?

I’ve heard a combination of Cassandra/BigTable and Spark as a potential solution?

latchkey5y ago

I found this interesting. My plan is to move from Prom to Victoria.

https://medium.com/@valyala/measuring-vertical-scalability-f...

akulkarni5y ago

sagichmal5y ago

Woof, good luck. Not a great product.

1 more reply

akulkarni5y ago

TimescaleDB is a long-term storage option for Prometheus metrics, has no problem with high-cardinality, and now natively supports PromQL (in addition to SQL) [0]

(Disclaimer: I work at Timescale)

[0] https://github.com/timescale/timescale-prometheus

chinhodado5y ago

I'm just starting to look into this and have a question. If I can export my metrics directly to TimescaleDB and it supports visualization with Grafana, is there any reason to go through Prometheus?

1 more reply

Legogris5y ago

I'd be curious to hear if anyone has done serious evaluation of high-cardinality use-cases of Victoriametrics.

chucky_z5y ago

jdub5y ago

Honeycomb https://honeycomb.io/ is laser focused on this stuff. They built their own datastore (similar to Druid but schemaless) so they could create the experience they were aiming for.

Nihilartikel5y ago

jpgvm5y ago

jrott5y ago

Spark has worked decently for me if you need to be cloud agnostic.

Currently I’m in AWS land and Athena has been mostly working for what I need but I haven’t really pushed it that hard yet.

sagichmal5y ago

Just curious what your numbers are? Unique metrics, cardinality per metric, ingest rate, expected query ranges?

base6985y ago

I have an instance that scrapes about 30K targets for 15 million metrics and it works better than you'd expect. The biggest performance issue we have is rendering the targets page.

We have a plan to split it down to less instances per node but it's worked well enough so far.

dijit5y ago· 12 in thread

Grafana truly is best in class, but I have strong reservations about Prometheus.

I really want to like it, it’s just so _easy_, publish a little webpage with your metrics and Prometheus takes care of the rest. Lovely.

But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives (influxdb has 1s and even Zabbix has 5s).

Not to mention the lost writes (missing data points) which have no logged explanation.

All of this, however, was in my homelab, which, while unconstrained in resources lacks a lot of the fit and finish of a prod system.

But it has a lot of hype, and maybe I’m holding it wrong.

ecnahc5155y ago

> Erything about the design (polling, single database) tells me that it was designed primarily to sit alongside something small.

Prometheus is designed to be "functionally sharded". You shouldn't be running one "mega prometheus". Often it's something like 1 Prometheus per-team, depending on the amount of metrics each produces.

You can use federation at lower resolutions or a one of the distributed setups (Thanos/Cortex) if you want to avoid extra storage or lower resolution that federation entails.

> But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives

Not to distract, but I think you meant resolution, not cardinality. Cardinality is the metadata like labels/dimensions. Resolution is the granularity in the time.

AlphaSite5y ago

Doesn’t it also have cardinality issues?

2 more replies

Legogris5y ago

[0]: https://github.com/VictoriaMetrics/VictoriaMetrics

dewey5y ago

These might be a good read if you are considering it:

- https://www.robustperception.io/evaluating-performance-and-c...

- https://medium.com/@valyala/evaluating-performance-and-corre...

ashtonkem5y ago

scaryclam5y ago

Being able to also control which metrics are important to my team vs the wider team is a BIG bonus of this sort of decentralised system.

1 more reply

EdwardDiego5y ago

We've used Thanos to aggregate multiple Prometheus (Promethii?) across our clusters to enable us to scale, each Prometheus deals only with a subset of scrape targets.

Biggest issue I've had was an app that was accidentally publishing several thousand metrics which caused the default scrape timeout of 15s to kick in.

fnord1235y ago

>Promethii

Prometheuses.

ii is for latin words. Prometheus is/was Greek. I guess you could use Prometheoí but it would quickly derail any conversation. :)

1 more reply

skohan5y ago

> Grafana truly is best in class

Really? Recently we've been playing with Chronograf with InfluxDB and most people find it a lot nicer to work with than Grafana (specifically because it makes discoverability a lot nicer)

_xrjp5y ago

VectorLock5y ago

Recent Grafana's Explore interface is much nicer.

jtl9995y ago

> Not to mention the lost writes (missing data points) which have no logged explanation.

FWIW I've had similar issues with MySQL backed Zabbix before.

bacondude35y ago· 12 in thread

MetalMatze5y ago

You're probably exactly looking for something like that. In fact, I've given a talk about a similar scenario at the KubeCon San Diego: https://www.youtube.com/watch?v=FrcfxkbJH20

Disclosure: I work on Thanos and Thanos Receiver which implements that protocol.

bacondude35y ago

Oh, I had no idea about that. Thanks!

cassianoleal5y ago

The caveat is that I have no metrics when the laptop is offline but that doesn't happen very often anyway.

[0] https://www.zerotier.com/

bacondude35y ago

That looks like a really cool solution. Thanks for the suggestion!

valyala5y ago

[1] https://victoriametrics.github.io/vmagent.html

[2] https://prometheus.io/docs/operating/integrations/#remote-en...

bacondude35y ago

That looks like a really promising solution. Thanks!

bvrmn5y ago

You can use push approach. Like graphite. Grafana has excellent support of graphite queries. Also graphite has low-footprint implementation https://github.com/baverman/hisser

molecule5y ago

Telegraf + InfluxDB?

abhishekjha5y ago

1 more reply

valyala5y ago

vmagent + victoriametrics :)

cortesoft5y ago

Use a push gateway?

ncrmro5y ago

If you use wireguard you can tell Prometheus to scrape your ip on the wireguard network

site-packages15y ago· 7 in thread

sagichmal5y ago

dxhdr5y ago

> High fidelity operational metrics have a useful half life measured in days or weeks.

jldugger5y ago

> High fidelity operational metrics have a useful half life measured in days or weeks.

1 more reply

codeduck5y ago

7 day retention in Prometheus, pushing to something like VictoriaMetrics for downsampling and long term storage. Prometheus is great for collection but rubbish for managing large data sets

Legogris5y ago

bjakubski5y ago

Depends on your needs really. Some metrics we do (for now) keep indefinitely. We're using Thanos to ship data to bucket in object storage Some metrics we do keep for two weeks only.

heliodor5y ago

Then that's business data not monitoring data. Whole different use case and tools. One loses its value over time, the other doesn't.

Jedd5y ago· 6 in thread

We've got a somewhat similar landscape, on a pretty sizeable network - big investment in Zabbix and looking to move, perhaps slowly and perhaps only in part, towards Prometheus.

Coming from a monitoring system that supports push and pull with elegant auto-discovery, we're struggling to work out a sane architecture around (effectively pull-only) Prometheus.

EdwardDiego5y ago

There's a push gateway: https://github.com/prometheus/pushgateway

Jedd5y ago

There's places that doesn't work well (k8s, f.e.) but for conventional fleet metrics it's difficult to abandon.

1 more reply

goranerik5y ago

What are the benefits or main reason for replacing zabbix with prometheus, especially since zabbix introduced prometheus checks?

Jedd5y ago

Good and valid question.

I expect we won't outright replace, but rather augment, especially in spaces where a host-centric tool like Zabbix isn't ideal.

Partly it's driven by a need to monitor things like k8s (in the form of openshift) and pub/sub systems (eg kafka), and to integrate with other data sources (eg elastic).

valyala5y ago

Take a look at VictoriaMetrics. It supports both pull and push models. It is inspired by Prometheus and it supports PromQL-inspired query language - MetricsQL [0].

[0] https://victoriametrics.github.io/MetricsQL.html

Jedd5y ago

Yup - it's on our radar for evaluation.

cmckn5y ago· 6 in thread

[1] https://cortexmetrics.io/

edoceo5y ago

abhishekjha5y ago

What do you use as a frontend? As far as I could tell grafana free tier doesn’t allow monitoring cluster of servers.

2 more replies

sagichmal5y ago

nielsole5y ago

4 more replies

jeffbee5y ago

Interesting to say that Prometheus is "fledgling". The project is almost 7 years old and the Google thing on which it is based is ~15 years old.

cmckn5y ago

> I first heard about it at KubeCon last fall, and kind of shrugged it off as one of those fledgling...

I didn’t know the age of the project, because I hadn’t heard of it. That’s why I go on to say that in actuality it has a ton of adoption and I’ve had a great experience with it.

kasey_junk5y ago· 4 in thread

But woe betide the team that has to run it as a service. Not that other metrics systems are better but Prometheus can be brutal in that space.

As a ‘squad level’ tool its really good. After that it gets hairy fast.

RhodesianHunter5y ago

>I’m dying for someone to open source a ‘next level’ metrics system

For the time being this is a "full team working full 40 hour weeks for year(s)" problem, so I'd be shocked to see it done open source.

valyala5y ago

Could you give more context on Monarch or Circonus features that are missing in Prometheus?

BTW, I'm working on VictoriaMetrics - open source monitoring solution that works out of the box. See https://github.com/VictoriaMetrics/VictoriaMetrics

kasey_junk5y ago

1) histograms as the basic primitive. 2) bidirectional transport 3) runtime configurable filtering at source, collection and sync. 4) provenance as part of the transport.

roskilli5y ago

M3 is meant to be an open source, central, horizontally scalable metrics store - but your mileage may vary. Either way, check it out: https://m3db.io

ablekh5y ago· 2 in thread

icelancer5y ago

Yeah, odd post that talks a lot about Grafana and visualizations then uses absolutely none.

ablekh5y ago

Yes, it's quite ironic. Hopefully, the authors will eventually improve the post, because core content is valuable.

apihealth5y ago· 1 in thread

valyala5y ago

Probably vmagent could be useful for your case? See https://victoriametrics.github.io/vmagent.html#use-cases

osn93637395y ago· 1 in thread

uaas5y ago

djmetzle5y ago

j / k navigate · click thread line to collapse