
0 points · aprdm · 5y ago · 0 comments

Can you expand? As someone who maintains a large-ish prometheus/grafana installation on-prem, I don't know what we're missing! We have a couple of custom metrics that we developed in prometheus, and the OSS plugins/dashboards look great.


5 comments · 2 top-level

user5994461 · 5y ago · 2 in thread

Datadog has incredible integration with clouds (AWS and other), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl hundreds of broken plugins for prometheus to get one third of that.

If you're using a cloud (AWS/Azure/Google), Datadog can capture all the AWS metadata automatically and merge it with existing metrics, so you can use instance tags and such for searching and filtering. It can also capture AWS metrics like ELB and S3 usage, which are hard to get otherwise.

So you simply get all the metrics you need and get them easily (I appreciate that people who haven't worked with these probably can't fathom what they are missing out on). There are default charts/dashboards that are quite good and available out of the box, whereas grafana is empty out of the box and you're once again forced to crawl for dashboard plugins.

Last but not least, the capabilities to search and visualize in datadog are incredible: you can draw any metric or combination of metrics in different ways and analyze usage. Prometheus can't chart shit. Grafana has limited charting and you're forced to create a dashboard to make one chart, which can't be done because you don't have admin permissions.

By the way, prometheus doesn't scale. It can reach 1000 or 2000 hosts tops and that's the end of it. I've operated it at the limit: some operations got really slow and we had to cut down on tags and some metrics to avoid crashing.

aprdm (OP) · 5y ago

> Datadog has incredible integration with clouds (AWS and other), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl hundreds of broken plugins for prometheus to get one third of that.

Interesting, I haven't had this experience. I monitor the DBs and middleware you mention, and the OSS plugins + OSS grafana boards worked pretty much out of the box. For what it's worth, we have around 20 different technologies for DB and middleware.

We aren't using cloud since we have our own datacenters so there could be a big difference in usage.

As for prometheus not scaling, I don't know that I agree. We have more than 5k hosts on it currently and it is working fine. We do use some strategies like recording rules and federation, which are well documented.
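To make the federation part concrete, here is a minimal sketch of the scrape URL a global prometheus would pull from a per-datacenter one. The /federate endpoint and its match[] parameter are standard prometheus; the host name and the choice of matchers are assumptions for illustration, not our actual setup.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL selecting series by the given matchers."""
    # Each matcher is passed as its own repeated match[] parameter.
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-datacenter server; pull only pre-aggregated series
# (recording-rule output is conventionally named with colons, e.g. job:...).
url = federate_url(
    "http://prometheus-dc1.example:9090",
    ['{__name__=~"job:.*"}', '{job="prometheus"}'],
)
print(url)
```

Federating only the aggregated series, rather than every raw series, is what keeps the top-level server small.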

user5994461 · 5y ago

The question would be: which plugins exactly? There are tons of plugins spread over GitHub, more or less working, and they're constantly shifting. I had maybe 15 integrations working perfectly on datadog in 2 weeks of work. The same on prometheus/grafana could easily have taken 6 months (with a few exporters to write from scratch).

The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^

Do you know how many metrics you are ingesting in prometheus? The storage size? And how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, and we were forced to cut down some metrics and stick to the absolute minimum of tags.

Try prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes, or the du command on the data directory.
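If you'd rather script it than paste the metric names into the UI, something like the sketch below should work against the standard /api/v1/query endpoint. The localhost URL and the sample payload are made up for illustration; only the response shape follows the HTTP API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, expr):
    """Build an instant-query URL for the prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def total_from_response(payload):
    """Sum the sample values of an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each sample is {"metric": {...}, "value": [timestamp, "number-as-string"]}.
    return sum(float(s["value"][1]) for s in payload["data"]["result"])


def fetch_total(base_url, expr):
    """Run the query against a live server (not called in this sketch)."""
    with urlopen(instant_query_url(base_url, expr), timeout=10) as resp:
        return total_from_response(json.load(resp))


# Made-up payload in the API's response shape, so the parsing can be
# checked without a running server.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "prometheus_tsdb_head_series"},
                "value": [1700000000.0, "1250000"],
            }
        ],
    },
}
head_series = total_from_response(sample)
print(instant_query_url("http://localhost:9090", "prometheus_tsdb_head_series"))
print(head_series)
```

Against a real server you would call fetch_total once per metric name and compare the numbers.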


acid__ · 5y ago · 1 in thread

From my perspective at a smaller startup (so YMMV), the first thing I noticed after switching to Datadog was how _fast_ it was. The queries and graphing capabilities were also really powerful. Or maybe I just didn't know how to use the old tools, but regardless it was super easy to pick up and do things I struggled with in Prometheus/Grafana.

It was also mind-blowing how things were integrated. For example: see a slow request? Click into the APM trace. Notice a service on that trace being slow? Click on it and see what host it was running on. From there, another button pulls up all the Docker containers running on the host at that point in time. The CPU usage is visualized and, aha! We forgot to set a CPU limit on one of those other jobs.

Debugging issues like that would've been nearly impossible otherwise, and we had more than a few cases of that.

aprdm (OP) · 5y ago

Yeah that kind of integration seems neat. We use ELK + Prometheus and it does require having Kibana + prometheus open OR building a dashboard in grafana pulling from both sources.

As far as speed goes, I haven't had that issue with prometheus. We use recording rules for things that benefit from being pre-computed.

I imagine the UX is quite different when using a product.
