If you're using a cloud (AWS/Azure/Google), Datadog can capture all the AWS metadata automatically and merge it with existing metrics, so you can use instance tags and such for searching and filtering. It can also capture AWS metrics like ELB and S3 usage, which are hard to get otherwise.
So you simply get all the metrics you need, and you get them easily (I appreciate that people who haven't worked with these probably can't fathom what they are missing out on). There are default charts/dashboards that are quite good and available out of the box, whereas Grafana is empty out of the box and you're once again forced to crawl for dashboard plugins.
Last but not least, the capabilities to search and visualize in Datadog are incredible. You can draw any metric, or any combination of metrics, in different ways and analyze usage. Prometheus can't chart shit. Grafana has limited charting and you're forced to create a dashboard to make one chart, which can't be done when you don't have admin permissions.
By the way, Prometheus doesn't scale. It can reach 1000 or 2000 hosts, tops, and that's the end of it. I've operated it at the limit: some operations got really slow and we had to cut down on tags and some metrics to avoid crashing.
Interesting, I haven't had this experience. I monitor the DBs and middleware you mention, and the OSS plugins + OSS Grafana boards worked pretty much out of the box. For what it's worth, we have around 20 different technologies for DB and middleware.
We aren't using the cloud since we have our own datacenters, so there could be a big difference in usage.
As for Prometheus not scaling, I don't know that I agree. We have more than 5k hosts on it currently and it's working fine. We do use some strategies like recording rules and federation, which are well documented.
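For anyone who hasn't set up federation: a higher-level Prometheus scrapes the /federate endpoint of the lower-level servers and pulls only the series selected by match[] parameters. Here is a rough sketch of how such a scrape URL is put together. The host name and the "job:" naming pattern for pre-aggregated series are assumptions for illustration, not details from the setup described above.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that pulls only series matching the selectors."""
    # Each selector becomes its own match[] parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


url = federate_url(
    "http://prometheus-dc1.example:9090",  # hypothetical per-datacenter server
    ['{__name__=~"job:.*"}'],              # assumed pattern: pre-aggregated series only
)
print(url)
```

Pulling only aggregated series keeps the top-level server small, which is the usual reason to combine federation with pre-computed rules.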
The cloud does make a difference. Just seeing the daily S3 usage per bucket was life-changing. I immediately found that backups were not expiring after a while as they should, costing more and more money. ^^
Do you know how many metrics you are ingesting in Prometheus? What's the storage size? And how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, and we were forced to cut down some metrics and stick to the absolute minimum of tags.
Try prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes, or the du command on the data directory.
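If you'd rather script it than paste the metric names into the UI, something like the sketch below should work against the standard HTTP API (/api/v1/query). The server address is an assumption, and the canned response is made up so the parsing can be shown without a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, expr):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def first_sample_value(payload):
    """Return the first sample of an instant-vector response as a float, or None."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value as a string"].
    return float(result[0]["value"][1])


def fetch_value(base_url, expr, timeout=10):
    """Run the query against a live server (not called in this sketch)."""
    with urlopen(instant_query_url(base_url, expr), timeout=timeout) as resp:
        return first_sample_value(json.load(resp))


# Made-up response in the documented shape, for illustration only.
canned = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "prometheus_tsdb_head_series"},
                "value": [1700000000.0, "1234567"],
            }
        ],
    },
}
print(first_sample_value(canned))
```

Against a real server you would call fetch_value("http://localhost:9090", "prometheus_tsdb_head_series"), and the same for prometheus_tsdb_storage_blocks_bytes.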
It was also mind-blowing how things were integrated. For example: see a slow request? Click into the APM trace. Notice a service on that trace being slow? Click on it and see what host it was running on. From there, another button pulls up all the Docker containers running on the host at that point in time. The CPU usage is visualized - and, aha! We forgot to set a CPU limit on one of those other jobs.
Debugging issues like that would've been nearly impossible otherwise, and we had more than a few cases of that.
As far as speed goes, I haven't had that issue with Prometheus. We use recording rules for things that benefit from being pre-computed.
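For reference, a recording rule is just a named PromQL expression that Prometheus evaluates on a schedule and stores as a new series. Below is a small sketch that renders a one-rule rule file. The group name, the rule name, and the http_requests_total metric are hypothetical examples, not rules from our setup.

```python
import json


def recording_rule_file(group, record, expr):
    """Render a one-rule Prometheus rule file as YAML text."""
    # json.dumps gives double-quoted strings, which YAML also accepts,
    # so special characters in the expression cannot break the file.
    return "\n".join([
        "groups:",
        f"  - name: {json.dumps(group)}",
        "    rules:",
        f"      - record: {json.dumps(record)}",
        f"        expr: {json.dumps(expr)}",
        "",
    ])


rules_yaml = recording_rule_file(
    "example",                                        # hypothetical group name
    "job:http_requests:rate5m",                       # hypothetical rule name
    "sum by (job) (rate(http_requests_total[5m]))",   # hypothetical metric
)
print(rules_yaml)
```

Dashboards then query the cheap pre-computed series (job:http_requests:rate5m here) instead of re-running the aggregation on every refresh.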
I imagine the UX is quite different when using a product.