https://www.atatus.com https://tracetest.io/ https://uptrace.dev/
(I work for Atatus)
I read the horror stories, the monthly bills of 10's of thousands for one server and just assumed there was something more substantial to the product; like they did something groundbreaking or novel. I never cared enough to actually look and see what they did.
I use uptime-kuma - https://github.com/louislam/uptime-kuma - it obviously does a fraction of what these other things do but it does everything I need.
The "omg my bill is out of control" issue is usually manifest from a few sources, one of the biggest is relying heavily on custom metrics added over time, and so you think you're paying X but really you end up paying 2-3X or more by the end of the year. But the tricky thing is, most of those things that cost a lot of money either have a lot of value, or held a lot of value at the time.
(I work for a Datadog competitor)
Learning that datadog charges these as custom metrics was shocking. This opens a wormhole of allow-list or opt-in considerations, and then there is tag cardinality or even introducing a middleware like vector. It feels very backwards to spend effort on reducing observability.
If datadog was crap it would make things easier, but it really is a fantastic product, and a business after all. Prometheus integration is just so very cumbersome which is probably strategic I would imagine.
I'd love an open source alternative. But there just isn't one for APM (which is our main use case). Nothing comes close. Every time I see "OpenTelemetry integration" I just close the page. Hours and hours of manual setup, code pushes, etc while New Relic installs once and works.
I assume it's the same for people who use DataDog begrudgingly.
I used to work on a growing AWS product with tons of features that no one used.
Often when we were creating a feature, our managers would have us include tags and support for making parts of the feature optional, but make sure no parts of the feature (or the feature itself) where optional to start with. We would enable the ability to toggle the feature if "A significant enough amount of customers weighted by revenue requested it".
Also got the "Build filtering, but don't expose it unless we have to".
Grafana, Victoria Metrics or Honeycomb
As opposed to just a stack, they are implementing just about the whole backend shebang from scratch.
As a guy running this for personal infrastructure, I super appreciate the easy install and low system requirements
Others have mentioned Signoz which I have tried but for my homelab it just feels like too much.
OpenObserve might be the solution that I have been looking for.
P.S. I am all for OSS.
Repository: https://github.com/quickwit-oss/quickwit Latest release: https://quickwit.io/blog/quickwit-0.8
In theory, you can get away with just running a Clickhouse instance purely backed by S3 to get durability + scalability (at the cost of performance of course). It all depends on what scale you're running at and the HA/performance requirements you need.
Columnar databases are very good at compressing time series data which often has runs of repeating values that can be run length encoded, repeating deltas (store the delta not the full value), or common strings that can be dictionary encoded. So you can persist a lot of raw data with quite good compression and fast-scannability. Most commercial TSDBs are now backed by column stores. And several now tier with local SSDs for hot data and S3 for colder data.
If that's still too much data to store, you have to start throwing some away. Both sampling and materializing aggregates (and discarding the raw data) are popular techniques and can both be very reasonable trade offs.
It's "low-rank" until that one day systems start shitting the bed, and you're trying to understand what is going on.
OneUptime is 100% open source, and always will be. It's a mono-repo and backend is divided into few services.
Grafana started as visualization tool and has now decoupled multiple products for observability - LGTM stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics). You need to configure and maintain multiple sub-products for a full-stack observability setup.
While grafana stack is great, OneUptime has all of these in one platform and makes it really simple to use.
We also are built natively on OpenTelmetry, use clickhouse for logs / metrics / traces storage - so queries are really fast.
There is no "the solution" to site monitoring. Eventually someone will need scripting to bridge a gap. Preferably python.
I've been working with an enterprise license for a year and while I don't really hear too much about cost, some simple considerations to the design of the infrastructure it was supporting seems to have prevented a ballooning bill (so far).
So for me, not having the engineering time or buy-in to build a whole home grown observability platform by using OSS tools like this (and all the quirks that can come with them) ends up being a lot more expensive than just sucking it up and buying an enterprise plan. At least so far.
If I had the option to do it from scratch how I wanted, with no time or budget constraints, I'd prefer of course not to be beholden to a major SaaS company that charges for ambiguous things that are hard to predict like "per host", because it's quite easy for these services to bury themselves so deep into your infrastructure that you just bite the bullet on whatever inevitable rug pull or price increase comes next. It has happened to me before managing enterprise Hashicorp Vault.
It looks solid and I'd try it if the need arises.
100% of what we do is FOSS.
Great features for logs, metrics & traces, total compatibility with open telemetry, cost optimization tools built in (DataDog leavers typically save around 50%), and much more!
Check out our site, and you can find me on LinkedIn (or indeed reply here !) if you want to ask further questions.