Having dealt with Prometheus (+Thanos) / Grafana / OTEL and other stacks (e.g. custom solutions on ClickHouse, Victoria{Metrics,Logs}, Jaeger/Tempo, Loki, ...) and even cloud ones (Google's Monarch rebranded as Prometheus)... what's your selling point? To me this seems like yet another way to re-invent the wheel.
If it's just for running locally, okay, fine, but when it comes to production (where the stack really matters) at scale, you end up with lots of tradeoffs and approaches.
Why is this one a winning one compared to the overwhelming "competition"? Seems like we're re-inventing the wheel for the 100th time instead of focusing on unifying the efforts in making the existing solutions better. Thankfully we now have OTEL, so at least the interoperability part is somewhat solved (or mitigated).
Last December I had a customer complaint, and it took me 2 days to find the issue. I had to pay $800 for Sentry and a bit more for New Relic. The issue was a locking problem that happened only in very specific cases, erroring in different places and timing out in others. Unfortunately, power users were running into it often. I had two systems that were completely disconnected, and no SLO to catch this. Super annoying.
Anyhow, I spent a day looking at those and eventually went, screw this, I'm gonna just make this actually work. So I spent a few hours, hooked it up, no auth or anything nice, pulled the traces and found the issue. Turns out it was locking caused by a long-running transaction in a scheduled task, and it had been there for years.
The big thing for me is that it automatically flags issues and prioritizes them, taking into account errors, response codes, and timing. That's why I'm making it: no venture capital, funded by actual revenue from the start (not paying for Sentry or New Relic anymore). It's really a dev-focused tool to help smallish teams find and fix issues before customers even have time to complain.
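To make "prioritizing by errors, response codes, and timing" concrete, here is a rough Python sketch of how such a ranking could work. It is not Traceway's actual algorithm: the `EndpointStats` fields, the equal weighting, and the 500 ms latency budget are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    """Hypothetical per-endpoint aggregates over some time window."""
    name: str
    requests: int
    exceptions: int      # requests that raised an unhandled exception
    server_errors: int   # requests answered with a 5xx status code
    p95_ms: float        # 95th percentile latency in milliseconds


def priority_score(s: EndpointStats, latency_budget_ms: float = 500.0) -> float:
    """Combine the three signals into one number; higher means look at it first."""
    if s.requests == 0:
        return 0.0
    exception_rate = s.exceptions / s.requests
    error_rate = s.server_errors / s.requests
    # How far p95 is over the (assumed) budget, capped so latency cannot dominate.
    latency_overrun = min(max(0.0, s.p95_ms / latency_budget_ms - 1.0), 1.0)
    return exception_rate + error_rate + latency_overrun


def rank(endpoints: list[EndpointStats]) -> list[EndpointStats]:
    """Sort endpoints from most to least urgent."""
    return sorted(endpoints, key=priority_score, reverse=True)


endpoints = [
    EndpointStats("login", requests=1000, exceptions=0, server_errors=0, p95_ms=120),
    EndpointStats("checkout", requests=1000, exceptions=50, server_errors=40, p95_ms=1200),
    EndpointStats("search", requests=1000, exceptions=5, server_errors=2, p95_ms=600),
]
ranked_names = [e.name for e in rank(endpoints)]
```

With these made-up numbers, `checkout` comes out on top because it is both erroring and slow, which is roughly the kind of cross-signal issue that two disconnected tools would each only show half of.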
Anyhow, hope that explains it. It's kinda related to cloud costs, but mostly it's my personal frustration with existing tools. Also I did NOT want to host a multi-service stack (Grafana, OTel Collector, Prometheus, Mimir, Loki, k8s) for something that can be done in a 60 MB Go binary that runs on a $3 server...
There are a lot of tools in this space, most of them pretty good. The goals when I started Traceway were:
- simple to host and reason about
- cheap to host
- comes preconfigured for sub-15-dev teams
- completely open source, no paid add-ons
It's not aimed at teams that can afford SREs (yet), the idea was to provide a good tool for smaller teams and startups in the sub 15 dev range.
The base of Traceway is ClickHouse, nothing special there. If you want, you can run it with SQLite for self-hosting. Sessions are also stored in S3, so the costs are minimal.
It is opinionated: it comes with preconfigured SLOs for flagging issues with endpoints, and it will never try to sell you an AI SRE. You can file your exceptions/SLO issues with the git integration and run whatever AI you want on them (I was sick of observability tools trying to sell me an AI). The goal is to have a one-line setup for OpenTelemetry that gets you everything you need in Traceway without any additional configuration. It's Datadog/Sentry, but combined and fully open source.
I'm a huge fan of open source, here is what we've done so far for making existing solutions better:
1 - Session Replays/RUM
Session replays are usually a premium/expensive feature. With Traceway you can self-host them and add them to your app in minutes. I am working on making this a standalone feature that ties into the OTel SDKs for mobile/JS so that you can get your spans/logs/metrics/exceptions from any platform connected to your session replays in Traceway. At one point I got nerd-sniped into making it work with Flutter, so we are the only solution I know of that has affordable, usable session replays for Flutter.
2 - Symfony Otel
Symfony, the PHP framework, had no library that offered a few-line setup and worked out of the box with OpenTelemetry. We wrote one, and you can use it with any tool out there.
3 - Symbolicator
We're working on a symbolicator that will be OpenTelemetry Collector compatible, so that you can get your stack traces for JS/Flutter/Android/iOS resolved back to the original source. From what I can tell, no good solution exists for this currently.
I will make a proper HN post at some point with more info on the project; right now I am focusing on building. If you have any ideas or things you'd like to see, feel free to comment, join our Discord community, or open an issue in our git. We're always happy to accept PRs.
VictoriaMetrics and VictoriaLogs have worked fine in my quiet homelab, and VictoriaMetrics appeared to work great for the infrastructure team of an open source online video game I contribute to (say about 10 physical nodes and 20 applications/services)... I was going to suggest VictoriaLogs to them next but wanted to ask what roadblocks could come up.
I kept wondering why no one was trying to solve this problem. It looks like it was just a matter of time.
With that being said, my experience can be very skewed since prepbook is a passion project running on a VPS with essentially 0 scale. All I care about is the UX of the stack, not scale. Just for context.
What exactly were you struggling with when it came to the setup? Just a ton of new concepts to learn which took time, or something specific to Grafana/Prometheus/Loki?
I'm working on a comprehensive benchmark of Traceway performance on different hardware configurations. The most I've tested so far was the smallest managed ClickHouse instance at 250k traces per second, and it handled that without a hiccup (but that's an informal observation, not a benchmark). You can check out the Traceway git: there is an issue I've opened for benchmarking, and you can subscribe to or comment on it if you're interested. I'm benchmarking across SQLite, self-hosted ClickHouse, and managed ClickHouse. I am a huge fan of systematic, realistic, and most of all reproducible benchmarks, so I am really excited about the progress on that.
Anyhow, you can check out Traceway and see what it offers. It's aimed at providing SLOs out of the box, session replays, alerting, configurable dashboards, great exception tracking (automatic symbolication), etc.
Production? As soon as you scale you need a proper solution. Prometheus (by itself) doesn't scale - you need Mimir or Thanos (or similar).
ClickHouse (the "ClickStack") seems to be the new kid on the block. Looks very promising.
I have a background in having done a lot of stuff on the Elastic stack related to this; including setting up a big Elastic Fleet based stack for one client at some point. It might not be the cheapest, but it does provide awesome filtering and querying capabilities. However, a lot of teams that use it don't really know how to tap into that capability so it tends to be overengineered for what it does in the end. And the extra, underutilized complexity is why a lot of teams are wary of dealing with that stack.
Storing the data is the easy part, but what's the point if you can't run queries against it and produce dashboards and diagnostic tools that actually help you? Prometheus/Grafana or older Graphite-type setups tend to be compromises where you get lots of data but are then limited on the querying front or in the number of metrics. The tradeoff is always between scale and querying flexibility. If you store tens or hundreds of GB of telemetry per day, you need a way to make sense of it. ClickHouse seems to be quite good at both scaling and querying. It's basically a column-oriented database. I don't have direct experience with Loki.
But in the end, all that power only matters if people actually use it. And, again, in my experience teams tend not to. They tend to have a lot of unrealized aspirations around their tools and infrastructure. If it's just a dumping ground for data + a few simplistic dashboards, optimize for that. A lot of that data is actually only kept for compliance/auditing reasons. For that, querying is usually a secondary concern and it's OK if queries take a bit longer and are less powerful.
Traceway has custom dashboards, fully supports OTel logs/traces/metrics/exceptions, has session replays for web and Flutter (working on iOS/Android now), has alerting integrations with Slack/email/GitHub, OAuth login with Google/GitHub, and a bunch of other features... All MIT. None behind a paywall.
It has a specific set of trade offs, those are by design, but I am also always open to changing them and improving it. If you try it and have any thoughts the git issues are constantly monitored.
In reality it's a very modular system, and the telemetry repositories can be swapped out easily. I have implemented a ClickHouse and a SQLite version (to simplify self-hosting), so adding a Loki-like repository would be a breeze. It's not on the roadmap currently, as I am putting a lot of effort into 3 different parts right now.
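For readers wondering what "swappable telemetry repositories" can look like in practice, here is a minimal sketch of the general pattern. It is written in Python purely for illustration (Traceway itself is a Go binary), and the `SpanRepository` interface, its method names, and the table layout are hypothetical, not Traceway's real code.

```python
import sqlite3
from typing import Protocol

Span = tuple[str, str, float]  # (trace_id, name, duration_ms)


class SpanRepository(Protocol):
    """Hypothetical storage interface; callers depend only on this."""

    def insert(self, trace_id: str, name: str, duration_ms: float) -> None: ...
    def slowest(self, limit: int) -> list[Span]: ...


class SQLiteSpanRepository:
    """Backend that keeps spans in a SQLite database (in-memory by default)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS spans "
            "(trace_id TEXT, name TEXT, duration_ms REAL)"
        )

    def insert(self, trace_id: str, name: str, duration_ms: float) -> None:
        self._db.execute(
            "INSERT INTO spans VALUES (?, ?, ?)", (trace_id, name, duration_ms)
        )
        self._db.commit()

    def slowest(self, limit: int) -> list[Span]:
        cur = self._db.execute(
            "SELECT trace_id, name, duration_ms FROM spans "
            "ORDER BY duration_ms DESC LIMIT ?",
            (limit,),
        )
        return [tuple(row) for row in cur.fetchall()]


class InMemorySpanRepository:
    """Stand-in for a second backend; a ClickHouse one would have the same shape."""

    def __init__(self) -> None:
        self._spans: list[Span] = []

    def insert(self, trace_id: str, name: str, duration_ms: float) -> None:
        self._spans.append((trace_id, name, duration_ms))

    def slowest(self, limit: int) -> list[Span]:
        return sorted(self._spans, key=lambda s: s[2], reverse=True)[:limit]


def record_and_report(repo: SpanRepository) -> list[Span]:
    """Application code that never needs to know which backend it talks to."""
    repo.insert("t1", "GET /users", 12.5)
    repo.insert("t2", "POST /orders", 340.0)
    repo.insert("t3", "GET /health", 1.2)
    return repo.slowest(2)


sqlite_result = record_and_report(SQLiteSpanRepository())
memory_result = record_and_report(InMemorySpanRepository())
```

Both backends return the same two slowest spans, which is the point: adding another store means implementing the same two methods, nothing more.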
The truth is that ClickHouse is an incredible DB that scales really well for observability data.
But you can't beat the excellent price and performance. Does what I need and much more
https://github.com/plausible/analytics
Elixir.
- ClickStack (ex HyperDX)
- SigNoz
- Traceway
- a few more
Does anyone have enough experience with those to be able to tell which one works best?
I believe what makes it work is the glue, i.e. what it integrates with: connecting to application logging (log4j), notifications via Slack.
A few years back I was working with Splunk (but that's another galaxy when it comes to cost) and the ELK stack. But this was only for logging, not full observability.
I have not used SigNoz or ClickStack. I believe both are very good products that focus on slightly different things.
With Traceway I am trying to focus on providing a preconfigured system that works out of the box and tells you what's wrong and what to fix. It comes with a great issue tracker, session replays/RUM, preconfigured dashboards, and it's easy to host. It has alerting integrations with Slack and GitHub. The idea is to be proactive rather than reactive when you start growing, so rather than waiting for a failure before you build out an SLO, it comes with them included.
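As a rough idea of what "SLOs included" could mean, here is a small Python sketch that ships a default objective and reports how much of its error budget is left. The `SLO` type, the default target of 99%, and the function name are assumptions for illustration only, not Traceway's actual defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float  # required fraction of good requests, must be below 1.0


# Hypothetical out-of-the-box default, so nobody has to define one after an outage.
DEFAULT_AVAILABILITY = SLO(name="availability", target=0.99)


def error_budget_remaining(slo: SLO, total: int, bad: int) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo.target) * total
    return 1.0 - bad / allowed_bad


healthy = error_budget_remaining(DEFAULT_AVAILABILITY, total=10_000, bad=50)
breached = error_budget_remaining(DEFAULT_AVAILABILITY, total=10_000, bad=150)
```

Here `healthy` is about half the budget left and `breached` is negative, which is the signal a tool would use to open an issue before anyone files a complaint.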
Depending on what you're looking for, Traceway may or may not be the best option for you, but all feedback is welcome and I am working on improving it every day. You can check out the GitHub repo. It's super easy to self-host, and I am always down to chat about how it works in the Traceway Discord.
Traceway is fully OTel compliant.
Go: The original version started with Go SDKs. I've since moved to using Go OTel. I haven't updated those docs yet because the Go SDKs still work and are used in the wild, but they're on the deprecation track. Thanks for pointing it out.
Symfony: There were no good one-line OTel integrations out there for Symfony, so we wrote one. It is not a custom SDK, it's an OTel configurator. You can use it with any backend, not just Traceway. We're firm believers in contributing back to the OpenTelemetry community.
Frontend / mobile: This is more complicated. The current frontend and mobile OTel spec does not allow session replays to be sent, so for those platforms we still keep SDKs with a custom protocol alongside OTel. As soon as the spec matures I'm hoping to move it fully to OTel.
Trust is hard to earn and that is why everything I have done with Traceway has been and will continue to be open source. Traceway cloud currently has 3 enterprise customers that are using it and about 50 on lower tier plans. It's an actual live product.
The marketing website has been fully vibe coded. I have too much on my plate right now and I'm not great at designing marketing pages. At some point I'm planning to rewrite it, since it's what most people have complained about. I just have too many things that I need to finish first in the actual product.
I use Claude Code periodically. Other than telling you to check out my git commit history for the last 10 years, there is not much more I can do. The number of commits this year has not been any greater than before. I don't think I'm pushing things out too quickly or at lower quality.
If you want to read an engineering article I've written recently to see how I approach things here is one I am proud of: https://medium.com/@dusan.stanojevic.cs/flutter-session-repl...
Other than that I just have to continue building, there is nothing else I can do, but I understand where you're coming from and I think that your concern is absolutely valid.
I saw it recently, I think it looks amazing, I haven't looked into it enough to know of any downsides. I am currently heads down in building as I have the roadmap cut out for the next few months, I will circle back to them as soon as I have a bit more time.
If you're familiar with their platform, feel free to check out Traceway and let me know if there are any incredible features you'd like to see in Traceway or anything they're missing. I am always looking for feedback!
Unfortunately my account is being rate limited and I can't respond to each comment.
Thank you for your support. The attention the project has received has been unreal.
I'll be responding to everyone as the rate limit subsides but I've made this in the meantime: https://github.com/tracewayapp/traceway/blob/main/HN.md
Again, thank you for your support!
If I could turn back the clock a year and evaluate the tool you are proposing against what actually happened, here's what it would look like.
The first thing we had problems with was performance. We moved to smaller Hetzner Cloud machines and split a multi-tenant bare-metal system into fine-grained virtual machines. Having no metrics meant that we were completely blind to what had happened over time. We could log into a console and run 'top', but we couldn't do this after the fact. Decision: self-hosted graphite. I see you have metrics in tracewayapp -> +1.
Fast forward 2 months. The second problem we had was stability. Because we had moved from a single stable machine to around 200 unstable cloud machines, we had no idea which systems were up and which weren't. We researched how to outsource uptime monitoring. We had online meetings with the sales teams of uptime.com and uptimerobot. The initial cost was doing two of those 30-minute sales/fit calls, but that's marginal. The real cost would be their pricing, which if I remember correctly was 1USD/probe/mo. We'd need 200 probes by their definition. The pricing is what killed the deals. Decision: self-host uptimekuma. Initial cost of the in-house setup and then just the cost of the smallest Hetzner machine, which is 2.99EUR/mo. We heavily rely on the uptimekuma->slack integration for notifications. I see no uptime tool in tracewayapp -> -1.
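To spell out the arithmetic behind that decision, here is a tiny Python calculation using only the figures quoted above (1 USD per probe per month as remembered, 200 probes, 2.99 EUR per month for the server). The two totals are in different currencies and are deliberately not converted, and the one-off in-house setup time is not included.

```python
from decimal import Decimal

PROBES = 200
HOSTED_PRICE_PER_PROBE_USD = Decimal("1.00")  # "if I remember correctly"
SELF_HOSTED_SERVER_EUR = Decimal("2.99")      # smallest Hetzner machine

hosted_monthly_usd = PROBES * HOSTED_PRICE_PER_PROBE_USD
hosted_yearly_usd = hosted_monthly_usd * 12

self_hosted_monthly_eur = SELF_HOSTED_SERVER_EUR
self_hosted_yearly_eur = self_hosted_monthly_eur * 12

print(f"hosted:      {hosted_monthly_usd} USD/mo, {hosted_yearly_usd} USD/yr")
print(f"self-hosted: {self_hosted_monthly_eur} EUR/mo, {self_hosted_yearly_eur} EUR/yr")
```

Even before any exchange rate, that is roughly two orders of magnitude apart per month, which matches the "pricing is what killed the deals" conclusion.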
Another 3 months passed. We stopped looking at graphite dashboards on a day-to-day basis. Natural human optimization. Systems started going down because they ran out of disk space, or because of bugs that exhausted connection pools to the database (twice a month for one of the biggest customers). We quickly realized we needed threshold notifications based on metrics. Decision: self-host moira. We heavily rely on its slack integration. It's hard to find whether there's something similar in tracewayapp -> -1 (correct me if I'm wrong).
Fast forward a few more months. Some of the deployments had a bug that resulted in a flood of exceptions. Even though the system was up (uptimekuma green), some critical services were not working. We did a tricky hibernate upgrade and only found out 4 hours after the deployment that the system was not working. Decision: integrate logging with metrics (graphite) and moira to trigger slack notifications when, say, #errors > threshold. It's hard to say whether this workflow is easily configurable in tracewayapp -> -1.
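The "#errors > threshold" rule described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how moira or Traceway implement it: the `ErrorThresholdAlert` class, the sliding window, and the notify-once-per-breach behaviour are all assumptions, and sending the actual Slack message is left out.

```python
from collections import deque


class ErrorThresholdAlert:
    """Fires when more than `threshold` errors land inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps: deque[float] = deque()
        self._firing = False

    def record_error(self, now: float) -> bool:
        """Record one error at time `now` (seconds).

        Returns True only on the transition into the firing state, so a flood
        of exceptions produces one notification instead of thousands.
        """
        self._timestamps.append(now)
        # Drop errors that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        over = len(self._timestamps) > self.threshold
        should_notify = over and not self._firing
        self._firing = over
        return should_notify


alert = ErrorThresholdAlert(threshold=3, window_seconds=60)
# Errors at t=0..4 seconds, then a lone error much later at t=200.
notifications = [alert.record_error(t) for t in (0, 1, 2, 3, 4, 200)]
```

In this run only the fourth error triggers a notification: the first three stay at or under the threshold, the fifth is suppressed because the alert is already firing, and the late one arrives after the window has emptied.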
Can you elaborate on the points where I may have rated your app negatively, but where you actually support these scenarios and the information is just buried somewhere deep in your docs? (A few pointers would be helpful if this is the case.)
What I like about what looks like a very complex stack is that it works much the same way as UNIX pipes: I can swap out pretty much any one piece in this flow, i.e. I am avoiding vendor lock-in.