

Traceway: MIT-licensed observability stack you can self-host in ~90s

(github.com)

174 points · sebakubisz · 1mo ago · 85 comments


48 comments · 13 top-level

denysvitali1mo ago· 14 in thread

At KubeCon Europe a very good chunk of booths were observability stacks. Everyone was claiming they're better than the competitors (with some of them justifying themselves just by saying "it's written in Rust").

Having dealt with Prometheus (+Thanos) / Grafana / OTEL and other stacks (e.g: custom solution on ClickHouse, Victoria{Metrics,Logs}, Jaeger/Tempo, Loki, ...) and even cloud ones (Google's Monarch rebranded as Prometheus)... what's your selling point? This to me seems like yet another way to re-invent the wheel.

If it's just for running locally, okay, fine, but when it comes to production (where the stack really matters) at scale, you end up with lots of tradeoffs and approaches.

Why is this one a winning one compared to the overwhelming "competition"? Seems like we're re-inventing the wheel for the 100th time instead of focusing on unifying the efforts in making the existing solutions better. Thankfully we now have OTEL, so at least the interoperability part is somewhat solved (or mitigated)

yuppiepuppie1mo ago

I was thinking this might be a result of the Cheap-money (post covid) era ending and everyone scrambling to reduce their Datadog/Cloud costs. Thinking back on 2023/2024, lots of companies were leaking large amounts of capital to those vendors and I imagine lots of people saw an opportunity for creating leaner and cheaper stacks.

dusanstanojevic1mo ago

No need to guess, I'll tell you the exact story of why I made Traceway!

Last Dec I had a customer complaint, took me 2 days to find the issue. I had to pay $800 for Sentry and a bit more for New Relic. The issue was a locking problem that happened only in very very specific cases, erroring in diff places and timing out in others, unfortunately power users were running into it often. I had two systems, no SLO to catch this and they were completely disconnected. Super annoying.

Anyhow, I spent a day looking at those and eventually went, screw this, I'm gonna just make this actually work. So I spent a few hours, hooked it up, no auth or anything nice, pulled the traces and found the issue. Turns out it was locking due to a long transaction existing in a scheduled task, it existed for years.

The big thing for me is that it automatically flags issues and prioritizes them, taking into account errors, response codes, and timing. That's why I'm making it, no venture capital, funded by actual revenue from the start (not paying for Sentry or New Relic anymore). It's really a dev focused tool to help smallish teams find and fix issues before customers even have time to complain.

Anyhow, hope that explains it, kinda related to cloud costs, mostly just my personal frustration with existing tools. Also I did NOT want to host a 5 service stack (grafana, otel collector, prometheus, mimir, loki, k8s) for something that can be done in a 60 MB Go binary that runs on a $3 server...
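To make the "flag and prioritize by errors, response codes, and timing" idea above concrete, here is a minimal sketch. It is not Traceway's implementation: the `Request` record, the two budget constants, and the scoring rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    endpoint: str
    status: int         # HTTP response code
    duration_ms: float


# Hypothetical default objectives; not Traceway's actual values.
MAX_ERROR_RATE = 0.01   # at most 1% of requests may return 5xx
MAX_P95_MS = 500.0      # 95th percentile latency budget


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def flag_endpoints(requests):
    """Return (endpoint, score) pairs that breach an objective, worst first."""
    by_endpoint = {}
    for r in requests:
        by_endpoint.setdefault(r.endpoint, []).append(r)

    flagged = []
    for endpoint, reqs in by_endpoint.items():
        error_rate = sum(r.status >= 500 for r in reqs) / len(reqs)
        latency = p95([r.duration_ms for r in reqs])
        # Score is how far past each budget the endpoint is; 0 means within budget.
        score = max(error_rate / MAX_ERROR_RATE - 1, 0) + max(latency / MAX_P95_MS - 1, 0)
        if score > 0:
            flagged.append((endpoint, round(score, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

An endpoint within both budgets never appears in the result; the rest are ordered so the worst offender comes first, which is the "prioritizing" part.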

robertlagrant1mo ago

This is my instinct too. I've had the pleasure of using DataDog and the pain of negotiating with their salespeople!

dusanstanojevic1mo ago

Hi, creator of Traceway here. Sorry for the late response, I didn't know this got posted and then my account was rate limited on comments.

A lot of tools in this space, most pretty good. The goals when I started Traceway were:

- simple to host and reason about
- cheap to host
- comes preconfigured for sub-15 dev teams
- completely open source, no paid add-ons

It's not aimed at teams that can afford SREs (yet), the idea was to provide a good tool for smaller teams and startups in the sub 15 dev range.

The base of Traceway is Clickhouse, nothing special there, if you want you can run it with sqlite for self hosting. Sessions are also stored in S3 so the costs are minimal.

It is opinionated, it comes with preconfigured SLOs for flagging issues with endpoints and it will never try to sell you an AI SRE, you can file your exceptions/slo issues with the git integration and run whatever AI you want on it (I was sick of observability tools trying to sell me an AI). The goal is to have a one line setup, for OpenTelemetry, that gets you everything you need in Traceway without anything needing to be additionally configured. It's Datadog/Sentry but combined and fully open sourced.

I'm a huge fan of open source, here is what we've done so far for making existing solutions better:

1 - Session Replays/RUM

Session replays are usually a premium/expensive feature. With Traceway you can self host them and add them to your app in minutes. I am working on making this a standalone feature that ties into the otel sdks for mobile/js so that you can get your spans/logs/metrics/exceptions from any platform connected to your session replays in Traceway. At one point I got nerd sniped into making it work with Flutter, so we are the only solution I know of that has affordable usable session replays for Flutter.

2 - Symfony Otel

Symfony, the php framework, had no library that offered a few line setup and worked out of the box with open telemetry. We wrote one, you can use it with any tool out there.

3 - Symbolicator

We're working on a symbolicator that will be Open Telemetry Collector compatible, so that you can get your stack traces for Js/Flutter/Android/iOS resolved back. From what I can tell no good solution exists for this currently.

I will make a proper HN post at some point with more info on the project, right now I am focusing on building. If you have any ideas or things you'd like to see feel free to comment, join our discord community or open the issue in our git, we're always happy to accept PRs.

NortySpock1mo ago

If I can ask a separate question: what scalability problems did you run into with Victoria{Metrics|Logs|Traces}, and at what scale did you hit them?

VictoriaMetrics and Logs have worked fine in my quiet homelab, and VictoriaMetrics appeared to work great for the infrastructure team of an open source online video game I contribute to (say about 10 physical nodes and 20 applications/services ) ... I was going to suggest VictoriaLogs to them next but wanted to ask what roadblocks could come up.

dusanstanojevic1mo ago

I honestly think you are a bot. Whenever I see Victoria mentioned it is always the same, always asking about hitting a scaling problem + promoting it, never responding to any comments. Hope I'm wrong, but it's been one too many. I refuse to use a product that is this dishonest.

yard20101mo ago

I have tried to self host grafana (loki prom and alloy) as o11y stack for prepbook.app. This is hard. I have a bsc in cs not that it says something. I managed to do it eventually, after some research. It was not plug and play in any way. The docs kept saying this solution is not production ready even. I couldn't find the production guide, only the "forget about self hosting and simply pay for us hosting this". After I deployed it the UX was so abrasive my partner won't even try to go into it to figure out a problem. It was a few months ago. Since then new solutions have arrived and I'm waiting to have the time to migrate. I saw PostHog have a solution but I prefer something I could self host and completely own.

I thought how come no one is trying to solve this problem. It looks like it's just a matter of time.

With that being said, my experience can be very skewed since prepbook is a passion project running on a VPS with essentially 0 scale. All I care about is the UX of the stack, not scale. Just for context.

embedding-shape1mo ago

FWIW, I have no CS degree and barely attended school at all, and found Grafana + Prometheus + Loki fairly easy to setup, at least compared to what we used to use before those tools were available. Maybe it's because I used NixOS for the setup, but besides learning some new domain-specific things I didn't know since before, I don't recall hitting any particular bumps or roadblocks, I also went the 100% self-hosted route (spread across two hosts at home).

What exactly were you struggling with when it came to the setup? Just a ton of new concepts to learn which took time, or something specific to Grafana/Prometheus/Loki?

parliament321mo ago

FWIW we've also tried all sorts of different things, and honestly the very vanilla (prometheus -> central thanos, fluentbit -> central loki, grafana) ends up on top. The resource consumption is surprisingly minimal (for a sense of scale, we run about 200k eps for metrics and 1k eps for logs). For all these solutions, I find myself asking the same question as you.. what problem are you trying to solve? Is there anything actually different about your product other than less stability than the battle-tested stack?

dusanstanojevic1mo ago

Hi, sorry for not responding sooner this one slipped through the cracks. I've tried to explain my reasons for starting traceway and what I've been building with it. It's not aimed to be the fastest ingestion tool out there, but it's backed by clickhouse and does minimal processing of the data, you can expect the perf to be as close to clickhouse as possible.

I'm working on a comprehensive benchmark of Traceway performance on different hardware configurations. The most I've tested with was the smallest managed ch instance with 250k traces per sec, handled it without a hiccup (but that's empirical). You can check out the traceway git, there is an issue I've opened for benchmarking and you can subscribe/comment on it if you're interested. I'm benchmarking across sqlite, self hosted clickhouse and managed clickhouse. I am a huge fan of systematic, realistic and most of all reproducible benchmarks, so I am really excited about the progress on that.

Anyhow, you can check out traceway and see what it offers, it's aimed at providing SLOs out of the box, session replays, alerting, configurable dashboards and great exception tracking (automatic symbolication) etc...

ting01mo ago

Do you think Prometheus + Grafana is the way to go?

denysvitali1mo ago

Really depends on the use case. Home lab? Probably.

Production? As soon as you scale you need a proper solution. Prometheus (by itself) doesn't scale - you need Mimir or Thanos (or similar).

Clickhouse (the "clickstack") seems to be the new kid on the block. Looks very promising.

CyberDildonics1mo ago

Is "observability stack" the new term for logs and stats?

denysvitali1mo ago

You have more than that nowadays. Tracing and profiling are part of O11y too

tecoholic1mo ago· 7 in thread

I was looking into this just yesterday. So the Loki + … comparison is a bit off in the Open Source space. The main ones are Signoz and ClickStack in this space. Both using ClickHouse as the database. Heavy compared to something like Loki, but they are OTEL native and not log monitoring. So not in the same category.

jillesvangurp1mo ago

I used Signoz + Clickstack on a vibe coded Go server project a few weeks ago. I just made codex figure out how to setup signoz + dependencies via docker compose. I even got it to pre-populate signoz with dashboards. It wasn't too bad. The whole thing runs with a few GB. I tried to cover metrics, tracing, and logging at the same time. This is not a production ready setup but you need to trade off cost vs. utility here. If it's useful enough, that could justify extra cost.

I have a background in having done a lot of stuff on the Elastic stack related to this; including setting up a big Elastic Fleet based stack for one client at some point. It might not be the cheapest, but it does provide awesome filtering and querying capabilities. However, a lot of teams that use it don't really know how to tap into that capability so it tends to be overengineered for what it does in the end. And the extra, underutilized complexity is why a lot of teams are wary of dealing with that stack.

Storing the data is the easy part but what's the point if you can't run queries against it and produce dashboards and diagnostic tools that actually help you? Prometheus/grafana or older graphite type setups tend to be compromises where you get lots of data but are then limited on the querying front or the number of metrics. The tradeoff is always between scale and querying flexibility. If you store tens/hundreds of GB of telemetry per day, you need a way to make sense of it. Clickhouse seems to be quite good at scaling and querying. It's basically a column database. I don't have direct experience with Loki.

But in the end, all that power only matters if people actually use it. And, again, in my experience teams tend not to. They tend to have a lot of unrealized aspirations around their tools and infrastructure. If it's just a dumping ground for data + a few simplistic dashboards, optimize for that. A lot of that data is actually only kept for compliance/auditing reasons. For that, querying is usually a secondary concern and it's OK if queries take a bit longer and are less powerful.

tecoholic1mo ago

I agree. The sentiment applies to most analytics. People who setup analytics are not the same as end users.

dusanstanojevic1mo ago

You're absolutely on point with this, I've made the perf tracking opinionated, so it comes preconfigured with SLOs that are good for most of the projects where nobody would bother to set them up.

Traceway has custom dashboards, supports otel logs/traces/metrics/exceptions fully, has session replays for web and flutter (working on ios/android now), has alerting integrations with slack/email/github, oauth login w google/github, and a bunch of other features... All MIT. None behind a paywall.

It has a specific set of trade offs, those are by design, but I am also always open to changing them and improving it. If you try it and have any thoughts the git issues are constantly monitored.

dusanstanojevic1mo ago

Agreed, it's a trade-off I am ok with for now.

In reality it's a very modular system, the telemetry repositories can be swapped out easily, I have implemented a clickhouse and a sqlite version (to simplify self hosting) so adding a loki like repository would be a breeze. It's not on the roadmap currently as I am putting a lot of effort into 3 diff parts rn.

The truth is that Clickhouse is an incredible DB that scales really well for observability data.
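The "swappable telemetry repository" idea described above can be sketched as a small storage interface with interchangeable backends. This is only an illustration of the pattern: the `SpanRepository` interface, its method names, and the table layout are hypothetical and not taken from Traceway's code.

```python
import sqlite3
from typing import Protocol


class SpanRepository(Protocol):
    """Hypothetical storage interface; names are illustrative, not Traceway's."""

    def save(self, trace_id: str, name: str, duration_ms: float) -> None: ...

    def slowest(self, limit: int) -> list: ...


class SQLiteSpanRepository:
    """Backend that keeps spans in a SQLite database (in memory by default)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS spans (trace_id TEXT, name TEXT, duration_ms REAL)"
        )

    def save(self, trace_id: str, name: str, duration_ms: float) -> None:
        self.conn.execute(
            "INSERT INTO spans VALUES (?, ?, ?)", (trace_id, name, duration_ms)
        )
        self.conn.commit()

    def slowest(self, limit: int) -> list:
        rows = self.conn.execute(
            "SELECT trace_id, name, duration_ms FROM spans "
            "ORDER BY duration_ms DESC LIMIT ?",
            (limit,),
        )
        return rows.fetchall()


class InMemorySpanRepository:
    """Backend that keeps spans in a plain list; handy for tests."""

    def __init__(self) -> None:
        self.rows = []

    def save(self, trace_id: str, name: str, duration_ms: float) -> None:
        self.rows.append((trace_id, name, duration_ms))

    def slowest(self, limit: int) -> list:
        return sorted(self.rows, key=lambda row: row[2], reverse=True)[:limit]


def record_request(repo: SpanRepository, trace_id: str, name: str, duration_ms: float) -> None:
    """Application code depends only on the interface, not on a backend."""
    repo.save(trace_id, name, duration_ms)
```

Because `record_request` only knows the interface, adding another backend (a ClickHouse or Loki-like one, say) would mean writing one more class with the same two methods.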

adenta1mo ago

I'm partial to open observe, especially because in Ruby the OTEL stuff isn't great for metrics and logs yet.

lytedev1mo ago

I also run open observe at home, but I can't help but feel that the interface could use some... sparkle, and the mobile experience kinda sucks.

But you can't beat the excellent price and performance. Does what I need and much more

dusanstanojevic1mo ago

When I was starting Traceway I was heavily inspired by skylightio from the Ruby ecosystem. I loved their SLOs/ranking perf issues, but I also wanted the features that Sentry offered in one place.

sgt1mo ago· 4 in thread

Funny, the first thing I look for for infra projects like these is to find out if it's written in Go. At that point, my confidence level is increased.

neya1mo ago

Here's something better than that:

https://github.com/plausible/analytics

Elixir.

ddux13891mo ago

I'm the main contributor to Traceway, I LOVE Elixir! Traceway is strictly for monitoring your app, not the actual usage/product analytics. It's for making sure you know how well your backend is performing and to be able to quickly fix issues that show up.

sexylinux1mo ago

Why is it better? On the internet it is not enough to just say something. You need to deliver some facts and / or a comparison. Please try it.

ddux13891mo ago

Go has been incredible for building Traceway, glad you like it too

oulipo21mo ago· 3 in thread

There's a few contenders in self-hostable otel:

- ClickStack (ex HyperDX)
- SigNoz
- Traceway
- a few more

Does anyone have enough experience with those to be able to tell which one works best?

dominikz1mo ago

I have recent experience only with: graphite, uptimekuma, moira - I would certainly recommend for small shops.

I believe what makes it work is the glue, ie. what it integrates with: connect with application logging (log4j), notification with slack.

A few years back I was working with splunk (but this is another galaxy when it comes to cost) and ELK stack. But this was only for logging, not full observability.

dusanstanojevic1mo ago

Hi, creator of Traceway here.

I have not used SigNoz or ClickStack. I believe both are very good products that focus on slightly different things.

With Traceway I am trying to focus on providing a preconfigured system that works out of the box, tells you what's wrong and what to fix. It comes with a great issue tracker, session replays/RUM, preconfigured dashboards and it's easy to host. It has alerting integrations with Slack and GitHub. The idea is to be proactive rather than reactive when you start growing, so rather than waiting for a failure to build out an SLO it comes with them included.

Based on what you're looking for Traceway may or may not be the best option for you, but all feedback is welcome and I am working on improving it every day. You can check out the github + it's super easy to self host and I am always down to chat about how it works in the Traceway Discord.

smuchow19621mo ago

I keep seeing you pop up... good to meet you. I'm new to the platform and find all this very interesting... and to think - I was making a better logger for a godot game...

amne1mo ago· 2 in thread

how can you claim in the readme "no per-language vendor SDK" and then link to a list of per-language client SDKs?

dusanstanojevic1mo ago

Hi, sorry for not responding sooner, didn't realize this post existed.

Traceway is fully OTel compliant.

Go: The original version started with Go SDKs. I've since moved to using Go OTel. I haven't updated those docs yet because the Go SDKs still work and are used in the wild, but they're on the deprecation track. Thanks for pointing it out.

Symfony: There were no good one-line OTel integrations out there for Symfony, so we wrote one. It is not a custom SDK, it's an OTel configurator. You can use it with any backend, not just Traceway. We're firm believers in contributing back to the OpenTelemetry community.

Frontend / mobile: This is more complicated. The current frontend and mobile OTel spec does not allow session replays to be sent, so for those platforms we still keep SDKs with a custom protocol alongside OTel. As soon as the spec matures I'm hoping to move it fully to OTel.

danparsonson1mo ago

Aren't they two different things? Vendor SDKs to get the data in, client SDKs as an option to get the data out?

blazarquasar1mo ago· 1 in thread

Given the heavy LLM usage, I’d probably be a little concerned about the project’s longevity. I personally also can’t stand seeing that typeface on websites anymore…

dusanstanojevic1mo ago

I really did not plan on this to be on HN yet. I think you have a great point and that with all of the projects popping up people should be skeptical of trying things.

Trust is hard to earn and that is why everything I have done with Traceway has been and will continue to be open source. Traceway cloud currently has 3 enterprise customers that are using it and about 50 on lower tier plans. It's an actual live product.

The marketing website has been fully vibe coded, I have too much on my plate right now and I'm not great at designing marketing pages. At some point I'm planning to rewrite it, it has been what most people have complained about, I just have too many things that I need to finish first in the actual product.

I use claude code periodically, other than telling you to checkout my git commit history for the last 10 years there is not much more I can do. The amount of commits this year has not been any greater than before. I don't think I'm pushing on getting things out too quickly or with lower quality.

If you want to read an engineering article I've written recently to see how I approach things here is one I am proud of: https://medium.com/@dusan.stanojevic.cs/flutter-session-repl...

Other than that I just have to continue building, there is nothing else I can do, but I understand where you're coming from and I think that your concern is absolutely valid.

RGJorge1mo ago· 1 in thread

The "easy to set up" framing usually skips the hardest part: whether the metric you're alerting on is meaningful. Most stacks pull container memory from cAdvisor's `container_memory_usage_bytes`, which is the same broken `memory_stats.usage` that `docker stats` reports — includes the kernel's reclaimable page cache. For DB containers with hot working sets, that metric stays at 95%+ constantly. Beautiful Grafana dashboards alerting on a structurally wrong number. The fix is computing real anonymous memory (subtract active_file + inactive_file) — most stacks leave that as a custom exporter exercise. Curious how Traceway handles this out of the box.
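The subtraction described above can be sketched as follows. This assumes a cgroup v2 style `memory.stat` file (one `key value` pair per line, with `active_file` and `inactive_file` fields) and a usage figure read separately, typically from `memory.current`. The function names are made up for illustration, and nothing in the thread says whether Traceway does this.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def non_cache_bytes(usage_bytes, stat_text):
    """Usage minus page cache (active_file + inactive_file), floored at zero."""
    stats = parse_memory_stat(stat_text)
    page_cache = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    return max(usage_bytes - page_cache, 0)


MIB = 1024 * 1024

# Illustrative numbers only: a container reporting 950 MiB of "usage",
# of which 800 MiB is file-backed page cache.
SAMPLE_STAT = f"""anon {150 * MIB}
file {800 * MIB}
active_file {500 * MIB}
inactive_file {300 * MIB}
"""

real_usage = non_cache_bytes(950 * MIB, SAMPLE_STAT)
```

With these sample numbers the container looks like it is at 950 MiB, while only 150 MiB remains once the page cache is subtracted, which is the gap the comment is pointing at.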

sebakubiszOP1mo ago

Curious what LLM model you are.

prabhatsharma1mo ago· 1 in thread

You should take a look at https://github.com/openobserve/openobserve - Extremely performant and simple full-stack observability solution.

dusanstanojevic1mo ago

Creator of Traceway here. Sorry for not responding sooner, didn't realize this HN post existed.

I saw it recently, I think it looks amazing, I haven't looked into it enough to know of any downsides. I am currently heads down in building as I have the roadmap cut out for the next few months, I will circle back to them as soon as I have a bit more time.

If you're familiar with their platform feel free to check out Traceway and let me know if there are any incredible features you'd like to see in Traceway or anything they're missing. I am always looking for feedback!

ting01mo ago· 1 in thread

This looks cool

ddux13891mo ago

Thank you

ArslanS19971mo ago· 1 in thread

This is awesome bro

ddux13891mo ago

Not the OP, but I am the one making Traceway, thank you

dusanstanojevic1mo ago

Hi, I am the creator of Traceway. I've just realized that someone posted about it.

Unfortunately my account is being rate limited and I can't respond to each comment.

Thank you for your support, the attention the project has received has been unreal.

I'll be responding to everyone as the rate limit subsides but I've made this in the meantime: https://github.com/tracewayapp/traceway/blob/main/HN.md

Again, thank you for your support!

ddux13891mo ago

Hey everyone, I'm the original creator of this project. Just saw this thread, I'll do my best to respond to everyone.

dominikz1mo ago

I was helping a small e-commerce shop moving from Hetzner bare metal to Hetzner Cloud. Initially I thought that the difficulty will be in moving the data, but I got surprised. The difficulty was the fact that the application had absolutely zero observability.

If I could turn back the clock a year, and evaluate the tool you are proposing against what happened, here's what it would look like.

The first thing we had problems with was performance. We moved to smaller Hetzner Cloud machines and split a multi-tenant bare metal system into fine-grained virtual machines. Having no metrics meant that we were absolutely blind time-wise. We could log into console and issue 'top', but we couldn't do this after the fact. Decision: self-hosted graphite. I see you have metrics in tracewayapp -> +1.

Now 2 months fast forward. The second problem we had was stability. Because we now moved from a single stable machine to around 200 unstable cloud machines, we had no idea which system is up and which isn't. We researched how to outsource uptime monitoring. We had online meetings with sales teams of uptime.com and uptimerobot. The initial cost was doing two of those 30-minute sales/fit calls. But that's marginal. The real cost would be their per-probe pricing, which if I remember correctly was 1 USD/probe/mo. We'd need 200 probes by their definition. The pricing is what killed the deals. Decision: self-host uptimekuma. Initial cost of in-house setup and then just the cost of the smallest hetzner machine which is 2.99 EUR/mo. We heavily rely on uptimekuma->slack integration for notifications. I see no uptime tool in tracewayapp -> -1.

Another 3 months have passed. We stopped looking at graphite dashboards on a day-to-day basis. Natural human optimization. Systems started going down because of running out of disk space, or bugs that exhausted connection pools to the database (twice a month for one of the biggest customers). We quickly realized we need threshold notifications based on metrics. Decision: self-host moira. Heavily rely on slack integration. It's hard to find whether there's something similar in tracewayapp -> -1 (correct me if I'm wrong).

Some few months fast forward. Some of the deployments had a bug that resulted in a flood of exceptions. Even though the system was up (uptimekuma green), some critical services were not working. We did a tricky hibernate upgrade and only found out 4 hours after the deployment that the system is not working. Decision: integrate logging with metrics (graphite) and moira to trigger slack notifications when say #errors > threshold. It's hard to say whether this workflow is easily configurable in tracewayapp -> -1.
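The "notify when #errors > threshold" workflow above can be sketched as a sliding-window counter. This is a generic illustration, not how moira, graphite, or Traceway implement it: the `ErrorRateAlert` class and its parameters are hypothetical, and `notify` stands in for whatever posts to Slack.

```python
from collections import deque


class ErrorRateAlert:
    """Fire one notification when errors in a time window exceed a threshold."""

    def __init__(self, threshold, window_seconds, notify):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify        # callable taking a message string
        self.timestamps = deque()   # times of errors still inside the window
        self.firing = False

    def record_error(self, now):
        """Register an error seen at time `now` (seconds) and alert if needed."""
        self.timestamps.append(now)
        # Drop errors that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()

        over = len(self.timestamps) > self.threshold
        # Notify only on the transition into the alerting state, so a flood of
        # exceptions produces one message instead of one per exception.
        if over and not self.firing:
            self.notify(
                f"{len(self.timestamps)} errors in the last {self.window_seconds}s"
            )
        self.firing = over


messages = []
alert = ErrorRateAlert(threshold=3, window_seconds=60, notify=messages.append)
for t in [0, 1, 2, 3, 4]:
    alert.record_error(t)
```

In this sketch the alert state is only re-evaluated when a new error arrives, which keeps it short; a real setup would also re-check on a timer so the alert can resolve during quiet periods.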

Can you elaborate more on the points where I perhaps might have rated your app negatively, but you actually actively support these scenarios and the information is buried somewhere deep in your docs? (A few pointers would be helpful if this is the case.)

What I like about, what looks like a very complex stack I ended up with, is the fact that it works much the same way as UNIX pipes: I can pretty much change one piece in this flow - ie. I am avoiding vendor lock-in.
