Show HN: Distributed Tracing Using OpenTelemetry and ClickHouse

(github.com)

105 points · vmihailenco · 4y ago · 41 comments

33 comments · 12 top-level

debuggerpk · 4y ago · 5 in thread

How is this different from Datadog?

As far as I understand the license you're allowed to self host this for free, so long as you're monitoring your own stuff rather than reselling monitoring as a service.

I can see this being very attractive for side projects / early stage bootstrap stuff where you may not be able to afford something like datadog.

On the flip side, there is something to be said about someone else hosting your monitoring systems, as hopefully if you have a massive outage the third party system will still be up.

I'd imagine that you could achieve this here by starting off self hosted and then migrating to their cloud offering once your systems were critical enough to justify it.

You're right that this is a pretty crowded space though - look forward to seeing how they do

vmihailenco (OP) · 4y ago

You are right about the license - it allows everything Apache 2.0 allows except reselling monitoring as a service.

vmihailenco (OP) · 4y ago

Well, I was under the delusion that Uptrace had a clean and simple UI, but I guess for others the UI is just as confusing. :(

I think we've done a pretty good job with filtering, grouping, aggregation, and data exploration in general. That is something I am proud of.

Uptrace is significantly cheaper.

As for the rest, it is the same but different. DD is bigger and more complex. I guess that is not a problem once you get used to the UI.

We have been using your hosted service for a while now and we are perfectly happy with Uptrace. Compared to Jaeger, the UI is miles ahead. I like Jaeger for what it brought, but its UI is just not very good.

We tried other hosted services, but most if not all of them seem to think you are creating gold from thin air and charge you accordingly. We started with a hosted Jaeger solution and switched to Uptrace without looking back.

tarun_anand · 4y ago

For a young open source project, your UI is pretty good. In fact that is what got me here to comment!

Are there any mobile client SDKs (Android/iOS) that you are aware of?

d-fal · 4y ago · 3 in thread

This is quite a cool piece of work. Would you please make a comparison between Uptrace and Jaeger?

vmihailenco (OP) · 4y ago

I will try, but I don't know that much about Jaeger and it is hard to be unbiased.

UI-wise, Uptrace does not have service dependency analysis, because I don't find it very useful/interesting (let me know if I am wrong). I believe Jaeger does not have span grouping or percentiles, and its filters/aggregation are much simpler.

Jaeger has remote sampling which can be a powerful feature if users are ready to spend their time configuring it. Would be nice to hear if that is the case and if many people are using it.

I've also seen some work on in-memory tail-based sampling, but I don't know if that is ready to use or not. I plan to add tail-based sampling to Uptrace too.

That is about as much as I know :)

atombender · 4y ago

We use remote sampling. I don't have a ton of experience with Jaeger, but I've used it a lot, and implemented OpenTracing in a dozen or so apps.

The benefit of remote sampling, as I understand it, is that the sample rate doesn't need to be configured in each app. So in principle, you could have a little slider in the UI to adjust the sample rate for each individual app, perhaps increasing it during a particular incident to capture more traffic data.

Without remote sampling, you'd have to hard-code the rate into the app's config or code, which then requires a redeploy to change the setting. With a lot of microservices to maintain, it seems a lot simpler to have a central location for such settings.

Uptrace looks fantastic, by the way. It's about time someone gives Zipkin and Jaeger some competition. And Clickhouse makes a lot more sense as a backend than Cassandra.
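
To make the remote sampling idea above concrete, here is a minimal Python sketch of a head sampler whose rate lives in a central place rather than in each app's config. It is not Jaeger's actual client or protocol; the `RemoteRateSampler` class and the `fetch_rate` callable are hypothetical names used only for illustration, and `fetch_rate` stands in for whatever would query the central service.

```python
import random
import threading


class RemoteRateSampler:
    """Head sampler whose rate can change at runtime without a redeploy."""

    def __init__(self, service, fetch_rate, default_rate=0.01, rng=None):
        self.service = service
        # Hypothetical callable: service name -> sample rate (0..1) or None.
        self._fetch_rate = fetch_rate
        self._rate = default_rate
        self._lock = threading.Lock()
        self._rng = rng or random.Random()

    def refresh(self):
        """Pull the latest rate from the central store; keep the old one on failure."""
        try:
            rate = self._fetch_rate(self.service)
        except Exception:
            return self._rate
        if rate is not None:
            with self._lock:
                # Clamp so a bad central value cannot break sampling.
                self._rate = min(1.0, max(0.0, float(rate)))
        return self._rate

    def should_sample(self):
        """Decide at the start of a trace whether to record it."""
        with self._lock:
            rate = self._rate
        return self._rng.random() < rate
```

An app would call `refresh()` on a timer and `should_sample()` at the start of each trace, so moving the "slider" centrally takes effect on the next refresh.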

d-fal · 4y ago

Yes you are mostly right, we can open an issue on github and try to compare them, if you agree. This would be among the first things a user wants to know before picking up uptrace.

mritchie712 · 4y ago · 2 in thread

Nice, I'm using Clickhouse to build Luabase[0]. I saw "ClickHouse cluster support" on your roadmap, are you just using a single node right now? How are you handling persistent storage?

0 - https://luabase.com/

vmihailenco (OP) · 4y ago

By cluster support I mean:

- ability to use ReplicatedMergeTree in the table schema

- round-robin writing to multiple nodes

It is mostly a matter of providing configuration options. Thought I could skip it in the first release.
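
As a rough illustration of the two items above, the Python sketch below builds a table definition that uses `ReplicatedMergeTree` and rotates writes across nodes. It is not Uptrace's schema or code: the column list, the ZooKeeper path layout, and the `send` callable are assumptions. `{shard}` and `{replica}` are ClickHouse macros that each server substitutes from its own configuration.

```python
import itertools


def replicated_spans_ddl(table="spans"):
    """Build a CREATE TABLE statement using ReplicatedMergeTree.

    The columns are placeholders, not a real tracing schema.
    """
    return (
        f"CREATE TABLE {table} (\n"
        "    trace_id String,\n"
        "    span_id String,\n"
        "    start_time DateTime64(9),\n"
        "    duration_ns UInt64\n"
        ") ENGINE = ReplicatedMergeTree(\n"
        f"    '/clickhouse/tables/{{shard}}/{table}', '{{replica}}'\n"
        ")\n"
        "ORDER BY (start_time, trace_id)"
    )


class RoundRobinWriter:
    """Send each batch to the next node in a fixed rotation."""

    def __init__(self, nodes, send):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = itertools.cycle(nodes)
        # Hypothetical callable: (node, batch) -> None, e.g. one INSERT per batch.
        self._send = send

    def write(self, batch):
        node = next(self._nodes)
        self._send(node, batch)
        return node
```

With replication in place, a batch written to any one replica is copied to the others by ClickHouse, which is why rotating the target node is enough to spread the write load.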

>How are you handling persistent storage?

If you mean avoiding data loss by using ClickHouse cluster, then yes - we use CH cluster and replication :)

ClickHouse handles data corruption surprisingly well: even if there are broken parts, CH continues to serve the rest of the data.

mritchie712 · 4y ago

Where is the clickhouse data stored, in the Docker container?

For reference, here's what I'm using: https://github.com/Altinity/clickhouse-operator/blob/master/...

tomnipotent · 4y ago · 2 in thread

Is data being inserted into CH as it's received, or is there an intermediary buffer? A general overview of the flight of telemetry data through the system would be very welcome.

vmihailenco (OP) · 4y ago

The data is received via OTLP (the OpenTelemetry protocol) and almost immediately inserted into a ClickHouse Buffer table in small batches. Simple and very efficient.

Tail-based sampling will require buffering spans in memory for some time, but tail-based sampling is not implemented yet.

The cloud version also uses Kafka to survive surges in traffic, but I guess the "personal" / company version does not need that as much, so there is no need to introduce an additional dependency.

lma21 · 4y ago
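
The "small batches" step described above could look roughly like the following Python sketch, which flushes when a batch is full or has waited long enough. It is an assumption about the general shape, not Uptrace's code; `SpanBatcher` and the `insert` callable (imagined as one INSERT into the Buffer table per call) are hypothetical.

```python
import time


class SpanBatcher:
    """Collect spans and hand them to `insert` in small batches."""

    def __init__(self, insert, max_spans=1000, max_age_s=1.0, clock=time.monotonic):
        # Hypothetical callable: list of spans -> None.
        self._insert = insert
        self._max_spans = max_spans
        self._max_age_s = max_age_s
        self._clock = clock
        self._pending = []
        self._oldest = None

    def add(self, span):
        if not self._pending:
            # Remember when the current batch started filling.
            self._oldest = self._clock()
        self._pending.append(span)
        if len(self._pending) >= self._max_spans:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch once it is old enough."""
        if self._pending and self._clock() - self._oldest >= self._max_age_s:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._insert(batch)
```

The age limit keeps latency low when traffic is light, while the size limit bounds memory when traffic is heavy.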

For tail-based sampling, does it mean that every process in a trace will keep its spans in memory until the initial process 'ends' the trace? How does the flushing happen (e.g. all processes 'commit' their buffer spans)? Many thanks for the explanations!
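
One common way to approach the question above is to buffer in the collector rather than in each process: every process exports spans as usual, and the collector holds them per trace for a fixed wait before deciding. The Python sketch below shows that design only; it says nothing about how Uptrace will implement it, and `TailSampler`, `has_error`, and the span dictionary fields are hypothetical.

```python
class TailSampler:
    """Buffer spans per trace and decide once the trace has waited long enough."""

    def __init__(self, keep, decision_wait_s=10.0):
        # Hypothetical policy: list of spans in one trace -> bool.
        self._keep = keep
        self._wait = decision_wait_s
        # trace_id -> (time first span was seen, spans so far)
        self._traces = {}

    def add(self, span, now):
        _, spans = self._traces.setdefault(span["trace_id"], (now, []))
        spans.append(span)

    def flush(self, now):
        """Return spans of kept traces; drop the rest. Undecided traces stay buffered."""
        due = [t for t, (seen, _) in self._traces.items() if now - seen >= self._wait]
        kept = []
        for trace_id in due:
            _, spans = self._traces.pop(trace_id)
            if self._keep(spans):
                kept.extend(spans)
        return kept


def has_error(spans):
    """Example policy: keep a trace if any of its spans is marked as an error."""
    return any(s.get("error") for s in spans)
```

In this design no process needs to "commit" anything; the trade-off is that spans arriving after the wait has expired miss the decision for their trace.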

rad_gruchalski · 4y ago · 2 in thread

This solution has some great ideas, but there’s always a “but”.

Main things distinguishing this from Jaeger are on the storage side. Jaeger has pluggable span writers and readers so it would be possible to do the same right in Jaeger.

The UI part is probably more work than the actual storage but the default Jaeger UI is anyway not the main tool people tend to work with.

vmihailenco (OP) · 4y ago

"But" do you need another storage? :)

I understand that it is not ideal to have so many competing tools, but contributing to an existing mature project is a nightmare. It is by far easier to start a new one.

>Jaeger UI is anyway not the main tool people tend to work with.

Which tools / features do you have in mind?

Uptrace OS competes with Jaeger / Zipkin / SigNoz / SkyWalking and I believe it already does a pretty good job.

rad_gruchalski · 4y ago

We use a custom elastic storage on steroids with Keycloak integration for enabling multi-tenancy in Jaeger so we can do SLA tracking and reporting. So the answer is yes.

I get your point about contributing, especially features that are incompatible with the maintainer vision. Feature creep, right?

What I value in open source projects is extensibility. Plugins which one can maintain outside of the main product.

> "But" do you need another storage? :)

I’m only saying that it’s possible. I might not need it but if someone does and they want to self host it as a managed solution, it can be done right in Jaeger.

> Which tools / features do you have in mind?

The default Jaeger UI isn’t really ergonomic. Trace info is more useful in the context of other information. As in, tools pulling trace info out of storage and overlaying on other data. There’s also Grafana Tempo.

pachico · 4y ago · 2 in thread

Now that ClickHouse has released async writes, you won't need to use the Buffer engine anymore, which has always been a non-recommended solution anyway.

vmihailenco (OP) · 4y ago

Maybe. Are you already using that feature in production? Buffers have been available for years and just work. It is hard to say how async inserts perform in real applications.

I also use the Buffer engine, don't get me wrong. The only reason I'm not using async writes is that I only run Altinity-certified versions in my clusters, so I'm waiting for it.

joshxyz · 4y ago · 2 in thread

I wonder how well this works with cloki (loki for clickhouse)

qxip · 4y ago

cloki can be used to read metrics out of any CH table so it should work fine.

we also just introduced experimental support for ingesting OTLP/ZIPKIN spans and a tempo-compatible API in cloki, looking for testers to validate this feature:

https://github.com/lmangani/cLoki/wiki/Tempo-Tracing#clickho...

Internally trace spans are stored as tagged JSON logs, meaning they are available from both Loki and Tempo APIs and can be used from pretty much any visualization, too!

vmihailenco (OP) · 4y ago

Those are 2 separate projects and they don't work together. I still have not had a chance to try Loki / Tempo, so I can't say how well they work in practice...

shazzy · 4y ago · 1 in thread

Uptrace looks really interesting. I particularly like the query language that you can use to query your distributed trace data. This is the biggest limitation I have found with Jaeger: lots of valuable data is kept in storage, but it's very hard to analyze in aggregate.

For example, a question I want to be able to answer with a query against the distributed trace data: show me the (mean, median) time between a parent http request and a child http request in the same trace tree. As far as I understand, this requires the query language to be able to group by trace id, then be able to identify parent/child relations.

Does the Uptrace query language allow you to do something like this?
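
For reference, the computation asked about above can be written outside any query language as a group-by on trace ID followed by a parent lookup. The Python sketch below does this over spans held in memory. The field names (`trace_id`, `span_id`, `parent_id`, `kind`, `start`) are assumptions, not Uptrace's or Jaeger's schema, and "time between" is taken to mean child start minus parent start.

```python
from collections import defaultdict
from statistics import mean, median


def parent_child_gaps(spans, kind="http"):
    """Return start-to-start gaps for parent/child span pairs of the given kind."""
    # Group spans by trace, indexed by span ID for the parent lookup.
    by_trace = defaultdict(dict)
    for s in spans:
        by_trace[s["trace_id"]][s["span_id"]] = s

    gaps = []
    for trace in by_trace.values():
        for child in trace.values():
            parent = trace.get(child.get("parent_id"))
            if parent and parent["kind"] == kind and child["kind"] == kind:
                gaps.append(child["start"] - parent["start"])
    return gaps


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "kind": "http", "start": 0},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "kind": "http", "start": 10},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "kind": "db", "start": 5},
    {"trace_id": "t2", "span_id": "a", "parent_id": None, "kind": "http", "start": 100},
    {"trace_id": "t2", "span_id": "b", "parent_id": "a", "kind": "http", "start": 130},
]
gaps = parent_child_gaps(spans)
print(mean(gaps), median(gaps))
```

The parent lookup is scoped to a single trace, which is the "group by trace id, then identify parent/child relations" step the question describes.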

vmihailenco (OP) · 4y ago

So far my experience is that it is best to avoid trying to solve such problems with a query language and instead provide a much simpler UI to achieve the same. Solving such problems with SQL is tedious enough and learning another custom language is not fun / too much to ask from users.

Sometimes using a UI is not possible, for example, if you want to automate such checks. In that case, I would build a custom metric or two and would use that metric for monitoring purposes. That requires some programming / instrumentation, but it still looks like a better solution to me.

drchaim · 4y ago · 1 in thread

I've been thinking about mixing OpenTelemetry and CH for a while. There is wild competition in this area. I wish you luck and fun!

vmihailenco (OP) · 4y ago

Thanks. The competition is high indeed thanks to OpenTelemetry. Let me know if you have any ideas / feature requests - it is interesting how other people see OpenTelemetry + ClickHouse working together.

cpursley · 4y ago · 1 in thread

Looks interesting. Can you share a couple of common use cases, say - for somebody running a typical saas app? Also, are there plans for an Elixir client?

vmihailenco (OP) · 4y ago

You can use Uptrace to monitor route/query performance, errors, and partially logs (only Go at the moment and it is unofficial). Distributed tracing excels at finding the root cause and resolving issues in production. That is the future / present of any kind of monitoring and the potential is immense.

>Also, are there plans for an Elixir client?

The Uptrace client is just a pre-configured distribution of OpenTelemetry, so you can try configuring OpenTelemetry directly: https://github.com/open-telemetry/opentelemetry-erlang . We will provide a pre-configured client for Elixir if there is interest.

vmihailenco (OP) · 4y ago

Uptrace is an open source distributed tracing system that uses OpenTelemetry to collect data and ClickHouse database to store it. ClickHouse is the only dependency.

Would be happy to answer your questions here.

debuggerpk · 4y ago

how is it different from https://signoz.io

signoz datastore is built with cassandra and yours with clickhouse.
