One of the promises of OTEL is that it allows organizations to replace vendor-specific agents with OTEL collectors, preserving the flexibility to choose the end observability platform. When used with an observability pipeline (such as EdgeDelta or Cribl), you can re-process collected telemetry data and send it to another platform, like Splunk, if needed. Consequently, switching from one observability platform to another becomes a bit less of a headache. Ironically, even Splunk recognizes this and has put substantial support behind the OTEL standard.
OTEL is far from perfect, and maybe some of these goals are a bit lofty, but I can say that many large organizations are adopting OTEL for these reasons.
I think the only issue is that the OpenTelemetry API also includes Metrics and Logs. I just tend to ignore these parts when using OpenTelemetry.
As a backender and half platform engineer I appreciate OTel a lot: I add the OTel instrumentation code, and the telemetry then gets sent wherever our platform guys and girls think is best. It lets me think about it only once and leave the details to the people who have to maintain the infra.
I mean, sure, parts (or maybe all?) of the problems in this area have other solutions. For example, we don't use OTel for logging because we already have Grafana + Loki, and basically everything every app outputs on stdout / stderr gets captured and can be queried. But I like the flexibility for us to fully migrate to all aspects of OTel one day if the scales tilt in another direction.
So what's your beef with all this?
(For the record, I used Sentry many times in the past and I loved it, it's a very no-BS product that I appreciated a lot -- and it adding OTel ingester / collector I viewed as something very positive.)
But I do have to “pip uninstall sentry-sdk” in my Dockerfile because it clashes with something I didn’t author. And anyway, because it is completely open source, the flaws in OpenTelemetry for my particular use case took an hour to surmount, and vitally, I didn’t have to pay the brain damage cost most developers hate: relationships with yet another vendor.
That said I appreciate all the innovation in this space, from both Sentry and OpenTelemetry. The metrics will become the standard, and that’s great.
The problem with Not OpenTelemetry: eventually everyone is going to learn how to use Kubernetes, and the USP of many startup offerings will vanish. OpenTelemetry and its feature scope creep make perfect sense for people who know Kubernetes. Then it makes sense why you have a wire protocol, why abstraction for vendors is redundant or meaningless toil, and why PostHog and others stop supporting Kubernetes: it competes with their paid offering.
That seems obviously true... yet, there are so many people out there that seem unable to learn it that I don't think it's a reliable prediction.
For many applications, it's enough to spin up a VPS/plain Docker container, and it will run fine for many, many years, without adding the Kubernetes complexity on top.
If the application is easy to install and autoconfigures itself, it's even better than having to configure everything yourself or create multi-server Kubernetes clusters.
I wouldn't equate unwillingness, or simply not needing it, with inability to learn.
Most apps can run fine for millions or hundreds of thousands of user sessions on a $5-$50 VPS. People prematurely optimize for scale, adding a lot of complexity that only makes development slower; with more moving parts, there are more things that can break. Start simple. Scaling is mostly a solved problem nowadays; if you quickly need to scale, there are always solutions. In the worst case, you have to scale horizontally, and if you reach the limit of horizontal scaling, either your app is inefficient or your business is already successful, so you are no longer in the "start" phase.
Open standards also open up a lot of use cases and startups. SigNoz, TraceTest, TraceLoop, Signadot: all are very interesting projects that OpenTelemetry enabled.
Most of the problem seems to be that Sentry is not able to provide its Sentry-like features by adopting otel. Getting involved at the design phase could have helped shape the project to account for your use cases. The maintainers have never been opposed to such contributions, AFAIK.
As for limiting otel just to tracing: that would not be sufficient today, as teams want a single platform for all observability rather than different tools for different signals.
I have seen hundreds of companies switch to OpenTelemetry and save costs by being able to choose the best vendor for their use cases.
Lack of docs, a learning curve, etc. are just temporary things that can happen with any big project, and they should be fixed. Also, otel maintainers and teams have always been seeking help in improving docs, showcasing use cases, etc. If everyone cares enough about the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.
Speaking as one of these maintainers, I would absolutely love it if even half of the vendors who depend heavily on OTel contributed back to the project that enables their business.
My own employer has done this for years now (including hiring people specifically so they can continue to contribute), and we're only at about 200 employees total. I like to imagine how complete the project would feel if Google or AWS contributed to the same degree relative to the size of their business units that depend on OTel.
Of course implementing a spec from the provider point of view can be difficult. And also take a look at all the names of the OTEL community and notice that Sentry is not there: https://github.com/open-telemetry/community/blob/86941073816.... This really isn't news. I'd guess that a Sentry customer should just be able to use the OTEL API and could just configure a proprietary Sentry exporter, for all their compute nodes, if Sentry has some superior way of collecting and managing telemetry.
IMO most library authors do not have to worry about annotation naming or anything like that mentioned in the post. Just use the OTEL API for logs, or use a logging API where there is an OTEL exporter, and whomever is integrating your code will take care of annotating spans. Propagating span IDs is the job of "RPC" libraries, not general code authors. Your URL fetch library should know how to propagate the Span ID provided that it also uses the OTEL API.
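As a sketch of what that propagation amounts to, here is the W3C trace-context header an RPC/fetch library would carry across process boundaries (the IDs below are illustrative, and this is a simplified reading of the header format, not a full implementation):

```python
# Minimal sketch of W3C trace-context propagation, which is the job of
# the HTTP/RPC library, not of general application code.
# Header format: traceparent: <version>-<trace-id>-<span-id>-<flags>

def inject(headers, trace_id, span_id, sampled=True):
    """Attach the current span context to outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

def extract(headers):
    """Recover the caller's trace/span IDs on the receiving side."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    return trace_id, span_id

h = inject({}, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
print(extract(h))  # ('0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331')
```

A library that uses the OTEL API gets this wiring for free; the point is that application authors never touch it.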
It is the same as using something like Docker containers on a serverless platform. You really don't need to know that your code is actually being deployed in Kubernetes. Use the common Docker interface is what matters.
I completely agree. The most charitable interpretation of this blog post is that the blogger genuinely fails to understand the basics of the problem domain, or, worst case scenario, they are trying to shitpost away the need for features that are well supported by a community-driven standard like OpenTelemetry.
Y’all realize we’d just make more money if everyone has better instrumentation and we could spend less time on it, and more time on the product, right?
There is no conspiracy. It’s simple math and reasoning. We don’t compete with most otel consumers.
I don’t know how you could read what I posted and think sentry believes otel is a threat, let alone from the fact that we just migrated our JS SDK to run off it.
However with sentry it’s still a pain and the visualization in sentry is kinda weird, since it goes beyond tracing.
And since sentry itself has no otel endpoint it is also really hard to do things like tail sampling.
"In 2015 Armin and I built a spec for Distributed Tracing. Its not a hard problem, it just requires an immense amount of coordination and effort." This to me feels like a nice glass of orange juice after brushing my teeth. The spec on DT is very easy, but the implementation is very very hard. The fact that OTel has nurtured a vast array of libraries to aid in context propagation is a huge achievement, and saying 'This would all work fine if everyone everywhere adopted Sentry' is... laughable.
Totally outside the O11y space, OTel context propagation is an intensely useful feature because of how widespread it is. See Signadot implementing their smart test routing with OpenTelemetry: https://www.signadot.com/blog/scaling-environments-with-open...
Context propagation and distributed tracing are cool OTel features! But they are not the only thing OTel should be doing. OpenTelemetry instrumentation libraries can do a lot on their own, a friend of mine made massive savings in compute efficiency with the NodeJS OTel library: https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...
OpenTelemetry is not competitive to us (it doesn’t do what we do in plurality), and we specifically want to see the open tracing goals succeed.
I was pretty clear about that in the post though.
I think you, the author, stand to benefit directly from a breakup of OpenTelemetry, and a refusal to acknowledge your own bias is problematic when your piece starts with a request to 'look objectively.'
I quite like the idea of only needing to change one small piece of code to switch otel exporters instead of swapping out a vendor trace SDK.
My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.
This is my main gripe too. I don't understand why {traces, logs, metrics} are not just different abstractions built on top of "events" (blobs of data your application ships off to some set of central locations). I don't understand why the opentelemetry collector forces me to re-implement the same settings for all of them and import separate libraries that all seem to do the same thing by default. Besides sdks and processors, I don't understand the need for these abstractions to persist throughout the pipeline. I'm running one collector, so why do I need to specify where my collector endpoint is 3 different times? Why do I need to specify that I want my blobs batched 3 different times? What's the point of having opentelemetry be one project at all?
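To illustrate the triplication, here is roughly what the per-signal setup looks like using the standard OTLP exporter environment variables (the endpoint value is hypothetical; a single OTEL_EXPORTER_OTLP_ENDPOINT fallback does exist, but batching, headers, and processors are still wired per signal):

```shell
# Same collector, three per-signal settings (illustrative endpoint):
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="http://localhost:4317"
```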
My guess is this is just because opentelemetry started as a tracing project, and then became a logs and metrics project later. If it had started as a logging project, things would probably make more sense.
By design, they cannot be abstractions of a single concept. For example, logs have a hard requirement on preserving sequential order and session context, and they emit strings, whereas metrics are aggregated, sampled, and dropped arbitrarily, and consist of single discrete values. Logs can store open-ended data, and thus need to comply with tighter data protection regulations. Traces often track a very specific set of generic events, whereas there are whole classes of metrics that serve entirely different purposes.
Just because you can squint hard enough to only see events being emitted, that does not mean all event types can or should be treated the same.
In part this is a very practical decision: most people already have pretty good tools for their logs, and have struggled to get tracing working. So it's better to work on tools for measuring and sending traces, and just let people export their current log stream via the OpenTelemetry collector.
Notably the OTel docs acknowledge this mismatch between current implementation and design goals: https://opentelemetry.io/docs/specs/otel/logs/#limitations-o...
The ways you process/modify metrics vs logs vs traces are usually sufficiently different that there's not much point in having a unified event model if you're going to need a bunch of conditions to separate and process them differently. Of course, you can still use only one source (logs or events) and derive the other two from it, though that rarely scales well.
Plus, the backends that you can use to store/visualize the data usually are optimized for specific signals anyways.
- Trace events (span events) are intended to be structured events and can possibly have semantic attributes behind them, similar to how spans have semantic attributes. They're great if your team is all bought in on tracing as an organization. They are colocated with their parent span. In practice they have poor searchability/indexing in many tools, so they should only be used when you expect to discover the span first (e.g. debug info that is only useful to figure out why a span was very slow, where you're okay with it not being easily searchable).
- Log records are plain old logs, they should be structured, but don't have to be, and there isn't a high expectation of structured data, much less semantic attributes. Logs can be easily adopted without buying into tracing.
- Events API: this is an experimental part of Otel, but it is intended to be an API that emits logs with the expectation of semantic conventions (and is therefore also structured). AFAIK, end users are not the intended audience of this API.
Many teams fall along the spectrum of logs vs tracing which is why there's options to do things multiple ways. My personal take is that log records are going to continue to be more flexible than span events as an end-user given the state of current tools.
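To make the contrast concrete, here is a toy sketch of the two shapes (plain dicts, not real SDK objects; the IDs and attributes are made up):

```python
# A span event is attached to, and stored with, its parent span,
# while a log record stands alone and is only *correlated* via IDs.
span = {
    "name": "GET /checkout",
    "trace_id": "abc123",  # hypothetical IDs
    "span_id": "def456",
    "events": [
        {"name": "cache_miss", "attributes": {"cache.key": "user:42"}},
    ],
}

log_record = {
    "body": "cache miss for user:42",  # free-form, may be unstructured
    "trace_id": "abc123",              # optional correlation back to the span
    "span_id": "def456",
}

# The event only surfaces once you find the span; the log record is
# independently searchable.
print(span["events"][0]["name"], "/", log_record["body"])
```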
Disclaimer: I help build hyperdx, we're oss, otel-based observability and we've made product decisions based on the above opinions.
It is hard to explain how convenient `tracing` is in Rust and why I sorely miss it elsewhere. The simple part of adding context to logs can be solved in a myriad of ways, yet all boil down to a similar "span-like" approach. I'm very interested in helping bring what `tracing` offers to other programming communities.
It very likely is worth having some people from the space involved, possibly from the tracing crate itself.
(Speaking on behalf of Sentry)
It's no longer a case of "hey, we'll include this little library or protocol instead of rolling our own, so we can hope to be compatible with a bunch of other industry-standard software." It's a large stack with an ever-evolving spec. You have to develop your applications and infrastructure around it. It's very seductive to roll your own simpler solution.
I appreciate it's not easy to build industry-wide consensus across vendors, platforms and programming languages. But be careful with projects that fail to capture developer mindshare.
What difficulties did opting into OTel give you?
From this perspective it doesn't matter if the OTel SDK comes bundled with a bunch of unnecessary code or version conflicts as is suggested in the article. The whole point is to regain control over telemetry & avoid paying $$$ to an ambivalent vendor.
FWIW, I don't think the OTel implementation for mobile is perfect - a lot of the code was originally written with backend JVM apps in mind & that can cause friction. However, I'm fairly optimistic those pain points will get fixed as more folks converge on this standard.
Disclaimer: I work at a Sentry competitor
There are no causal relationships between sibling spans. I think in theory "span links" solve this, but afaict this is not a widely used feature in SDKs or UI viewers.
(I wrote about this here https://github.com/open-telemetry/opentelemetry-specificatio...)
[0]: https://github.com/opentracing/specification/issues/142
Reworking our code to support spans made our stack traces harder to read and in the end we turned the whole thing off anyway. Worse than doing nothing.
- Your SDK's exporter
- Collector processors and general memory limitations based on deployment
- Telemetry backend (this is usually the one that hits people)
Do you know where the source of this rejection happened? My guess would be backend, since some will (surprisingly) have rather small limits on spans and span attributes.
For the life of me, I could not get the Python integration to send traces to a collector. Same URL, same setup, same API key as for Node.js and Go.
Turns out the Python SDK expects a URL-encoded header, e.g. "Bearer%20somekey", whereas all the other SDKs just accept a string with a whitespace.
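In other words, the value had to be percent-encoded first. A quick sketch of the workaround (the key itself is made up):

```python
from urllib.parse import quote

# The Python SDK parsed the OTLP header value as URL-encoded, while the
# Node.js and Go SDKs accepted the raw string. Percent-encoding the
# value first made the same config work everywhere.
raw = "Bearer some-key"
encoded = quote(raw)
print(encoded)  # Bearer%20some-key
```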
The whole split between HTTP, protobuf over HTTP and GRPC is also massively confusing.
We had to use wireshark to identify a super nasty bug in the “JavaScript” (but actually typescript despite being called opentelemetryjs) implementation.
And OTEL is largely unsuitable for short lived processes like CLIs, CI/CD. And I would wager the same holds for FaaS (Lambda).
In the end I prefer the network topology of StatsD, which is what we were migrating from. Let the collector do ALL of the bookkeeping instead of faffing about. OTEL is actively hostile to process-per-thread programming languages. If I had it to do over again I’d look at the StatsD->Prometheus integrations, and the StatsD extensions that support tagging.
Not necessarily true. For example, in one of my hobby Golang projects I found that you can cleanly shut down the OTel collector so that it flushes its backlog of traces / metrics / logs, so I was able to get telemetry readings even for CLI tool invocations that lasted 5-10 secs (connect to servers, get data, operate on it, put it someplace else, quit).
But now that you mention it, it would be nasty if that's not the default behavior indeed.
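The failure mode for short-lived processes is easy to sketch: batched telemetry sits in memory until something flushes it. A toy illustration of the flush-on-exit pattern (deliberately not the real SDK classes):

```python
import atexit

class BatchExporter:
    """Toy batching exporter (not the real SDK): spans queue in memory
    and only leave the process when flushed."""
    def __init__(self):
        self.queue = []
        self.exported = []

    def record(self, span):
        self.queue.append(span)

    def shutdown(self):
        # Flush the backlog; a short-lived CLI that skips this step
        # exits with its telemetry still sitting in the queue.
        self.exported.extend(self.queue)
        self.queue.clear()

exporter = BatchExporter()
atexit.register(exporter.shutdown)  # guarantees a flush even for a 5s run

exporter.record("cli-invocation")
exporter.shutdown()
print(exporter.exported)  # ['cli-invocation']
```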
> OTEL is actively hostile to process-per-thread programming languages
Can you explain why, please?
https://github.com/open-telemetry/opentelemetry-specificatio...
There are more. This is a symptom of how hard it is to dive into Otel, due to its surface area being so big.
That sounds like every single run-of-the-mill internship.
I think Go was smart to make this concept part of the standard library, as it encouraged frameworks to adopt it as well.
Every time I tried to use OT I was reading the doc and whispering "but, why? I only need...".
But this ain’t it. In the opening paragraphs the author dismisses the hardest parts of the problem (presumably because they are human problems, which engineers tend to ignore), and betrays a complete lack of interest in understanding why things ended up this way. It also seems they’ve completely misunderstood the API/SDK split in its entirety - because they argue for having such a split. It’s there - that’s exactly what exists!
And it goes on and on. I think it’s fair to critique OpenTelemetry; it can be really confusing. The blog post is evidence of that, certainly. But really it just reads like someone who got frustrated that they didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage. I wish I could say this was unusual amongst engineers, but it isn’t.
That’s kind of making my point for me fwiw. It’s too complicated. I consider myself a product person so this is my version of that lens on the problem.
I’m not dismissing the people problem at all - I actually am trying to suggest the technology problem is the easier part (eg a basic spec). Getting it implemented, making it easy to understand, etc is where I see it struggling right now.
Aside this is not just my feedback, it’s a synthesis of what I’m hearing (but also what I believe).
> it just reads like someone who […] didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage.
And what about average developers asked to “add telemetry” to their apps and libraries? Their patience will be much lower than that.
Not necessarily defending the content (frankly it should have had more examples), but I relate to the sentiment. As a developer, I need framework providers to make sane design decisions with minimal api surface, otherwise I’d rather build something bespoke or just not care.
This is a gross over-simplification that will leave you with a very skewed view of reality. As a programmer I only ever had to add a library, configure the OTLP endpoint details (host, port, URI, sometimes query parameters as well) and it was done.
It might be "complex and overengineered" if you want to contribute to the OTel libraries but as a programmer-user you are seeing practically none of it. And I would also challenge the "complex and overengineered" part but for now I am not informed enough to do it.
Otelbin [0] has helped me quite a bit in configuring and making sense of it, and getting stuff done.
This is what happens when you use a tool designed for authoring code to also author content.
i.e. "poor grammar unintentionally exposed unclear thinking"
OP (rightfully) complains that there is a mismatch between what they (can) advertise ("We support OTEL") and what they are actually providing to the user. I have the same pain point from the consumer side, where I have to trial multiple tools and service to figure out which of them actually supports the OTEL feature set I care about.
I feel like this could be solved by introducing better branding that has a clearly defined scope of features inside the project (like e.g. "OTEL Tracing") which can serve as a direct signifier to customers about what feature set can be expected.
I do agree that logging and spans are very similar, but I disagree that logs are just spans, because they aren't exactly the same.
I also agree that you can collect all metrics from spans and, in fact, it might be a better way to tackle it. But it's just not financially feasible to do so, so you do need to have some sort of collection step closer to the metric producers.
What I do agree with is that the terminology and the implementation of OTEL's SDK is incredibly confusing and hard to implement/keep up to date. I spent way too many hours of my career struggling with conflicting versions of OTEL so I know the pain and I desperately wish they would at least take to heart the idea of separating implementation from API.
You can burn a lot of money with logs and metrics too. The question is how much value you get for the money you throw on the burning pile of monitoring. My personal belief is that well instrumented distributed tracing is more actionable than logs and metrics. Even if sampled.
(Disclaimer: I work at sentry)
Even if you don't want to consider the privacy concerns: telemetry wastes quite some data of your internet connection.
About the only "privacy concern" with otel is that you are probably shipping traces/metrics to a cloud provider for your internal applications. This isn't the sort of telemetry getting baked into ms or google that is used to try and identify personal aspects of individuals, this is data that tells you "Foo app is taking 300ms serving /bar which is unusual".
I am not sure if it will support session replays like some vendors, such as Sentry or New Relic, offer. Technically, I think session replay (rrweb etc.) is pretty cool, but as a web visitor I am not a fan.
That said, I think this rot comes from the commercial side of the sector: if you're a successful startup with one product (e.g. graphing counters), then your investors are going to start beating you up about why you don't expand into other adjacent product areas (e.g. tracing). Repeat the previous sentence reversed, and so you get Grafana, New Relic, et al. OpenTelemetry is just mirroring that arrangement.
2. I honestly think the main reason otel appears so complex is the existing resources that attempt to explain the various concepts around it do a poor job and are very hand-wavey. You know the main thing that made otel "click" for me? Reading the protobuf specs. Literally nothing else explained succinctly the relationships between the different types of structure and what the possibilities with each were.
> Logs are just events - which is exactly what a span is, btw - and metrics are just abstractions out of those event properties. That is, you want to know the response time of an API endpoint? You don't rewind 20 years and increment a counter, you instead aggregate the duration of the relevant span segment. Somehow though, Logs and Metrics are still front and center.
Is anyone replacing logs and metrics with traces?
The main argument for metrics beyond traces is simply a technology limitation: it's aggregation because you can't store the raw events. That doesn't mean you need a new abstraction on top of those metrics, though. They're still just questions you're asking of the events in the system, and most systems are debuggable by aggregating data points from spans or other telemetry.
As for logs, they're important for some kinds of workloads, but for the majority of companies I don't think they're the best solution to the problem. You might need them for auditability, but it's quite difficult to find a case where logs are the solution to debugging a problem if you had span annotations.
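The "metrics are just aggregations over span properties" view is easy to sketch (the span records and numbers below are made up):

```python
# Hypothetical span records; a latency "metric" falls out of
# aggregating span durations rather than incrementing a counter.
spans = [
    {"name": "GET /api/users", "duration_ms": 120},
    {"name": "GET /api/users", "duration_ms": 80},
    {"name": "GET /api/orders", "duration_ms": 200},
]

# Group durations by endpoint, then aggregate.
by_endpoint = {}
for s in spans:
    by_endpoint.setdefault(s["name"], []).append(s["duration_ms"])

avg_latency = {name: sum(d) / len(d) for name, d in by_endpoint.items()}
print(avg_latency)  # {'GET /api/users': 100.0, 'GET /api/orders': 200.0}
```

At scale you cannot keep every raw span, which is exactly why the aggregation step moves closer to the producers.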
Isn’t this exactly what the SpanExporter API is for? This is in the Go SDK, I suppose it may not be available in other SDKs.
I have used this API to convert OTel spans into log messages as we currently don’t have a distributed tracing vendor.
I don’t follow closely enough to comment on possible causes.
What I do know is that the surface area of code and infrastructure that telemetry touches means adopting something unfinished is a big leap of faith.
Asking because some pieces, like the Collector, aren't technically a stable 1.0 yet, but the bar for stability is extremely high, and in practice it's far more stable than most software out there.
But there are other pieces, such as a language's support for a specific concept, that are truly experimental or even still in-development.
I suspect OP is seeing this directly when talking about the kludginess of the JavaScript API.
The flexibility benefits vendors (I work for HyperDX, based on otel), as it allows for a lot of points of extensibility to build a better experience for end users by extending the vanilla SDK functionality. However, it creates a lot of overhead for end users trying to adopt the "vanilla" SDKs out of the box, as there are 5 layers of abstraction that need to be understood before getting started (which is bad!).
I've only seen the DX of Otel improve over time across the ecosystems they support - so I suspect we'll get there soon enough.
The simple API they describe is basically there in OTel. The API is larger, because it also does quite a few other things (personally, I think (W3C) Baggage is important too), but as a library author I should need only the client APIs to write to.
When implementing, you're free to plug in Providers that use OpenTelemetry-provided plumbing, but you can equally well plug in Providers from DataDog or Sentry or whatever.
Unless I'm missing something, any further complaints could be solved by making sure the Client APIs (almost) never have backward-incompatible changes, and are versioned separately.
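A toy sketch of that split (deliberately not the real OTel packages): the client API is a stable no-op facade that libraries write to, and the app plugs in whichever Provider it wants behind it.

```python
# Toy API/SDK split. The "API" half is a no-op facade with a stable
# surface; a vendor SDK swaps in a real provider at startup.

class NoOpSpan:
    def set_attribute(self, key, value): pass
    def __enter__(self): return self
    def __exit__(self, *exc): return False

class NoOpTracer:
    def start_span(self, name): return NoOpSpan()

_provider = NoOpTracer()  # default: no-op, zero configuration

def set_tracer_provider(provider):
    """The app (or a vendor SDK) plugs a real implementation in here."""
    global _provider
    _provider = provider

def get_tracer():
    return _provider

# A library only ever calls get_tracer(); it never imports the SDK.
def handle_request():
    with get_tracer().start_span("handle_request") as span:
        span.set_attribute("http.route", "/bar")
        return "ok"

print(handle_request())  # ok, even with no SDK installed
```

Swapping vendors then means changing only the set_tracer_provider call in the app, never the library code.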
OTLP imo doesn’t even need to be part of the spec.
But minimal would also mean focusing on solving fewer problems as a whole. Eg OpenTracing plus OpenMetrics plus OpenLogs. I only need one of those things.
OTLP has been quite useful especially in metrics to get a format that doesn't really have any sacrifices/limitations compared to all the other protocols.
I've always wondered, what's the point of the trace ID? What even is a trace?
- It could be a single database query that's invoked on a distributed database, giving you information about everything that went on inside the cluster processing that query.
- Or it could be all database calls made by a single page request on a web server.
- Or it could be a collection of page requests made by a single user as part of a shopping checkout process. Each page request could make many outgoing database calls.
Which of these three you should choose merely depends on what you want to visualize at a given point in time. My hope is that at some point we get a standard for tracing that does away with the notion of trace IDs. Just treat everything going on in the universe as a graph of inter-connected events.
It does have one positive benefit beyond that. If you lose data, or have disparate systems, its pretty easy to keep the Trace ID intact and still have better instrumentation than otherwise.
buried the lede!