E.g. if you get an RPC request coming in and make another RPC request in order to serve it, the traced program needs to track some ID for that request from the time it comes in through to the point where the outgoing HTTP request goes out. Then that ID has to get injected into a header on the wire so the next program sees the same request ID.
IME that's where most of the overhead (and value) from a manual tracing library comes from.
I was hoping odigos was language/runtime-agnostic since it's eBPF-based, but I see it's mentioned in the repo that it only supports:
> Java, Python, .NET, Node.js, and Go
Apart from Go (which is a WIP), these are the languages already supported by OTel's (non-eBPF-based) auto-instrumentation. Apart from a win on latency (which is nice, but could in theory be combated with sampling), why else go this route?
We are constantly adding more language support for eBPF instrumentation and are aiming to cover the most popular programming languages soon.
Btw, I'm not sure that sampling is really the solution to combat overhead; after all, you probably do want that data. Trying to fix a production issue when the data you need is missing due to sampling is not fun.
A lot of Go metrics libraries, specifically Prometheus, introduce a lot of lock contention around incrementing metrics. This was unacceptably slow for our use case at work and I ended up writing a metrics system that doesn't take any locks for most cases.
(There is the option to introduce a lock for metrics that are emitted on a timed basis; i.e. emit tx_bytes every 10s or 1MiB instead of at every Write() call. But this lock is not global to the program; it's unique to the metric and the key=value "fields" on the metric. So you can have a lot of metrics around and not contend on locks.)
The metrics are then written to the log, which can be processed in real time to synthesize distributed traces and Prometheus metrics, if you really want them: https://github.com/pachyderm/pachyderm/blob/master/src/inter... (Our software is self-hosted, and people don't have those systems set up, so we mostly consume metrics/traces in log form. When customers have problems, we prepare a debug bundle that is mostly just logs, and then we can further analyze the logs on our side to see event traces, metrics, etc.)
As for eBPF, that's something I've wanted to use to enrich logs with more system-level information, but most customers that run our software in production aren't allowed to run anything as root, and thus eBPF is unavailable to them. People will tolerate it for things like Cilium or whatever, but not for ordinary applications that users buy and request that their production team install for them. Production Linux at big companies is super locked down, it seems, much to my disappointment. (Personally, my threat model for Linux is that if you are running code on the machine, you probably have root through some yet-undiscovered kernel bug. Historically, I've been right. But that is not the big companies' security teams' mental model, it appears. They aren't paranoid enough to run each k8s pod in a hypervisor, but are paranoid enough to prevent using CAP_SYS_ADMIN or root.)
I think the lock used by the Prometheus library is a great example of why generating traces/metrics is a great fit for offloading to a different process (an agent).
Pachyderm looks very interesting; however, I am not sure how you can generate distributed traces based on metrics. How do you fill in the missing context propagation?
Our way to deal with eBPF's root requirements is to be as transparent as possible. This is why we donated the code to the CNCF and are developing it as part of the OpenTelemetry community. We hope that being open will make users trust us. You can see the relevant code here: https://github.com/open-telemetry/opentelemetry-go-instrumen...
Every log line gets an x-request-id field, and when you combine the logs from the various components, you can see the propagation throughout our system. The request ID is a UUIDv4, but the mandatory version nibble (the "4") gets replaced with a digit that represents where the request came from: background task, web UI, CLI, etc. I didn't take the approach of creating a separate span ID to show sub-requests. Since you have all the logs, this extra piece of information isn't strictly necessary, though my coworkers have asked for it a few times because every other system has it.
Since metrics are also log lines, they get the request-id, so you can do really neat things like "show me when this particular download stalled" or "show me how much bandwidth we're using from the upstream S3 server". The aggregations can take place after the fact, since you have all the raw data in the logs.
If we were running this such that we tailed the logs and sent things to Jaeger/Prometheus, a lot of this data would have to go away for cardinality reasons. But squirreling the logs away safely, and then doing analysis after the fact when a problem is suspected ends up being pretty workable. (We still do have a Prometheus exporter not based on the logs, for customers that do want alerts. For log storage, we bundle Loki.)
As for the original post, OpenTelemetry is forced to be relatively slow because of a huge number of semantic conventions that are meant to make data more useful. I won't go into the legitimacy of that, but while I haven't been able to verify the data this solution records, it is very unlikely to be recording as much information. Manual instrumentation should never lose to eBPF in principle, at least in a compiled language like Go, but eBPF does have great potential to perform better than OTel while recording far less data. Then comes a blog post, users giving the keys to their kernel away, and data ending up in the hands of an enemy state. I doubt that's the case this time, but it's only a matter of time.
Banking apps if you see this, please just instrument your code. Thank you.
By definition, at the 99th percentile, if I have 100 page loads, only the one with the worst latency would be over the 99th percentile. That's not 85.2%, 87.1%, 67.6%, etc. The formula shown in that column makes no sense at all.
We have a similar chart at my job to illustrate the point that high p99 latency on a backend service doesn't mean only 1% of end-user page loads are affected.
Could you please elaborate on a few more details about your benchmark?
- Did you measure the CPU usage of the eBPF agent?
- How does Odigos handle eBPF's perfmap overflow, and did you measure any lost events between the kernel and the agent?
One problem that DTrace has is that the "pid" provider that you use for userspace app tracing only works on processes that are already running. So, if more processes with the executable of interest launch after you've started DTrace, its pid provider won't catch the new ones. Then you end up doing tricks like tracking execs of the binary and restarting your DTrace script...