Specifically, how to make the sum of all monitored "pillars" more useful than each of them individually.
The three major pillars being:
- Metrics (whether application or higher-level of the stack, like OS)
- Logs (whether structured or unstructured)
- Traces
Observability is these major pillars combined, plus the ability to easily "jump" from one to another to very quickly identify the root cause of an issue, i.e. going Metrics <-> Logs, Logs <-> Traces, or Metrics <-> Traces.
For instance, with good Metrics, one can easily spot, and get alerted on, a large spike of 500 errors. But when Metrics & Logs work together, one can also easily see the exceptions and stack traces emitted alongside those 500 errors.
Similarly, with good Metrics, one can easily figure out that the frontend service's p90 latency has increased by 5x. But with Metrics & Traces working together (for instance via Exemplars[1]), one can look at a handful of the very-high-latency traces and identify the upstream service responsible for the increase.
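To make the Metrics <-> Logs jump concrete, here is a minimal stdlib-only sketch (the names `record_500`, `error_counts`, and the JSON fields are all made up for illustration, not any particular vendor's API): the trick is simply tagging both the metric event and the log line with a shared trace ID, so a spike on a dashboard can be pivoted into the exact log lines behind it.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("frontend")

# Stand-in for a real metrics client: (route, status) -> count.
error_counts = {}

def record_500(route: str, exc: Exception) -> str:
    """Record one 500 error on both the metrics side and the logs side."""
    trace_id = uuid.uuid4().hex
    # Metric side: bump a counter labeled by route and status code.
    error_counts[(route, 500)] = error_counts.get((route, 500), 0) + 1
    # Log side: structured JSON carrying the same trace_id, so the
    # dashboard spike and the stack trace can be joined on it.
    log.error(json.dumps({
        "trace_id": trace_id,
        "route": route,
        "status": 500,
        "error": repr(exc),
    }))
    return trace_id

tid = record_500("/checkout", ValueError("bad cart state"))
```

A real setup would use a metrics client with exemplar support instead of a dict, but the correlation key idea is the same.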
With Monitoring only, you could get a nice Metrics solution in place, with fancy alerting rules, but all it was good for was informing you that "Something bad is currently happening". With a good "Observability" setup, that should become "Something bad is currently happening, and the root cause is right here."
[1] https://grafana.com/docs/grafana/latest/basics/exemplars/
You cannot compare the two. Yet we do in certain circumstances and that is a loss in understanding, which I am sad about.
Create a great vendor-agnostic open source tech. Get everyone riled up about the dangers of vendor-locking solutions. Use the new tech to carve yourself a piece of the market from the current incumbent.
It is pretty great and all, but sometimes it is easier to build your app with a simple vendor-locked tech than a super generic agnostic technology.
It’s kinda important to understand who all this is meant for. If you’re a lean startup, just use the best/cheapest/quickest tool regardless of vendor lock-in. It’s when you get to a certain scale that vendor agnosticism becomes a real concern, but by then you probably have enough resources to hire folks that will rebuild your stack.
That's golden, for sure. Until the product is bought out, or two companies merge, or you name it. And then you end up with a pile of products with different vendor-locked log formats and metrics. At that point you'd like some standardization, and OpenTelemetry is a perfect candidate for common ground. Thus support for OpenTelemetry becomes a major decision factor when selecting a vendor or OSS solution for your problem, doesn't it?
It was called monitoring.
Monitoring to me is exclusively about metrics and alerts. Metrics are really useful, but they often don’t give you the whole context and might sometimes be misleading. E.g. you see a spike in CPU usage for a service; you probably just autoscale and call it a day, and that’s the end of it. Having metrics is SO much better than not having metrics though; it delivered insights that were just not possible before.
Observability to me is the next iteration of this process of understanding system behavior. Metrics are limiting, so maybe you look at logs. Well, they suffer from some of the same issues, so you try profiling, you try tracing. Ultimately the goal is to explore tools that allow developers to quickly get a truthful understanding of how their systems really work, and use that knowledge to improve their systems.
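As a toy illustration of the tracing step in that progression, here is a stdlib-only span recorder (names and structure are invented for this sketch; a real system would use something like OpenTelemetry): each span records its parent, so nested calls form a tree you can later search for the slow upstream hop.

```python
import time
import uuid
from contextlib import contextmanager

spans = []   # finished spans, appended innermost-first
_stack = []  # ids of currently open spans

@contextmanager
def span(name: str):
    """Record a timed span, linked to whatever span is currently open."""
    sid = uuid.uuid4().hex[:16]
    parent = _stack[-1] if _stack else None
    _stack.append(sid)
    start = time.perf_counter()
    try:
        yield sid
    finally:
        _stack.pop()
        spans.append({
            "id": sid,
            "parent": parent,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

with span("frontend.request"):
    with span("backend.query"):
        time.sleep(0.01)  # pretend the upstream call is slow

# Walking the tree by duration points at where the time actually went.
slowest = max(spans, key=lambda s: s["duration_ms"])
```

The point isn't the ten lines of bookkeeping; it's that the parent links give you causality, which metrics alone can't.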
If it is a critical process, use detailed process logs.
If it is an extremely critical process add Auditing.
Regular logs (text files) > Detailed Process Logs (DB) > Audits (read-only/tamper-proof)
What's the difference between Regular Logs, Detailed Process Logs, and Audits? Are Audits more compliance-related, while detailed logs are business-only?
It's not clear to me why to have a special detailed process log if you can write important data to Audits and less important data to regular logs, i.e. one can use different log levels for business data vs. engineering/debug data.
Can you please give an example?
Process logs are business processes that you need to report on or answer for later.
Audits are another business layer for safety- or life-critical data, or adversarial data (handling money-based transactions). They could be compliance-related. Auditing is a three-step process: ask a question, figure out how to answer it, and make sure you've answered it.
So specifically, if you have a magic bean shop and you want to log transfer of the beans, that would be a process log. Then at the end of the day, you would send a manifest of all the beans sent that day. The audit process would look at the final location of the beans, and also at the process log to answer your audit questions.
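One common way the "read-only/tamper-proof" property of an audit layer is implemented (an assumption here, not something the comment above prescribes) is a hash chain: each entry includes a hash of the previous entry, so any after-the-fact edit to the bean manifest breaks the chain. A minimal stdlib sketch:

```python
import hashlib
import json

GENESIS = "0" * 64  # hash placeholder for the first entry

def append_entry(chain: list, record: dict) -> None:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"record": record, "prev": prev_hash, "hash": entry_hash})

def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

audit = []
append_entry(audit, {"event": "beans_shipped", "count": 3})
append_entry(audit, {"event": "beans_received", "count": 3})
assert verify(audit)
audit[0]["record"]["count"] = 300  # tampering with the manifest...
assert not verify(audit)           # ...is detected
```

Production audit trails would add signing, timestamps, and write-once storage, but the chain is what makes the log answerable rather than merely readable.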
---
I think a lot of people make a hard distinction between business logging and engineering logging, and I'm not sure that makes sense.