Monitoring demystified: A guide for logging, tracing, metrics

(techbeacon.com)

487 points · malechimp · 5y ago · 92 comments

65 comments · 12 top-level

KaiserPro5y ago· 19 in thread

A few things I have learnt along the way:

Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, it's far too late.

Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming (example: don't use access logs to collect 4xx/5xx and make a graph; collate and push the metrics directly)

Raw metrics are pretty useless. They need to be manipulated into business goals: "service x is producing 3% 5xx errors" vs "% of visitors unable to perform action x"

Alerts must be actionable.

Alert rules must be based on sensible, clear-cut rules: "service x's response time is breaching its SLA", not "service x's response time is double its average for this time in May".
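A minimal sketch of such a clear-cut rule, assuming a hypothetical SLA of "p95 response time under 500 ms over the evaluation window". The threshold, the function names and the nearest-rank percentile method are illustrative assumptions, not something stated in the thread.

```python
import math

SLA_P95_MS = 500.0  # hypothetical SLA threshold (assumption)


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def sla_breached(latencies_ms, threshold_ms=SLA_P95_MS):
    """Fire only when the window's p95 exceeds the agreed threshold."""
    return p95(latencies_ms) > threshold_ms


# One evaluation window of response times in milliseconds (made-up data).
window = [120, 130, 110, 140, 900, 125, 135, 115, 128, 132]
print(sla_breached(window))
```

The rule compares against a fixed, agreed number, so whoever gets paged can see exactly why it fired; a rule based on "double the seasonal average" would need historical context to interpret.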

twic5y ago

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

Yeah nah, but, okay, nah yeah.

Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.

Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.
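To see why a median of medians is a pseudo-statistic, here is a small sketch with made-up latency samples from two hypothetical hosts that handle very different numbers of requests:

```python
from statistics import median

# Per-host latency samples in ms (made-up numbers for illustration).
host_a = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # busy, fast host
host_b = [500, 510, 520]                       # quiet, slow host

# What you get if each host reports its own median and you aggregate again.
median_of_medians = median([median(host_a), median(host_b)])

# What you get from the raw samples.
true_median = median(host_a + host_b)

print(median_of_medians, true_median)  # 262.0 15.5
```

The double-aggregated value describes no request that actually happened, because the per-host medians discard how many samples each host contributed.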

If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.
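As a rough sketch of computing a metric after the fact from structured events: the JSON field names (`ts`, `event`, `status`) and the sample lines below are assumptions made up for illustration, not a schema from the thread.

```python
import json
from collections import Counter

# Hypothetical JSON-lines log events kept in a window of log history.
LOG_LINES = [
    '{"ts": "2020-05-01T10:00:05Z", "event": "http_request", "status": 200}',
    '{"ts": "2020-05-01T10:00:40Z", "event": "http_request", "status": 503}',
    '{"ts": "2020-05-01T10:00:59Z", "event": "http_request", "status": 500}',
    '{"ts": "2020-05-01T10:01:10Z", "event": "http_request", "status": 502}',
    '{"ts": "2020-05-01T10:01:15Z", "event": "cache_miss", "key": "user:42"}',
]


def error_counts_per_minute(lines):
    """Count 5xx responses per minute from structured request events."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("event") != "http_request":
            continue  # other event types share the stream but not this metric
        if 500 <= event["status"] <= 599:
            minute = event["ts"][:16]  # "YYYY-MM-DDTHH:MM"
            counts[minute] += 1
    return dict(counts)


print(error_counts_per_minute(LOG_LINES))
```

Because the raw events are retained, a new metric is just a new function over the same history; nothing in the application has to be redeployed.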

When i've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.

The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Anyway if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and i'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.

KaiserPro5y ago

> Unless you're very careful, it's also easy to end up double-aggregating,

Oh no, never do anything fancy on the client end. Yeah, that's total trash. Any client that does any kind of aggregating is a massive pain in the arse.

Counters are good enough for 90% of everything you want. You can turn counters into hits per second easily. Plus they are more resistant to time based averaging. If you do your stats correctly, you can even have resetting counters create nice smooth graphs (non-negative derivatives are a godsend)
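A small sketch of the non-negative derivative idea, assuming counter samples arrive as `(timestamp_seconds, value)` pairs; the function name and the data are illustrative:

```python
def non_negative_rate(samples):
    """Turn time-ordered (timestamp, counter) samples into per-second rates.

    Intervals where the counter went backwards (for example after a process
    restart reset it to zero) are skipped instead of yielding a negative rate.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0 or t1 <= t0:
            continue  # counter reset or bad timestamps: drop this interval
        rates.append((t1, delta / (t1 - t0)))
    return rates


# The counter resets between t=20 and t=30 (made-up data).
samples = [(0, 100), (10, 150), (20, 200), (30, 20), (40, 70)]
print(non_negative_rate(samples))  # [(10, 5.0), (20, 5.0), (40, 5.0)]
```

The client only ever increments a number; all the arithmetic happens on the server side at query time.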

> Dropwizard

Yes, this is a library that argues strongly against the use of metrics. From what I recall, 1 node of Cassandra will output close to 50,000 metrics by default. That is too much.

When a team I worked with were migrating away from splunk to graphite/grafana, they shat out something close to a million metrics. 99.8% were totally useless.

> You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Yes! I think that's my main objection. It's so bloody expensive to do post-hoc metrics. You can buy in Splunk, but that's horrifically expensive. Or you can use an open source version and lose 4 person-years before you even get a graph.

adamzegelin5y ago

> if you're using the Dropwizard Metrics library, for example, you've already lost.

Can you go into a bit more detail here? Curious to know where Dropwizard goes wrong.

I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.

therealdrag05y ago

The problem is post calculating is so slow. At least from my naive viewpoint. I can load dozens of graphs in datadog in seconds, can change tags or time frame and takes literally a second to load. Our Splunk dashboards can take over a minute to load, and reload for any change is more waiting.

sagichmal5y ago

You're optimizing for the wrong thing. The hard part about this space isn't extracting value from data, it's physically shipping the data through the infrastructure and into the relevant systems. Metrics are so great compared to logs precisely because they're precalculated (read: highly compressed) before leaving the originating service.

ex_amazon_sde5y ago

I miss the powerful metrics and logging systems that I used in Amazon.

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps

Absolutely not. Most application metric systems generate metrics as text strings with a simple format that is parsed by the metric collector.

This is what we also call a structured log. Parsing such text strings takes very little CPU.

All logs and metrics represent events. A good approach is to prefer numerical values where possible, but only for quantities that are comparable. Metrics are for the "how many?" question.

But never forget to log text events, because you need to answer the "what happened?" question.

Don't be afraid of generating too many different metrics but avoid too frequent datapoints and unnecessary verbosity in logs.

Never dump complex objects "just in case". Treat overlogging and underlogging as a bug.

Spend time every day in reviewing the metric dashboards and improve them constantly.

If it takes more than 10 seconds to add a new non-obvious chart (e.g. to calculate a ratio between 2 metrics or a percentile or other computation), throw away your charting system.

Lying with numbers is very easy: always look at distributions, not just instant values. Some metrics must be represented as percentiles and min/avg/max are meaningless.

Percentiles are good for ignoring meaningless outliers, but always count the outliers to ensure that you are not ignoring meaningful data. Especially during incidents.
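A short sketch of why the average misleads and why outliers should still be counted. The data, the threshold and the nearest-rank percentile method are assumptions chosen for illustration:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p is an integer from 1 to 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, outlier_threshold_ms):
    """Report avg and percentiles, plus how many samples exceed the threshold."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p90": percentile(latencies_ms, 90),
        "outliers": sum(1 for x in latencies_ms if x > outlier_threshold_ms),
    }


# 98 fast requests and 2 very slow ones (made-up data).
latencies = [20] * 98 + [5000] * 2
print(summarize(latencies, outlier_threshold_ms=1000))
```

Here the average (119.6 ms) matches no real request, the p50 and p90 (20 ms) describe the typical case, and the outlier count (2) keeps the slow requests visible instead of silently dropping them.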

Metrics and text logs tell a story together. Process, correlate and visualize them together as much as possible.

neolog5y ago

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

The other side is that I don't know what metrics I'll want until later.

When do you think it's better to pull metrics from structured logs vs generating metrics in app?

viraptor5y ago

You can go the https://www.honeycomb.io/ way and make structured logs your metrics. It will cost you a lot in storage, but simplifies a lot. Just throw properly structured logs into storage as long as you query them efficiently (which honeycomb provides)

awithrow5y ago

I think the only times it ever really makes sense to use logs to generate metrics are fairly limited:

1. You haven't yet instrumented the application with metrics.

2. The logs are from a third party tool that doesn't emit metrics

3. The log format is well defined and doesn't change (I'd still prefer native metrics)

Otherwise the issue is that logging messages can and do change over the lifetime of an application. Relying on the content of the log for metrics becomes an implicit API that's not obvious to developers working on the code. I've seen issues of broken monitoring and alerting because a refactor changed log formatting and content. Much better to be explicit about metrics and instrument them directly.

KaiserPro5y ago

Aha! that is the eternal question.

TL;DR:

almost never. structured logs are expensive in terms of infra, management and query time. Storing logs just in case is much more expensive at any kind of scale compared to metrics alone.

Long answer:

A lot of it depends on what the service/program is meant to be doing.

If we take a proxying web service router, for example, listening on example.com/*, we would want metrics to tell us how well it's doing for its specific job, and any upstream services.

So for each service URL we'd want at least a hit count for 2xx, 3xx, specific 4xx and 5xx return codes. We'd also want the time taken to process that request.

We'd also probably want to know the total number of active connections to the back end, and total clients connected. Memory and CPU usage would also be a given.

From that we could easily ascertain the health of upstream services, the performance, and total load (which is useful for autoscaling of either the service router, or the upstream apps)

I think it requires sitting down with a piece of paper and imagining your service/app breaking, and then working back to see how that would look. Once you've done that, you can figure out some counters to keep track of those things.
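The counters described above for the service router could be sketched like this. The class and method names are hypothetical, and a real service would export these values to a metrics backend rather than hold them in memory:

```python
from collections import Counter


class RouterMetrics:
    """Plain counters only; rates and averages are derived later, server side."""

    def __init__(self):
        self.hits = Counter()     # (route, status label) -> request count
        self.time_ms = Counter()  # route -> total processing time in ms

    def observe(self, route, status, elapsed_ms):
        # 2xx/3xx are grouped by class; 4xx/5xx keep their specific code.
        label = f"{status // 100}xx" if status < 400 else str(status)
        self.hits[(route, label)] += 1
        self.time_ms[route] += elapsed_ms


metrics = RouterMetrics()
metrics.observe("/cart", 200, 12)
metrics.observe("/cart", 204, 8)
metrics.observe("/cart", 404, 3)
metrics.observe("/cart", 503, 40)
metrics.observe("/login", 302, 5)
print(metrics.hits[("/cart", "2xx")], metrics.time_ms["/cart"])  # 2 63
```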

darkwater5y ago

On the rest I agree but on

> Raw metrics are pretty useless. They need to be manipulated into business goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x

I think in general the business goals metrics are OK but you still need to keep lower level metrics as well, otherwise it would be more difficult to pinpoint the exact failure, you will just know that a % of visitors is unable to perform action X. In a moderately-complex system a user-level action X is probably composed by several low-level services.

KaiserPro5y ago

I agree wholeheartedly.

I was trying to get across that just because you collect metrics it doesn't make them useful. I encourage people to generate metrics for everything, we can always join them together later to make something useful.

I think what I should have said is: "Collect metrics for everything, but be sure to display them in a way that's relevant to the customer"

gdcohen5y ago

Gavin from Zebrium here. We've found that if only you somehow knew what you were monitoring for in logs, they can be a great source of detecting (and then describing) the long tail of unknown/unknowns (failure modes with unknown symptoms and causes). Our approach is to be able to find these patterns in near real-time using ML. This blog by our CTO explains the tech with some good examples: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an....

ignoramous5y ago

> If you are searching through logs to _find_ a problem, it's far too late.

To be fair, this is addressed in the article which links to Netflix's blog on the topic and how they do so effectively at their scale: https://netflixtechblog.com/lessons-from-building-observabil...

msolujic5y ago

Agreed with most of what was said there. Still, I find that people mostly use the SLA as the only thing that matters for alerting and raising incidents. A lot has been said about the importance of defining solid SLIs (Service Level Indicators) that are aligned to SLOs (Service Level Objectives). SLAs are usually given to external users of a SaaS and are not very useful for an SRE team.

chasers5y ago

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money.

Only if you're using Elastic.

rob-olmos5y ago

I was going to try out Elastic APM as a self-hosted APM option. Would that be the same case of a waste of time, energy and money? TIA for any insights!

say_it_as_it_is5y ago

The difference in metrics seems to be proportional to the level of understanding about how the organization works.

dionian5y ago

sometimes log volume would be too high (high TX count)

although in most of the systems in my career it has not been the case.

buro95y ago· 15 in thread

A lot of excellent information in that blog post and linked from it... but if you're wondering where to start:

1. Write good logs... not too noisy when everything is running well, meaningful enough to let you know the key state or branch of code when things deviate from the good path. Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

2. Instrument metrics using Prometheus, there are libraries that make this easy: https://prometheus.io/docs/instrumenting/clientlibs/ . Counts get you started, but you probably want to think in aggregation and to ask about the rate of things and percentiles. Use histograms for this https://prometheus.io/docs/practices/histograms/ . Use labels to create a more complex picture, e.g. a histogram of HTTP request times with a label of HTTP method means you can see all reqs, just the POST, or maybe the HEAD, GET together, etc... and then create rates over time, percentiles, etc. Do think about cardinality of label values: HTTP methods are good, but request identifiers are bad in high traffic environments... labels should group, not identify.
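To make the histogram-with-labels idea concrete, here is a standard-library sketch of the data model. It is deliberately not the Prometheus client API; the class name, the bucket bounds and the single `method` label are assumptions for illustration.

```python
import bisect
from collections import defaultdict

BUCKETS_MS = [50, 100, 250, 500, 1000]  # upper bounds in ms (assumed values)


class LabelledHistogram:
    def __init__(self, buckets):
        self.buckets = list(buckets)
        # One row of bucket counts per label value; the extra last slot
        # holds observations above the largest bound.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))

    def observe(self, method, value_ms):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        idx = bisect.bisect_left(self.buckets, value_ms)
        self.counts[method][idx] += 1

    def cumulative(self, methods=None):
        """Cumulative bucket counts, summed over the chosen label values."""
        selected = list(self.counts) if methods is None else methods
        totals = [0] * (len(self.buckets) + 1)
        for method in selected:
            if method in self.counts:
                for i, count in enumerate(self.counts[method]):
                    totals[i] += count
        running, result = 0, []
        for count in totals:
            running += count
            result.append(running)
        return result


h = LabelledHistogram(BUCKETS_MS)
for method, ms in [("GET", 30), ("GET", 120), ("POST", 400),
                   ("POST", 2000), ("HEAD", 50)]:
    h.observe(method, ms)

print(h.cumulative())                # all requests
print(h.cumulative(["POST"]))        # just the POSTs
print(h.cumulative(["GET", "HEAD"])) # GET and HEAD together
```

Memory grows with the number of distinct label values times the number of buckets, which is why a label such as HTTP method (a handful of values) is cheap while a request identifier (one row per request) is not.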

Start with those things, tracing follows good logging and metrics as it takes a little more effort to instrument an entire system whereas logging and metrics are valuable even when only small parts of a system are instrumented.

Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki) https://grafana.com/products/cloud/ so you can see the results of your work immediately.

If it's a big project, you have a lot of options and I assume you know them already, this is when you start looking at Cortex and Thanos, Datadog and Loki, tracing with Jaeger.

dkersten5y ago

I’d add to no. 1 by saying include a correlation id or request id so you have a way to filter the logs into a single linear stream related to the same action.
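One way to attach a request id to every log line, sketched with Python's standard `logging` and `contextvars` modules. The logger name, the `request_id` field name and the log format are assumptions chosen for illustration:

```python
import contextvars
import logging
import uuid

# Holds the id of the request currently being handled in this context.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def make_logger(stream):
    logger = logging.getLogger("app.correlation_demo")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s request_id=%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, rid=None):
    # Reuse an incoming id if the caller supplied one, else mint a new one.
    token = request_id.set(rid or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        request_id.reset(token)
```

Filtering the output on `request_id=<value>` then yields the single linear stream for one action, without passing the id through every function signature.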

jrott5y ago

Absolutely, constantly being able to get a single linear stream is when tracing becomes super powerful.

Bombthecat5y ago

Also adding a context tag can be very powerful, especially in times of microservices and some kind of event-driven stuff like monthly payments.

Imagine: a user registers, the POST gets an id and the context "registering". Then he adds a credit card (a new id, context "credit payments"). After 14 days the bill goes out: same id, same context.

viraptor5y ago

> Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki)

I haven't looked at their pricing before, but for small-ish environments, their standard plan looks really good and simple. None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).

_i4zn5y ago

> None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).

I was thinking "God, this is exactly why I hate Datadog" as I was reading your description and got a great laugh when I reached the end. Their billing is absolutely byzantine.

I don't know that I've ever seen a company that had such a stark difference between great engineering/product and awful business/sales practices. Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill. Their sales teams are some of the worst I've dealt with, and I deal with a lot of vendors. They're starting to get a really bad reputation as well.

rudasn5y ago

Twice I fell into the trap of datadog, having paid more than $300 for a single month in each case.

The simplicity of it, dashboards, notebooks, logs etc, is what makes it so appealing though.

deepGem5y ago

From the post "The key, he says, is using the right transaction identifiers so that calls can be traced across components, services, and queues".

I think this is a key feature not many people implement, especially in today's world of overblown microservices. Having a transaction id from the time the request hits the reverse proxy until the database write is so helpful in debugging; it saves a ton of time.

KaiserPro5y ago

100% if you manage to get opentrace to work, it is a brilliant debug tool

sciurus5y ago

I agree with 2. I have a presentation at https://www.polibyte.com/2019/02/10/pytennessee-presentation... which goes into how to get started with Prometheus. Not as in "how to set it up", but more about what to instrument and why, how to name things, etc. Despite the title, there's very little in it that's specific to Python.

kostarelo5y ago

> Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

If you include all this information and the logs are not structured, you won't get much information out of them.
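For reference, emitting those same fields (timestamp, file, log level, function name, line number, message) as structured JSON takes only a small formatter with Python's standard `logging` module. The JSON key names below are an assumption, not a standard schema:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "file": record.filename,
            "func": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
```

Attaching `handler` to a logger keeps call sites unchanged (`logger.warning("charge failed for %s", order_id)`), while downstream tools can select on `level` or `func` without regex parsing.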

gnufx5y ago

Why should you have to use Prometheus? There are plenty of options, and good reasons why you might want to push data rather than pull. Measurements should minimize perturbation of the system being measured, and the (computer) system generating data is likely best placed to determine when and how, when that matters -- e.g. in HPC, where jitter is important.

chandraonline5y ago

I would add that if you add a traceId to #1 and use something like https://www.jaegertracing.io/. You get even more.

ysoft5y ago

I don't think loki is production ready. Needs more work. It's going to be great with grafana though

yellow_mixer5y ago

What else do you think it needs? I’ve been playing around with v1.5 the past few days in Grafana and I like its simplicity. Grafana now has a Loki datasource which is nice.

gdcohen5y ago

Gavin from Zebrium here. Completely concur with #1. We are big advocates of writing good logs and not having to worry about structured vs unstructured (and even if you structure your logs, you'll still probably have to deal with unstructured logs in third party components).

Our approach to deal with logs is to use ML to structure them after the fact (and we can deal with changing log structures). You can read about it in a couple of our blogs like: https://www.zebrium.com/blog/using-ml-to-auto-learn-changing... and https://www.zebrium.com/blog/please-dont-make-me-structure-l....

say_it_as_it_is5y ago· 8 in thread

Is there an open source solution for processing streams of structured and unstructured logs and routing them onward? I see solutions for moving logs to elastic or Kafka but nothing for evaluating the log.

ekimekim5y ago

This is a problem that is both solved again and again, but also all the available solutions are bad.

In my experience what happens is:

1. you start with a "ship logs from X to Y" product

2. you add more sources and more destinations, making it more of a central router. you add config options for specifying your sources and dests.

3. since the way you checkpoint or consume or pull or push certain sources or dests doesn't generalize, you end up buffering internally to present a unified "I have received / sent this message successfully" concept to your inputs and outputs.

4. you want to do some basic transforms on the logs as you go. you implement "filters" or "transforms" or "steps" and make them configurable. your config now describes a graph of sources -> filters -> dests

5. your filters need to be more flexible. you add generic filters whose behaviour is mostly controlled by their config options. your configs grow more complicated as you use multiple layers of differently-configured filters

6. you have a bad turing complete programming language embedded in your config file. getting simple tasks done is possible, getting complex tasks done becomes an awful, inefficient and unreadable mess.

My solution to this cycle has been to just write simple hard-coded applications that can only do the job I need them to do. If they need a different configuration later I edit the source. I'm writing my transforms in a real programming language and I avoid the additional complexity of abstractions. Of course, that comes with its own costs but I consider it well worth it.
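A sketch of what such a hard-coded router might look like. The destinations (`alerts`, `archive`, `unparsed`) and the `level` field are made-up assumptions; the point is that the rules live in ordinary code rather than in a config language:

```python
import json


def route(line):
    """Return (destination, event) for one raw log line, or None to drop it."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        event = None
    if not isinstance(event, dict):
        # Unstructured or unexpected input is kept, not silently lost.
        return ("unparsed", {"raw": line.rstrip("\n")})
    if event.get("level") == "DEBUG":
        return None  # hard-coded rule: debug noise is dropped
    if event.get("level") in ("ERROR", "CRITICAL"):
        return ("alerts", event)
    return ("archive", event)


def process(lines):
    """Group routed events by destination; a real tool would ship them."""
    out = {}
    for line in lines:
        routed = route(line)
        if routed is None:
            continue
        dest, event = routed
        out.setdefault(dest, []).append(event)
    return out
```

Changing a rule means editing `route` and redeploying, which is the trade-off the comment accepts in exchange for avoiding a configurable filter graph.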

onefuncman5y ago

The "OG" in the space is collectd, which is still my favorite choice if you are responsible at the operating system level: https://collectd.org/wiki/index.php/Chains

https://github.com/elastic/logstash was one of the first modern approaches. I started using it less the more often I ran into JRuby related bugs.

https://github.com/trivago/gollum is my pick from the golang ecosystem.

There are many more variants depending on how much complexity you are trying to apply. If you need to apply machine learning models, for example, you're probably going to end up with something similar to Apache Storm, though I don't know if its operational story has improved enough to consider it over other alternatives. I lost track years ago between Apache Spark and the half dozen other stream processing projects.

dig15y ago

Riemann [1]. You can create custom endpoint for accepting almost any kind of messages, logs or data.

[1] http://riemann.io/

EricE5y ago

>Is there an open source solution for processing streams of structured and unstructured logs and routing them onward?

https://securityonion.net

It doesn't route them onward - it will collect, aggregate and provide you the tools to correlate/analyze logs across your environment. Enable the built in network monitoring tools too and you have not only a powerful tool to help you with application management, but security as well (hence its namesake).

Beware - in peeling back the layers of your environment you can really get sucked in. I never seem to have enough hardware to do what I want with SO but it's pretty amazing what you can do with it.

EDIT - wow, I'm a little shocked that no one else has brought Security Onion up. I guess they need to up their advertising game!

gregwebs5y ago

You can actually do this log manipulation in fluent-bit (you can write Lua if you need to) although the forwarding cannot be routed to different locations.

a10c5y ago

https://vector.dev/ sounds pretty close

malechimpOP5y ago

Maybe you're looking something like this https://docs.tremor.rs/

ysoft5y ago

I haven't found anything, we are moving to hosted Humio really soon. It uses kafka

secondcoming5y ago· 5 in thread

We log extensively. Here are some of my thoughts on it:

- at least in C++, the requirement to be able to log from pretty much anywhere can lead to messy code that either passes a reference to your logger to all classes that might possibly need it, or you've got an extern global somewhere. Yuck.

- logging can enable laziness. Being able to log that something weird happened can be considered a sufficient substitute for proper testing.

- logs are only as useful as the info they contain. This can mean state needs to be passed around all over the place just so that it can all be eventually logged on one line (it saves your data team from having to do a 'join')

- if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.

sagichmal5y ago

> if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.

Few applications should be logging to disk directly. Services running under systemd or any modern orchestration platform should log to stdout/stderr and let the system manage the stream.

viraptor5y ago

I'd disagree with 2 and 4.

2. Given a large enough system you will encounter situations where the only action you can take is to log "this really shouldn't happen" and try to roll back as cleanly as possible. This may be due to either complexity or a bug manifesting in a layer completely different than where it occurred (I've seen a null reference crash on "if(foo) foo->bar();" in the past)

4. I believe loggers should ideally know as little as possible about your logs. Logs can be rotated externally, can be buffered and sent to other hosts without touching the disk, can be ignored. Ideally, the system should care, not the app.

mmkos5y ago

> I've seen a null reference crash on "if(foo) foo->bar();" in the past

References can't be null. Regardless, that's a valid check for a null pointer and I don't think what you wrote is at all possible (unless maybe in some multithreaded scenario?).

gnufx5y ago

> the requirement to be able to log from pretty much anywhere can lead to messy code

Ah, Milewski's example of insight from the supposedly useless mathematical stuff: https://bartoszmilewski.com/2014/12/23/kleisli-categories/ (and the corresponding lecture video).

bradstewart5y ago

Expanding on your second point, logging is also not a substitute for proper error handling.

kasey_junk5y ago· 3 in thread

It’s weird to see the stuff by Jay Kreps (of Kafka ~fame~) listed in the logs section. His writing is specifically _not_ about logs the observability tool, but logs the data structure such as you’d see at the heart of a database.

aloknnikhil5y ago

No. The original Kafka paper does talk about logs in the observability sense as a premise to solve the aggregation problem.

https://cs.uwaterloo.ca/~ssalihog/courses/papers/netdb11-fin...

> There is a large amount of “log” data generated at any sizable internet company. This data typically includes (1) user activity events corresponding to logins, pageviews, clicks, “likes”, sharing, comments, and search queries; (2) operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine. Log data has long been a component of analytics used to track user engagement, system utilization, and other metrics.

> We have built a novel messaging system for log processing called Kafka [18] that combines the benefits of traditional log aggregators and messaging systems....Kafka provides an API similar to a messaging system and allows applications to consume log events in real time.

kasey_junk5y ago

A quote from the LinkedIn blog post linked in the article:

“But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this "application logging". The application log is a degenerative form of the log concept I am describing”

rollulus5y ago

Very true. Jay Kreps's log is completely unrelated to the topic of this article. This added to my feeling that this "guide" is rather a collection of fragments put together without a real understanding of the subject from the author.

waihtis5y ago· 3 in thread

> Logging is critical to detecting attacks and intrusions.

Yes, but not universally - and just collecting logs will not take you far. Logging everything and trying to approach security via the ’collect all data’ is both expensive and inaccurate, and one of the major inefficiencies in modern cyber.

onefuncman5y ago

This is done efficiently at scale by both Cylance and Crowdstrike, but is certainly only one part of a defense in depth strategy.

There are viable products around human threat hunting which would be impossible without a 'collect all the data' component.

waihtis5y ago

You are correct, and this is the key part - what % of organizations have money, skills and people to build a robust enough capability around threat hunting, for example?

I’ve been super lucky to meet various orgs and their security in all geographies and many industries and my gut feeling is 1 out of 10 teams.

EricE5y ago

Security Onion does an amazing job at collecting and correlating, especially for an open source product. The traditional trade-off with Open Source is there - a bit of up front effort for longer term value.

dig15y ago

The Art of Monitoring [1] covers most of this stuff in a unified manner.

You are introduced to some basics (push vs. pull monitoring), then it proceeds with simple system metrics collection (cpu, memory) via collectd, then goes on to logs ingestion and ends up extracting application-specific metrics from jvm and python applications.

I highly recommend it, even for seasoned professionals.

[1] https://artofmonitoring.com/

gnufx5y ago

I never see an important system management principle brought up: If you get a user complaint (for some value of "user") and not an alert, you should fix the monitoring system so that you don't get another occurrence of it or related problems. Obviously that's within reason, depending on the circumstances; the effort might not be worth it.

FrontAid5y ago

Recently, I was searching for a service which offers those functionalities on a very basic level. I tried several options and was really disappointed with all of them. The only one that I found to be usable was https://logdna.com/. I've now been using it for a couple of weeks and it works OK. It offers logging, alerts, metrics/dashboards, and some other things. And all that at a reasonable price.

xondono5y ago

Am I the only one that can’t reach the “save and exit” privacy button on mobile?

It’s hard for me to think that this is not intentional when the “Accept all” is usable but the alternative isn’t...

notmalc5y ago

Nice

anderspitman5y ago

If you don't need all the fancy metrics, and just want something simple to keep an eye on your services, alert you if they fail, and automatically restart them, check out my stealthcheck service. It's all of 150 lines of free range, 0-dependency go:

https://github.com/anderspitman/stealthcheck
