Fast and flexible observability with canonical log lines (opens in new tab)

(stripe.com)

221 pointsmglukhovsky6y ago39 comments

39 comments

35 comments · 12 top-level

Great article! I always love hearing Stripe talking about their internals.

I've been using this practice and I agree that it's incredibly useful. I think because people tend to think in terms of "logs", they end up overlooking the much more useful construct of "canonical logs". Many fine-grained logs themselves are almost always less useful than the fewer fully-described canonical logs. Other observability tools often call these "events" instead of "logs" for that reason.

There's a tool call Honeycomb [1] that gives you exactly what this article's talking about in a really nicely designed package out of the box. And since it handles all of the ingestion and visualization, you don't have to worry about setting up Kafka, or the performance of logplexes, or teaching everyone SQL, or how to get nice graphs. I was a little skeptical at first, but after using it for over a year now I'm completely converted.

If you record fully-described "events" for each request, and you use sub-spans for the smaller segments of requests, you also get a waterfall-style trace visualization. Which eliminates the last need for fine-grained logs completely.

If this article seems interesting to you, I'd highly, highly recommend Honeycomb. (Completely unaffiliated, I just think it's a great product.)

[1]: https://www.honeycomb.io/

twic6y ago

Honeycomb's CEO on these "wide events":

> The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop.

> We're talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.

https://twitter.com/mipsytipsy/status/1042817542648082432

spimmy6y ago

^ CTO now, but thanks :D

jsmeaton6y ago

Thanks for talking about Honeycomb, I’ve been looking out for peoples experiences as it looks like quite a nice tool.

We’ve just moved over to structured event based logging, but we’re still using tools like Sumo, newrelic, etc and it’s not all nicely tied together.

I’ve been looking at Honeycomb to either replace or augment our existing tools but had little to go on besides the marketing pages.

lizthegrey6y ago

Honeybee here. Feel free to just try it, there's a 14 day free trial, and a free community edition for small amounts of data :) experiment away, and our community slack is super friendly!

1 more reply

viraptor6y ago

Yes, honeycomb is great. It's one of those "I wish I had more bigger projects, just so I could use this more" services. Other APMs / logging systems are just not really comparable.

It does cost a bit more though :-)

ucarion6y ago

I am wondering how things like OpenTracing-esque spans and sub-spans fit into the format Stripe describes. Are they just logged as `subspan1`, `subspan2`, `subspan3` in the log format?

It seems like that works, but I'm also unclear if maybe each sub-span is better off as its own log line? But that carries its own problems.

RhodesianHunter6y ago

How are these "events" different from sending this data to a metrics aggregator?

lizthegrey6y ago

Because the events are correlated with each other. When you send individual metrics, the fact that you incremented A and B together isn't visible.

firethief6y ago· 6 in thread

It's interesting that they've found denormalizing their log data so useful. I'm suprised to hear that that performs better for practical queries than a database with appropriate indexes, and that they've been able to build more ergonomic interfaces to query that than the standard relational approach a lot of people already have experience with. But I don't know much about log management at scale, so I'm only mildly surprised.

shawnz6y ago

Denormalization typically improves performance. Normalization isn't done for performance reasons but for consistency reasons, i.e. so that data isn't duplicated and there is only one source of truth

brandur6y ago

Yes, exactly — normalization is really useful for reasons of quality and correctness, but generally not so important for data like logs that's rotating through the system on a pretty constant basis.

And addressing the parent's point on databases: they don't look like an RDMS, but you can kind of think of log management/querying systems like Splunk et al. to be like a specialized database with specific properties:

- Flexible indexing: Logs change frequently which makes keys come and go, so it's convenient not to not have to be constantly building new indexes to make them searchable.

- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so it's generally a good idea for these systems will rotate data through different tiers of storage as they age — the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.

- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained — it scales horizontally much more easily because it doesn't have have to concern itself with things like consensus around write consistency.)

1 more reply

manigandham6y ago

Logs can be treated as database rows regardless of source format (plaintext, csv, JSON, etc). The modern approach for dealing with large scale tables is column-oriented storage and databases which can easily handle billions of log lines without indexes by using ordering, partition maps, compression, etc.

rco87866y ago

As others mentioned, normalization is generally about saving space not increasing performance.

That said I’m 100% positive that all those key value pairs are indexed for searching and querying purposes.

RhodesianHunter6y ago

Not so much indexed, just columnar.

rodgerd6y ago

It's also about DRY and accurately modelling the data for general operations. Imagine the difference between a data environment where a GDPR deletion request comes in, and everything has a relation back to customer identity, and one where the customer identity is denormalised out to many places or only implicitly present.

sethammons6y ago· 3 in thread

This is not unlike what we've been doing for years. We generate billions of log lines like this daily as json and inspect them with splunk. By having consistent values across log lines, we can query and do neat things. "What was our system timing in relation to users who have feature x?" "What correlations can we find between users whose requests took too long and were not throttled? -> ah, 99% of those requests show $correlation_in_other_kv_pair!"

brandur6y ago

(I'm the author, and) Yeah, whatever you might call them, canonical lines are an "obvious" enough idea that I'd expect a lot of shops to have arrived at them independently. Besides yourself, I've heard from a number of people where that's been the case.

That said, it's also a surprisingly non-obvious idea in many respects — a lot of people are used to just traditional trace-style logging and never come up with a construct like them, so we felt they were worth calling out as something that might be worth doing.

pmm6y ago

I feel that logs are around for so long that it's easy to take their capabilities for granted and not go much further. This is another example there's more that can be done. It reminds me rfc5424

At LogSense.com we actually tackled this problem too and came with automatic pattern discovery that pretty much converts all logs into structured data. I actually just posted it here: https://news.ycombinator.com/item?id=20569879 I am really curious if this is something that you consider helpful and any feedback is very welcome

sethammons6y ago

Oh, for sure. A lot of folks are doing really interesting things that others could learn from and they don't stop and write something up that helps the industry grow that much more. I'm glad y'all wrote this up; I might do a similar one!

msoad6y ago· 3 in thread

is there any legal restriction of how long you can keep internal systems logs? if it's done right they don't contain PIIs but they _can_ be used to track people if you have enough logs.

rco87866y ago

Not to my knowledge. At least not in the US. CCPA (coming into effect Jan 1) does give users the right to be deleted, which presents some challenges with this sort of data but nothing insurmountable.

shawnz6y ago

I'm not sure if it's due to a legal requirement or not, but at my workplace (a university in Canada) we are required to keep all log files we produce in prod for 7 years.

traskjd6y ago

There's some vague stuff. For example, GDPR requires you only keep data for a "reasonable" period of time. So many many years would likely not be reasonable in most logging scenarios.

I suspect as a payment processor though, being able to look back far when investigating breeches etc would be important.

manigandham6y ago· 2 in thread

Strange that they went with plain text when the industry is converging on (newline delimited) JSON logs for structured data. This also serves as the backbone of observability with metrics and tracing also being folded into and output as JSON.

Call them events and you can claim all the event-sourcing buzzwords too.

brandur6y ago

I wouldn't put too emphasis on the plain text — we started logging back when carrying everything via JSON would've been going against the grain. These days it might've gone the other way (I'm not sure).

One point that I'd try to convey is that the canonical line technique works for any kind of structured format. We use logfmt in all our examples, but JSON would work just as well.

spearo776y ago

I searched for logfmt and found your website – https://brandur.org/logfmt

What I can't find described for logfmt is how to handle escaping special characters in the keys or values. Specifically the quotes and equals.

JSON defines this very well, but Splunk simple key=value logging formats suffer from escaping and parsing problems.

1 more reply

thomas5366y ago· 1 in thread

What's the primary use case for this? I almost always only look at logs to debug things; very rarely to perform some sort of event math/analysis.

hinkley6y ago

How many requests per second are you dealing with? Scale changes what you look at.

chrisweekly6y ago

Related tangent: I can't say enough good things about [lnav](https://lnav.org). It's like a mini-ETL powertool at your fingertips, w/ an embedded SQLite db and a terrific API. As of mid-2016 when I first used it, querying logs was extremely easy, and reasonably fast (w/ up to several million rows). Highest recommendation.

Disclaimer: I have no affiliation w/ the project or its maintainer -- but out of gratitude I mention it pretty much every time it's appropriate.

osswid6y ago

We've been using logging like this but with jsonl lines. Still easy to grep as straight text, but very handy to be able to parse with jq or other tools and be able to have rich values (or even substructures) as part of the log lines.

edsiper26y ago

Log structure is really important, from the examples provided I would suggest the same approach can be used using a full 'logfmt' style, so timestamp and the event type can be set as keys, e.g:

  ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123

the main difference is that you make easier the parsing since many tools can parse lOgfmt without problems.

One interesting use-case here for 'me' is the ability to perform queries in a schema-less fashion and I will do a quick speech on what we are working on Fluent Bit[0] (open source log project), pretty much the ability to query your data while is still in motion (stream processing on the edge[1]). Consider the following data samples in a log file:

  ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
  ts="2019-03-18 22:48:32.991" event="User authenticated" auth_type=api_key key_id=mk_123 user_id=usr_123
  ts="2019-03-18 22:48:32.992" event="Rate limiting ran" rate_allowed=true rate_quota=100 rate_remaining=99
  ts="2019-03-18 22:48:32.998" event="Charge created" charge_id=ch_123 permissions_used=account_write team=acquiring
  ts="2019-03-18 22:48:32.999" event="Request finished" alloc_count=9123 database_queries=34 duration=0.009 http_status=200

so if I wanted to retrieve all events associated for user 123 I would process the file as follows:

  $ fluent-bit -R conf/parsers.conf \
               -i tail -p alias=data -p path=canonical.log -p parser=logfmt \
               -T "SELECT * FROM STREAM:data WHERE user_id='usr_123';" 
               -o null -f 1

the output is:

  [1552949312.991000, {"event"=>"User authenticated", "auth_type"=>"api_key", "key_id"=>"mk_123", "user_id"=>"usr_123"}]

the results are in a raw mode but can be exported to stdout in json, to elasticsearch, kafka or any output destination supported.

One of the great things of the stream processor engine is that you can create new streams of data based on results, use windows of time (tumbling) for aggregation queries and such.

[0] https://fluentbit.io

[1] https://docs.fluentbit.io/stream-processing

perq6y ago

Very similar to what logsense.com does with logs except you don't need to have canonical log line as patterns are found automatically.

winrid6y ago

This kind of debugging is one reason I loved Sumologic's Join query feature.

chasers6y ago

We do exactly this with Logflare and send stuff to BigQuery.

j / k navigate · click thread line to collapse

39 comments

35 comments · 12 top-level

ianstormtaylor6y ago· 8 in thread

Great article! I always love hearing Stripe talking about their internals.

If this article seems interesting to you, I'd highly, highly recommend Honeycomb. (Completely unaffiliated, I just think it's a great product.)

[1]: https://www.honeycomb.io/

twic6y ago

Honeycomb's CEO on these "wide events":

> The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop.

> We're talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.

https://twitter.com/mipsytipsy/status/1042817542648082432

spimmy6y ago

^ CTO now, but thanks :D

jsmeaton6y ago

Thanks for talking about Honeycomb, I’ve been looking out for peoples experiences as it looks like quite a nice tool.

We’ve just moved over to structured event based logging, but we’re still using tools like Sumo, newrelic, etc and it’s not all nicely tied together.

I’ve been looking at Honeycomb to either replace or augment our existing tools but had little to go on besides the marketing pages.

lizthegrey6y ago

Honeybee here. Feel free to just try it, there's a 14 day free trial, and a free community edition for small amounts of data :) experiment away, and our community slack is super friendly!

1 more reply

viraptor6y ago

Yes, honeycomb is great. It's one of those "I wish I had more bigger projects, just so I could use this more" services. Other APMs / logging systems are just not really comparable.

It does cost a bit more though :-)

ucarion6y ago

I am wondering how things like OpenTracing-esque spans and sub-spans fit into the format Stripe describes. Are they just logged as `subspan1`, `subspan2`, `subspan3` in the log format?

It seems like that works, but I'm also unclear if maybe each sub-span is better off as its own log line? But that carries its own problems.

RhodesianHunter6y ago

How are these "events" different from sending this data to a metrics aggregator?

lizthegrey6y ago

Because the events are correlated with each other. When you send individual metrics, the fact that you incremented A and B together isn't visible.

firethief6y ago· 6 in thread

shawnz6y ago

Denormalization typically improves performance. Normalization isn't done for performance reasons but for consistency reasons, i.e. so that data isn't duplicated and there is only one source of truth

brandur6y ago

- Flexible indexing: Logs change frequently which makes keys come and go, so it's convenient not to not have to be constantly building new indexes to make them searchable.

1 more reply

manigandham6y ago

rco87866y ago

As others mentioned, normalization is generally about saving space not increasing performance.

That said I’m 100% positive that all those key value pairs are indexed for searching and querying purposes.

RhodesianHunter6y ago

Not so much indexed, just columnar.

rodgerd6y ago

sethammons6y ago· 3 in thread

brandur6y ago

pmm6y ago

I feel that logs are around for so long that it's easy to take their capabilities for granted and not go much further. This is another example there's more that can be done. It reminds me rfc5424

sethammons6y ago

msoad6y ago· 3 in thread

is there any legal restriction of how long you can keep internal systems logs? if it's done right they don't contain PIIs but they _can_ be used to track people if you have enough logs.

rco87866y ago

Not to my knowledge. At least not in the US. CCPA (coming into effect Jan 1) does give users the right to be deleted, which presents some challenges with this sort of data but nothing insurmountable.

shawnz6y ago

I'm not sure if it's due to a legal requirement or not, but at my workplace (a university in Canada) we are required to keep all log files we produce in prod for 7 years.

traskjd6y ago

There's some vague stuff. For example, GDPR requires you only keep data for a "reasonable" period of time. So many many years would likely not be reasonable in most logging scenarios.

I suspect as a payment processor though, being able to look back far when investigating breeches etc would be important.

manigandham6y ago· 2 in thread

Call them events and you can claim all the event-sourcing buzzwords too.

brandur6y ago

One point that I'd try to convey is that the canonical line technique works for any kind of structured format. We use logfmt in all our examples, but JSON would work just as well.

spearo776y ago

I searched for logfmt and found your website – https://brandur.org/logfmt

What I can't find described for logfmt is how to handle escaping special characters in the keys or values. Specifically the quotes and equals.

JSON defines this very well, but Splunk simple key=value logging formats suffer from escaping and parsing problems.

1 more reply

thomas5366y ago· 1 in thread

What's the primary use case for this? I almost always only look at logs to debug things; very rarely to perform some sort of event math/analysis.

hinkley6y ago

How many requests per second are you dealing with? Scale changes what you look at.

chrisweekly6y ago

Disclaimer: I have no affiliation w/ the project or its maintainer -- but out of gratitude I mention it pretty much every time it's appropriate.

osswid6y ago

edsiper26y ago

Log structure is really important, from the examples provided I would suggest the same approach can be used using a full 'logfmt' style, so timestamp and the event type can be set as keys, e.g:

  ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123

the main difference is that you make easier the parsing since many tools can parse lOgfmt without problems.

  ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
  ts="2019-03-18 22:48:32.991" event="User authenticated" auth_type=api_key key_id=mk_123 user_id=usr_123
  ts="2019-03-18 22:48:32.992" event="Rate limiting ran" rate_allowed=true rate_quota=100 rate_remaining=99
  ts="2019-03-18 22:48:32.998" event="Charge created" charge_id=ch_123 permissions_used=account_write team=acquiring
  ts="2019-03-18 22:48:32.999" event="Request finished" alloc_count=9123 database_queries=34 duration=0.009 http_status=200

so if I wanted to retrieve all events associated for user 123 I would process the file as follows:

  $ fluent-bit -R conf/parsers.conf \
               -i tail -p alias=data -p path=canonical.log -p parser=logfmt \
               -T "SELECT * FROM STREAM:data WHERE user_id='usr_123';" 
               -o null -f 1

the output is:

  [1552949312.991000, {"event"=>"User authenticated", "auth_type"=>"api_key", "key_id"=>"mk_123", "user_id"=>"usr_123"}]

the results are in a raw mode but can be exported to stdout in json, to elasticsearch, kafka or any output destination supported.

One of the great things of the stream processor engine is that you can create new streams of data based on results, use windows of time (tumbling) for aggregation queries and such.

[0] https://fluentbit.io

[1] https://docs.fluentbit.io/stream-processing

perq6y ago

Very similar to what logsense.com does with logs except you don't need to have canonical log line as patterns are found automatically.

winrid6y ago

This kind of debugging is one reason I loved Sumologic's Join query feature.

chasers6y ago

We do exactly this with Logflare and send stuff to BigQuery.

j / k navigate · click thread line to collapse