I've been using this practice and I agree that it's incredibly useful. I think because people tend to think in terms of "logs", they end up overlooking the much more useful construct of "canonical logs". Many fine-grained logs themselves are almost always less useful than the fewer fully-described canonical logs. Other observability tools often call these "events" instead of "logs" for that reason.
There's a tool call Honeycomb [1] that gives you exactly what this article's talking about in a really nicely designed package out of the box. And since it handles all of the ingestion and visualization, you don't have to worry about setting up Kafka, or the performance of logplexes, or teaching everyone SQL, or how to get nice graphs. I was a little skeptical at first, but after using it for over a year now I'm completely converted.
If you record fully-described "events" for each request, and you use sub-spans for the smaller segments of requests, you also get a waterfall-style trace visualization. Which eliminates the last need for fine-grained logs completely.
If this article seems interesting to you, I'd highly, highly recommend Honeycomb. (Completely unaffiliated, I just think it's a great product.)
> The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop.
> We're talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.
We’ve just moved over to structured event based logging, but we’re still using tools like Sumo, newrelic, etc and it’s not all nicely tied together.
I’ve been looking at Honeycomb to either replace or augment our existing tools but had little to go on besides the marketing pages.
It does cost a bit more though :-)
It seems like that works, but I'm also unclear if maybe each sub-span is better off as its own log line? But that carries its own problems.
And addressing the parent's point on databases: they don't look like an RDMS, but you can kind of think of log management/querying systems like Splunk et al. to be like a specialized database with specific properties:
- Flexible indexing: Logs change frequently which makes keys come and go, so it's convenient not to not have to be constantly building new indexes to make them searchable.
- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so it's generally a good idea for these systems will rotate data through different tiers of storage as they age — the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.
- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained — it scales horizontally much more easily because it doesn't have have to concern itself with things like consensus around write consistency.)
That said I’m 100% positive that all those key value pairs are indexed for searching and querying purposes.
That said, it's also a surprisingly non-obvious idea in many respects — a lot of people are used to just traditional trace-style logging and never come up with a construct like them, so we felt they were worth calling out as something that might be worth doing.
At LogSense.com we actually tackled this problem too and came with automatic pattern discovery that pretty much converts all logs into structured data. I actually just posted it here: https://news.ycombinator.com/item?id=20569879 I am really curious if this is something that you consider helpful and any feedback is very welcome
I suspect as a payment processor though, being able to look back far when investigating breeches etc would be important.
Call them events and you can claim all the event-sourcing buzzwords too.
One point that I'd try to convey is that the canonical line technique works for any kind of structured format. We use logfmt in all our examples, but JSON would work just as well.
What I can't find described for logfmt is how to handle escaping special characters in the keys or values. Specifically the quotes and equals.
JSON defines this very well, but Splunk simple key=value logging formats suffer from escaping and parsing problems.
Disclaimer: I have no affiliation w/ the project or its maintainer -- but out of gratitude I mention it pretty much every time it's appropriate.
ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
the main difference is that you make easier the parsing since many tools can parse lOgfmt without problems.One interesting use-case here for 'me' is the ability to perform queries in a schema-less fashion and I will do a quick speech on what we are working on Fluent Bit[0] (open source log project), pretty much the ability to query your data while is still in motion (stream processing on the edge[1]). Consider the following data samples in a log file:
ts="2019-03-18 22:48:32.990" event="Request started" http_method=POST http_path=/v1/charges request_id=req_123
ts="2019-03-18 22:48:32.991" event="User authenticated" auth_type=api_key key_id=mk_123 user_id=usr_123
ts="2019-03-18 22:48:32.992" event="Rate limiting ran" rate_allowed=true rate_quota=100 rate_remaining=99
ts="2019-03-18 22:48:32.998" event="Charge created" charge_id=ch_123 permissions_used=account_write team=acquiring
ts="2019-03-18 22:48:32.999" event="Request finished" alloc_count=9123 database_queries=34 duration=0.009 http_status=200
so if I wanted to retrieve all events associated for user 123 I would process the file as follows: $ fluent-bit -R conf/parsers.conf \
-i tail -p alias=data -p path=canonical.log -p parser=logfmt \
-T "SELECT * FROM STREAM:data WHERE user_id='usr_123';"
-o null -f 1
the output is: [1552949312.991000, {"event"=>"User authenticated", "auth_type"=>"api_key", "key_id"=>"mk_123", "user_id"=>"usr_123"}]
the results are in a raw mode but can be exported to stdout in json, to elasticsearch, kafka or any output destination supported.One of the great things of the stream processor engine is that you can create new streams of data based on results, use windows of time (tumbling) for aggregation queries and such.