Not the first time I've seen that by a long shot.
//shakes zimmerframe, shuffles off
(In languages that support macros, you could also just make debug_log() a macro that gets compiled out on release builds, like assert(). That means a whole bunch of object files are going to be functionally different between debug and release builds, though, which is annoying for a lot of operational reasons.)
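(Java doesn't have macros, but the closest analogue -- just as a hedged sketch -- is a compile-time constant flag: the constant is inlined into callers when they are compiled, and javac in practice emits no bytecode for a block guarded by a constant false, which gives you compiled-out debug logging with the same "debug and release artifacts differ" annoyance:)

    final class Trace {
        // Compile-time constant: its value is inlined into callers at compile
        // time, and javac emits nothing for blocks guarded by a constant false
        // (the JLS "conditional compilation" idiom).
        static final boolean DEBUG = false;

        private Trace() {}
    }

    class Worker {
        void run(Object foo) {
            if (Trace.DEBUG) {
                // absent from the release class file, including the string building
                System.err.println("I found: " + foo);
            }
        }
    }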
My favourite solution by far, though, is the one recently implemented by Elixir: log() is a regular function, but takes a closure instead of a string. That way, you can actually put all the work required to generate the log message (inspecting data structures, doing intermediate lookups, etc.) inside the closure, and it all gets elided when you decrease the log-level.
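(Java's standard logging picked up the same trick: java.util.logging's Logger has Supplier overloads, so the closure that builds the message only runs if the level is currently enabled. A minimal sketch -- the expensiveDescription() helper is a made-up stand-in for walking a big data structure:)

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class LazyLogDemo {
        private static final Logger LOG = Logger.getLogger(LazyLogDemo.class.getName());

        public static void main(String[] args) {
            // Lazy: the lambda is only invoked if FINE is currently loggable,
            // so the expensive work is elided when the level is turned down.
            LOG.fine(() -> "I created this gigantic object " + expensiveDescription());

            // Same idea via the more general overload.
            LOG.log(Level.FINE, () -> "intermediate lookup result: " + expensiveDescription());
        }

        // Hypothetical stand-in for inspecting a big, ugly data structure.
        private static String expensiveDescription() {
            return "...";
        }
    }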
log.debug("I created this gigantic object {}",
myBigUglyObject)But that's a problem with how you choose to use logging in your application - not with logging in general or specific logging features in Java.
Logging used without any care can cause lots of problems - but then so can any part of an application. Used with some care it can be a very useful tool for helping to diagnose all kinds of problems.
This is a pretty fantastic route as far as syntactical simplicity goes.
I wonder if it can have any implications on the generated assembly nonetheless? Could there be extra assembly generated outside the logging-level-check branch, which is necessary to bind the variables for the closure?
Similarly, I might suspect that an optimizing compiler that is considering a function for inlining might change its mind for any function where it sees a nontrivial closure like that, without knowing that "trace" level logging nearly never happens.
"Edge cases", to be sure, and the answers are surely different between languages, but interesting to consider nonetheless.
log("debug", "I found: " + foo.toString());
you end up doing a lot of work to create that string, only to have it be ignored by the log call. Whereas if you say: log("debug", => "I found: " + foo.toString());
any extra assembly you execute (or generate) to make that work is peanuts compared to avoiding the call to toString(). None of that (besides the simple thing of log levels) helps with the OP's problem with IO, but it can make a real difference to your CPU.A pretty common halfway solution (used by Python's logging for instance) is that the logging library takes a format string and its parameters, and does the formatting itself.
Doesn't save intermediate lookups or complex computations if they exist, but for the common case of a complex stringification it can save your bacon.
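(The Java analogue of that halfway solution, as a rough sketch using the SLF4J API: the template and arguments are handed over as-is, and formatting -- including the arguments' toString() -- only happens if the level is enabled. Work done to produce an argument still runs eagerly, which is exactly the gap the closure form covers:)

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class HalfwayDemo {
        private static final Logger LOG = LoggerFactory.getLogger(HalfwayDemo.class);

        static void handle(Object foo) {
            // Eager: concatenation and foo.toString() run even with DEBUG off.
            // LOG.debug("I found: " + foo);

            // Halfway: formatting (and foo.toString()) is deferred until the
            // logger has checked that DEBUG is actually enabled.
            LOG.debug("I found: {}", foo);
        }
    }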
The alternative is an explicit conditional around the logging call, which is definitely worse than the ability to log a closure.
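(For comparison, the explicit conditional being described, sketched with java.util.logging -- it works, but the level check leaks into every call site:)

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class GuardedDemo {
        private static final Logger LOG = Logger.getLogger(GuardedDemo.class.getName());

        static void handle() {
            // Guard by hand so the expensive dump only runs when FINE is on.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("I found: " + expensiveDescription());
            }
        }

        // Hypothetical stand-in for an expensive dump of some data structure.
        private static String expensiveDescription() {
            return "...";
        }
    }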
I've never used Elixir but toyed with Erlang many years ago - is it called asynchronously by the logging process in Elixir? If so, I guess you're relying on immutable data structures to make sure you get the correct logging output?
This gives you another neat benefit, nearly for free: if you have, say, application Foo on node A, wanting to log to SyslogSink on node B, then Foo first sends its logs to a gen_event on A, and the SyslogSink handler in that gen_event gets the opportunity to drop messages, using whatever logic SyslogSink likes, before forwarding them over to B (which might be half-way across the Internet.)
Took ages to work that out.
Ugh.
/s
Turns out it was the JVM-provided GC logging hanging on flush (not even fsync) calls. The flush call happened during GC, while the GC implementation held a stop-the-world lock. Digging through JVM source code is 'fun'.
If you record system metrics (eg. to ganglia) then you can also attempt to correlate large pauses to a large and rapidly declining number of dirty bytes in the fs cache.
When they discover the actual problem is some issue with the raid controller, I promise not to say "I told you so".
Unrelated write activity on a filesystem can cause fsync() calls in any other process to vary wildly in latency. This can be replicated; here's an experiment for you. First, run this:
strace -T -efsync ruby -e'loop { STDOUT.fsync; puts "a" * 120; sleep 0.1 } ' > ~/somefile
Then, in another terminal do a little bit of writing -- make sure it is on the same filesystem. For example:
dd if=/dev/zero of=~/someotherfile bs=4M count=1
On my poor little aws VM, here is what I see:
fsync(1) = 0 <0.025072>
fsync(1) = 0 <3.930661>
fsync(1) = 0 <0.024810>
That is, writing 4 megabytes in an unrelated process caused fsync() latency to jump by two orders of magnitude.
Removing fsync() is an appropriate fix because we don't really ever want to flush this data to durable storage.
20% utilization on a resource that can only do 100 random operations per second is a major problem, and way different from 5%.
A piece of hardware that operates at 100 Hz, with 20% utilization, will block for about 135ms or more for the 95th percentile request. With 5% utilization, it will block for about 10ms for the 95th percentile request.
My quick calculations are somewhat below the 200ms discrepancy you show in the chart, but not far off.
Of course, turning off fsync is a perfectly good solution. Longer term, I would move to SSDs and just make this entire class of problem go away. I don't even have any spinning platters to test your strace on.
Most programming abstractions and tools (such as logging, UI design, and persistence) are all "easy" when it comes to one machine (or a few machines) in a closed, deterministic environment.
All of these become extremely tricky when you have to deal with many instances across many different, possibly non-deterministic environments. For UI, it quickly becomes: what's the viewport? What's the real estate? How is interaction done (touch? mouse? keyboard? numpad?). For persistence, one has to worry about the number of simultaneous connections, character encoding and normalization, reads/writes per second, synchronisation, etc.
So, also for logging - are we logging client side or over the network? What's the memory on the client? What else is running on the client? What permissions are we logging with? Do we care about a system call on this client for any given user interaction? etc.
Logging, UI, persistence, network protocols, heck, even just choosing an intermediate data structure - all are tricky at scale, across devices and non-deterministic environments.
Or use syslog() like a sane person, and let that "extra thread" live inside the OS IPC mechanism.
(Or stdout/stderr like a modern sane person, and let upstart/systemd/docker/etc. push your logs to syslog if that's where it feels like pushing them.)
Using a well implemented ring buffer [+] can get enqueue operations down to a few instructions and something like two memory fences.
The overhead of IPC, which wakes up the kernel scheduler, switches the processor back and forth between privilege modes a few times on the way, knocks apart all the CPU cache and register state to swap in another process, while the MMU is flipping all your pages around because these two processes don't trust each other to write directly into their respective memory... is not going to have quite the same performance characteristics.
A notable moment in the history of logging is Java's log4j framework, which, within a single process, used exclusive synchronization. When this was replaced by a (relatively) lock-free queue implementation, throughput increased by orders of magnitude. (Their notes and graphs on this can be found at https://logging.apache.org/log4j/2.x/manual/async.html .) This isn't an exact metaphor for the difference between a good lockfree ringbuffer and IPC either, but it certainly has some similarities, and indeed ends with a specific shout-out to the power of avoiding "locks requiring kernel arbitration".
--
[+] The "mechanical sympathy" / Disruptor folks have some great and accessible writeups on how they addressed the finer points of high performance shared memory message passing. http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-... is one of my favorite reads.
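(To make the ring buffer point above concrete, here is a minimal single-producer/single-consumer sketch in Java -- an illustration of why the enqueue path can be so cheap, not the Disruptor itself: one array store, two counter reads, and one ordered write to publish.)

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal SPSC ring buffer for log events: one application thread calls
    // offer(), one dedicated logging/IO thread calls poll().
    final class LogRingBuffer {
        private final String[] slots;
        private final int mask;                            // capacity must be a power of two
        private final AtomicLong head = new AtomicLong();  // next slot to read (consumer)
        private final AtomicLong tail = new AtomicLong();  // next slot to write (producer)

        LogRingBuffer(int capacityPowerOfTwo) {
            slots = new String[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
        }

        // Producer side: a few instructions plus one ordered store.
        boolean offer(String message) {
            long t = tail.get();
            if (t - head.get() == slots.length) {
                return false;                              // full: drop or back off, policy decision
            }
            slots[(int) (t & mask)] = message;
            tail.lazySet(t + 1);                           // ordered write publishes the slot
            return true;
        }

        // Consumer side: drains messages so the slow IO happens off the hot path.
        String poll() {
            long h = head.get();
            if (h == tail.get()) {
                return null;                               // empty
            }
            int i = (int) (h & mask);
            String message = slots[i];
            slots[i] = null;
            head.lazySet(h + 1);
            return message;
        }
    }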
As it turns out, we had nearly the same symptoms as the OP. After running strace on syslogd (CentOS 5), I realized that it also does an fsync() after every call. A quick trip to the man page and a configuration change later, our issues disappeared.
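(For anyone chasing the same thing: in the classic sysklogd-style syslog.conf, a minus sign in front of the file path tells syslogd not to sync after every message. The paths below are just examples:)

    # /etc/syslog.conf -- the leading "-" means "do not sync the file after every write"
    mail.*        -/var/log/maillog
    *.info        -/var/log/messages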
> Time spent in fsync accounts for almost exactly the delay we were seeing.
What delay? I see the whole thing taking 1.5 seconds, with 1.3 seconds spent in futex (0.4 more than on the normal host). Not sure why we are suddenly talking about fsync. I also don't know what either call (futex, fsync) could be doing.
All these are not questions I want answers to (some of this I could google, of course). I just want to show that it's a rather confusing read for some readers if you expect them to understand the strace outputs as well as you do, when you seem to be using that tool on a daily basis and the readers might not have used it at all, ever. It would be great to be able to follow your insights better. Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".
Exactly. I try to always point to the actual data when I'm telling my readers my conclusions.
"Calls to futex, epoll_wait and select are largely the same on both systems. The problem host does 1727 calls to fsync(), while the normal host does only 849 calls. The problem host spends 1.24% of its time in fsync(). The normal host only 0.48%. This makes fsync() our number 1 suspect."
This at least lets the reader refer back to the output and see corresponding entries.
Recall, this is a problem in the 99th percentile. That is, on average our requests are still taking around 20ms to complete -- but one in every hundred or so takes about 200ms.
The first strace output shows a summary table of time, including both our fast 20ms queries and the 1% of slow 200ms queries. In the summary view I am looking not simply at the largest number in the summary, but rather for indications that my problematic system is spending time doing something that my healthy system is not. Time spent is a zero-sum equation -- the app is either doing what it's supposed to, or not. futex() is where it should be spending time while idle. fsync() time roughly doubles as a proportion, which means it's taking time away from the otherwise productive (or idle) operations. Which means it's potentially the problem.
When I trace the fsync call individually I see a time delta roughly on par with my per-request delay for the slow 1% of queries. Which indicates to me that this call is happening during the slow queries, accounting for the delay in its entirety.
Say, if you hit a bug somewhere that causes the server itself to crash.
And here: https://news.ycombinator.com/item?id=8359556
Even on the problematic host we only saw this latency issue in the 99th percentile. That is: even on the problem host 99 out of 100 queries were served as expected and only 1 out of 100 saw this additional latency.
I think your point still stands about logging, being cautious about blocking I/O calls, etc. But it seems the bigger point is one of how your overall system is architected: which processes run where, dedicating like nodes to their tasks vs. potential quality/consistency issues arising from having some pull double duty, etc.
Those seemed to be the source of the real issue here.
For example, here I reproduce the problem on a completely isolated machine: https://news.ycombinator.com/item?id=8359556
If you're not all that concerned about consistency, each web server can keep its logs in a segregated memory space and then another process can combine/commit them and send a flush command, leaving the primary machines relatively unencumbered.
The trouble outlined in my blog post is that the logging framework was calling fsync() -- that is, specifically asking to flush the page cache all the way to disk.
In a few cases in the past when dealing with unimportant data I have downgraded to ext2 for a nice performance bump.
It's been a few years since I looked at this, but it seems like there's only a small window during which data will stay in the buffer before it gets automatically flushed. The nice thing about this automatic flush is that it won't cause the application to block.
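(On Linux that window is controlled by a couple of sysctls; the values below are the usual stock defaults, shown only as a rough illustration -- both are in hundredths of a second:)

    vm.dirty_writeback_centisecs = 500    # flusher threads wake up roughly every 5 seconds
    vm.dirty_expire_centisecs = 3000      # dirty data older than ~30 seconds gets written back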
It truly is a joy.
I also forgot to mention we have a build server, but that's just a virtual machine that we back up regularly. It's our only piece of static infrastructure. Everything else, except storage (which is, again, transparent to us), is destroyed on a weekly, if not daily, basis for new releases.
I think, fundamentally, files are the wrong abstraction for log data (as pointed out by Adam Wiggins of Heroku fame: http://12factor.net/logs), but it is something that we are stuck with.