Not the first time I've seen that by a long shot.
//shakes zimmerframe, shuffles off
(In languages that support macros, you could also just make debug_log() a macro that gets compiled out on release builds, like assert(). That means a whole bunch of object files are going to be functionally different between debug and release builds, though, which is annoying for a lot of operational reasons.)
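(Java doesn't have macros, but the closest analogue -- just as a hedged sketch -- is a compile-time constant flag: the constant is inlined into callers when they are compiled, and javac in practice emits no bytecode for a block guarded by a constant false, which gives you compiled-out debug logging with the same "debug and release artifacts differ" annoyance:)

    final class Trace {
        // Compile-time constant: its value is inlined into callers at compile
        // time, and javac emits nothing for blocks guarded by a constant false
        // (the JLS "conditional compilation" idiom).
        static final boolean DEBUG = false;

        private Trace() {}
    }

    class Worker {
        void run(Object foo) {
            if (Trace.DEBUG) {
                // absent from the release class file, including the string building
                System.err.println("I found: " + foo);
            }
        }
    }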
My favourite solution by far, though, is the one recently implemented by Elixir: log() is a regular function, but takes a closure instead of a string. That way, you can actually put all the work required to generate the log message (inspecting data structures, doing intermediate lookups, etc.) inside the closure, and it all gets elided when you decrease the log-level.
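(Java's standard logging picked up the same trick: java.util.logging's Logger has Supplier overloads, so the closure that builds the message only runs if the level is currently enabled. A minimal sketch -- the expensiveDescription() helper is a made-up stand-in for walking a big data structure:)

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class LazyLogDemo {
        private static final Logger LOG = Logger.getLogger(LazyLogDemo.class.getName());

        public static void main(String[] args) {
            // Lazy: the lambda is only invoked if FINE is currently loggable,
            // so the expensive work is elided when the level is turned down.
            LOG.fine(() -> "I created this gigantic object " + expensiveDescription());

            // Same idea via the more general overload.
            LOG.log(Level.FINE, () -> "intermediate lookup result: " + expensiveDescription());
        }

        // Hypothetical stand-in for inspecting a big, ugly data structure.
        private static String expensiveDescription() {
            return "...";
        }
    }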
log.debug("I created this gigantic object {}",
myBigUglyObject)But that's a problem with how you choose to use logging in your application - not with logging in general or specific logging features in Java.
Logging used without any care can cause lots of problems - but then so can any part of an application. Used with some care it can be a very useful tool for helping to diagnose all kinds of problems.
This is a pretty fantastic route as far as syntactical simplicity goes.
I wonder if it can have any implications on the generated assembly nonetheless? Could there be extra assembly generated outside the logging-level-check branch, which is necessary to bind the variables for the closure?
Similarly, I might suspect that an optimizing compiler that is considering a function for inlining might change its mind for any function where it sees a nontrivial closure like that, without knowing that "trace" level logging nearly never happens.
"Edge cases", to be sure, and the answers are surely different between languages, but interesting to consider nonetheless.
log("debug", "I found: " + foo.toString());
you end up doing a lot of work to create that string, only to have it be ignored by the log call. Whereas if you say: log("debug", => "I found: " + foo.toString());
any extra assembly you execute (or generate) to make that work is peanuts compared to avoiding the call to toString(). None of that (besides the simple thing of log levels) helps with the OP's problem with IO, but it can make a real difference to your CPU.A pretty common halfway solution (used by Python's logging for instance) is that the logging library takes a format string and its parameters, and does the formatting itself.
Doesn't save intermediate lookups or complex computations if they exist, but for the common case of a complex stringification it can save your bacon.
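(The Java analogue of that halfway solution, as a rough sketch using the SLF4J API: the template and arguments are handed over as-is, and formatting -- including the arguments' toString() -- only happens if the level is enabled. Work done to produce an argument still runs eagerly, which is exactly the gap the closure form covers:)

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class HalfwayDemo {
        private static final Logger LOG = LoggerFactory.getLogger(HalfwayDemo.class);

        static void handle(Object foo) {
            // Eager: concatenation and foo.toString() run even with DEBUG off.
            // LOG.debug("I found: " + foo);

            // Halfway: formatting (and foo.toString()) is deferred until the
            // logger has checked that DEBUG is actually enabled.
            LOG.debug("I found: {}", foo);
        }
    }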
The alternative is an explicit conditional around the logging call, which is definitely worse than the ability to log a closure.
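(For comparison, the explicit conditional being described, sketched with java.util.logging -- it works, but the level check leaks into every call site:)

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class GuardedDemo {
        private static final Logger LOG = Logger.getLogger(GuardedDemo.class.getName());

        static void handle() {
            // Guard by hand so the expensive dump only runs when FINE is on.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("I found: " + expensiveDescription());
            }
        }

        // Hypothetical stand-in for an expensive dump of some data structure.
        private static String expensiveDescription() {
            return "...";
        }
    }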
I've never used Elixir but toyed with Erlang many years ago - is it called asynchronously by the logging process in Elixir? If so, I guess you're relying on immutable data structures to make sure you get the correct logging output?
This gives you another neat benefit, nearly for free: if you have, say, application Foo on node A, wanting to log to SyslogSink on node B, then Foo first sends its logs to a gen_event on A, and the SyslogSink handler in that gen_event gets the opportunity to drop messages, using whatever logic SyslogSink likes, before forwarding them over to B (which might be half-way across the Internet.)
Took ages to work that out.
Ugh.
/s
Turns out it was the JVM-provided GC logging hanging on flush (not even fsync) calls. The flush call happened during GC, while the GC implementation held a stop-the-world lock. Digging through JVM source code is 'fun'.
If you record system metrics (eg. to ganglia) then you can also attempt to correlate large pauses to a large and rapidly declining number of dirty bytes in the fs cache.
When they discover the actual problem is some issue with the raid controller, I promise not to say "I told you so".
Unrelated write activity on a filesystem can cause fsync() calls in any other process to vary wildly in latency. This can be replicated; here's an experiment for you. First, run this:
strace -T -efsync ruby -e'loop { STDOUT.fsync; puts "a" * 120; sleep 0.1 } ' > ~/somefile
Then, in another terminal do a little bit of writing -- make sure it is on the same filesystem. For example:
dd if=/dev/zero of=~/someotherfile bs=4M count=1
On my poor little aws VM, here is what I see:
fsync(1) = 0 <0.025072>
fsync(1) = 0 <3.930661>
fsync(1) = 0 <0.024810>
That is, writing 4 megabytes in an unrelated process caused fsync() latency to jump by two orders of magnitude.
Removing fsync() is an appropriate fix because we don't really ever want to flush this data to durable storage.
20% utilization on a resource that can only do 100 random operations per second is a major problem, and way different from 5%.
A piece of hardware that operates at 100 Hz, with 20% utilization, will block for about 135ms or more for the 95th percentile request. With 5% utilization, it will block for about 10ms for the 95th percentile request.
My quick calculations are somewhat below the 200ms discrepancy you show in the chart, but not far off.
Of course, turning off fsync is a perfectly good solution. Longer term, I would move to SSDs and just make this entire class of problem go away. I don't even have any spinning platters to test your strace on.
Most programming abstractions and tools (such as logging, UI design, and persistence) are all "easy" when it comes to one machine (or a few machines) in a closed, deterministic environment.
All of these become extremely tricky when you have to deal with many instances across many different, possibly non-deterministic environments. For UI, it quickly becomes: what's the viewport? What's the real estate? How is interaction done (touch? mouse? keyboard? numpad?). For persistence, one has to worry about the number of simultaneous connections, character encoding and normalization, reads/writes per second, synchronisation, etc.
So, also for logging - are we logging client side or over the network? What's the memory on the client? What else is running on the client? What permissions are we logging with? Do we care about a system call on this client for any given user interaction? etc.
Logging, UI, persistence, network protocols, heck, even just choosing an intermediate data structure - all are tricky at scale, across devices and non-deterministic environments.
Or use syslog() like a sane person, and let that "extra thread" live inside the OS IPC mechanism.
(Or stdout/stderr like a modern sane person, and let upstart/systemd/docker/etc. push your logs to syslog if that's where it feels like pushing them.)
Using a well implemented ring buffer [+] can get enqueue operations down to a few instructions and something like two memory fences.
The overhead of IPC, which wakes up the kernel scheduler, switches the processor back and forth between privilege modes a few times on the way, knocks apart all the CPU cache and register state to swap in another process, while the MMU is flipping all your pages around because these two processes don't trust each other to write directly into their respective memory... is not going to have quite the same performance characteristics.
A notable moment in the history of logging is Java's log4j framework, which, within a single process, used exclusive synchronization. When this was replaced by a (relatively) lock-free queue implementation, throughput increased by orders of magnitude. (Their notes and graphs on this can be found at https://logging.apache.org/log4j/2.x/manual/async.html .) This isn't an exact metaphor for the difference between a good lockfree ringbuffer and IPC either, but it certainly has some similarities, and indeed ends with a specific shout-out to the power of avoiding "locks requiring kernel arbitration".
--
[+] The "mechanical sympathy" / Disruptor folks have some great and accessible writeups on how they addressed the finer points of high performance shared memory message passing. http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-... is one of my favorite reads.
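(To make the ring buffer point above concrete, here is a minimal single-producer/single-consumer sketch in Java -- an illustration of why the enqueue path can be so cheap, not the Disruptor itself: one array store, two counter reads, and one ordered write to publish.)

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal SPSC ring buffer for log events: one application thread calls
    // offer(), one dedicated logging/IO thread calls poll().
    final class LogRingBuffer {
        private final String[] slots;
        private final int mask;                            // capacity must be a power of two
        private final AtomicLong head = new AtomicLong();  // next slot to read (consumer)
        private final AtomicLong tail = new AtomicLong();  // next slot to write (producer)

        LogRingBuffer(int capacityPowerOfTwo) {
            slots = new String[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
        }

        // Producer side: a few instructions plus one ordered store.
        boolean offer(String message) {
            long t = tail.get();
            if (t - head.get() == slots.length) {
                return false;                              // full: drop or back off, policy decision
            }
            slots[(int) (t & mask)] = message;
            tail.lazySet(t + 1);                           // ordered write publishes the slot
            return true;
        }

        // Consumer side: drains messages so the slow IO happens off the hot path.
        String poll() {
            long h = head.get();
            if (h == tail.get()) {
                return null;                               // empty
            }
            int i = (int) (h & mask);
            String message = slots[i];
            slots[i] = null;
            head.lazySet(h + 1);
            return message;
        }
    }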
As it turns out, we had nearly the same symptoms as the OP. After running strace on syslogd (CentOS 5), I realized that it also does an fsync() after every call. A quick trip to the man page and a configuration change later, our issues disappeared.
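(For anyone chasing the same thing: in the classic sysklogd-style syslog.conf, a minus sign in front of the file path tells syslogd not to sync after every message. The paths below are just examples:)

    # /etc/syslog.conf -- the leading "-" means "do not sync the file after every write"
    mail.*        -/var/log/maillog
    *.info        -/var/log/messages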
> Time spent in fsync accounts for almost exactly the delay we were seeing.
What delay? I see the whole thing taking 1.5 seconds, with 1.3 seconds spent in futex (0.4 more than on the normal host). Not sure why we are suddenly talking about fsync. I also don't know what either call (futex, fsync) could be doing.
All these are not questions I want answers to (some of this I could google, of course). I just want to show that it's a rather confusing read for some readers if you expect them to understand the strace outputs as well as you do, when you seem to be using that tool on a daily basis and the readers might not have used it at all, ever. It would be great to be able to follow your insights better. Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".
Exactly. I try to always point to the actual data when I'm telling my readers my conclusions.
"Calls to futex, epoll_wait and select are largely the same on both systems. The problem host does 1727 calls to fsync(), while the normal host does only 849 calls. The problem host spends 1.24% of its time in fsync(). The normal host only 0.48%. This makes fsync() our number 1 suspect."
This at least lets the reader refer back to the output and see corresponding entries.
Recall, this is a problem in the 99th percentile. That is, on average our requests are still taking around 20ms to complete -- but one in every hundred or so takes about 200ms.
The first strace output shows a summary table of time, including both our fast 20ms queries and the 1% of slow 200ms queries. In the summary view I am looking not simply at the largest number in the summary, but rather for indications that my problematic system is spending time doing something that my healthy system is not. Time spent is a zero-sum equation -- the app is either doing what it's supposed to, or not. futex() is where it should be spending time while idle. fsync() time roughly doubles as a proportion, which means it's taking time away from the otherwise productive (or idle) operations. Which means it's potentially the problem.
When I trace the fsync call individually I see a time delta roughly on par with my per-request delay for the slow 1% of queries. Which indicates to me that this call is happening during the slow queries, accounting for the delay in its entirety.
Say, if you hit a bug somewhere that causes the server itself to crash.
And here: https://news.ycombinator.com/item?id=8359556
Even on the problematic host we only saw this latency issue in the 99th percentile. That is: even on the problem host 99 out of 100 queries were served as expected and only 1 out of 100 saw this additional latency.
I think your point still stands about logging, being cautious about blocking I/O calls, etc. But it seems the bigger point is one of how your overall system is architected: which processes run where, dedicating like nodes to their tasks vs. potential quality/consistency issues arising from having some pull double duty, etc.
Those seemed to be the source of the real issue here.
For example, here I reproduce the problem on a completely isolated machine: https://news.ycombinator.com/item?id=8359556
If you're not all that concerned about consistency, each web server can keep its logs in a segregated memory space and then another process can combine/commit them and send a flush command, leaving the primary machines relatively unencumbered.
The trouble outlined in my blog post is that the logging framework was calling fsync() -- that is, specifically asking to flush the page cache all the way to disk.
In a few cases in the past when dealing with unimportant data I have downgraded to ext2 for a nice performance bump.
It's been a few years since I looked at this, but it seems like there's only a small window during which data will stay in the buffer before it gets automatically flushed. The nice thing about this automatic flush is that it won't cause the application to block.
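(On Linux that window is controlled by a couple of sysctls; the values below are the usual stock defaults, shown only as a rough illustration -- both are in hundredths of a second:)

    vm.dirty_writeback_centisecs = 500    # flusher threads wake up roughly every 5 seconds
    vm.dirty_expire_centisecs = 3000      # dirty data older than ~30 seconds gets written back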
It truly is a joy.
I also forgot to mention we have a build server, but that's just a virtual machine that we back up regularly. It's our only piece of static infrastructure. Everything else, except storage (which is, again, transparent to us), is destroyed on a weekly, if not daily, basis for new releases.
I think, fundamentally, files are the wrong abstraction for log data (as pointed out by Adam Wiggins of Heroku fame: http://12factor.net/logs), but it is something that we are stuck with.