I feel like I've learned a lot from reading his blog over the years. I even bought his book years ago because I felt like I was getting a lot of value from the blog.
It's really too bad Microsoft doesn't seem to value backwards compatibility as much as they did during the times Chen often writes about. It seems like an interesting challenge that they've pretty much given up on. I can't even count how many conversations I've been in where people complained on one hand that Microsoft focused on that backwards compatibility too much and on the other that their driver from 2001 doesn't work right in Windows 7. Often these statements happen moments apart.
Driver compatibility is less of an issue; older hardware becomes less common (and thus less of an issue) over time, and strictly adhering to kernel compatibility significantly limits your ability to add new features to the kernel.
This one in particular seemed worthy of sharing because of a couple of things. Firstly, if you just saw the artifact and not the reasoning, you would probably see a lot of WTF in having 7 bytes of NOP-type instructions, with five outside and two inside each function; it's a great reveal when you find out what it is for. Then my reaction was "Wow - I don't really design good instrumentation/logging/debugging points in my code, do I? I wonder if there's any part of this idea I can fruitfully rip off for my own code?" It was a nice one-two punch.
(I could tell quite a few Raymond stories from when we worked together but it is better to leave it to him. It is a shame that he sticks to technical topics mostly on his blog.)
> It's really too bad Microsoft doesn't seem to value backwards compatibility as much as they did during the times Chen often writes about.
This, I'd disagree on. At a certain point, one either has to move on, or to spend 90% of the time working on backward compatibility.
> on the other that their driver from 2001 doesn't work right in Windows 7
Drivers were never really part of the backward compatibility obsession. They changed utterly between 9x and NT, of course, and also fairly dramatically between DOS/Win3.x and 9x, between NT4 and 2000, and between XP and Vista.
What you do in Detours is, freeze the process, disassemble the first several instructions of the function you want to hook, copy out enough of them to make room for a full jump instruction, copy in your hook function somewhere in memory, followed by the instructions you stole to make room for the jump, followed by a jump back to the original function. Then you patch in a jump to that location and unfreeze the process.
The example programs for Detours do this, for instance, on every libc function to implement library tracing.
That this "just works" with Microsoft's Detours package is kind of mindboggling.
This is a great project to tackle if you want to write programmable debuggers. We've done it for Win32 (you need a full build environment to use Detours; we have the whole thing in Ruby), OS X, and Linux. It's crazy useful.
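To make the byte shuffling in that description concrete, here is a minimal sketch in Python that models memory as a bytearray. It is not Detours' actual code: install_detour, its parameters, and the addresses are hypothetical, the instruction lengths are passed in by hand where a real tool would disassemble, and the sketch skips freezing the process and fixing up stolen instructions that use relative addressing. It also places the stolen bytes in a separate trampoline rather than directly after the hook.

```python
import struct

JMP_REL32 = 0xE9   # x86 near jump opcode, followed by a 32-bit relative offset
JMP_LEN = 5        # one opcode byte plus four offset bytes
NOP = 0x90

def jmp_rel32(src, dst):
    """Encode a `jmp dst` instruction located at address `src`."""
    return bytes([JMP_REL32]) + struct.pack("<i", dst - (src + JMP_LEN))

def install_detour(mem, func, hook, tramp, insn_lengths):
    """Redirect `func` to `hook` and build a trampoline at `tramp`.

    insn_lengths lists the sizes of the function's leading instructions
    (an assumption of this sketch; a real tool disassembles them).
    Returns the number of bytes stolen from the function.
    """
    # Steal whole instructions until a 5-byte jump fits.
    stolen = 0
    for length in insn_lengths:
        if stolen >= JMP_LEN:
            break
        stolen += length
    if stolen < JMP_LEN:
        raise ValueError("function too short to patch")

    # Trampoline: the stolen instructions, then a jump back into the original.
    mem[tramp:tramp + stolen] = mem[func:func + stolen]
    mem[tramp + stolen:tramp + stolen + JMP_LEN] = jmp_rel32(tramp + stolen, func + stolen)

    # Patch the entry point: jump to the hook, pad the leftover bytes with NOPs.
    mem[func:func + JMP_LEN] = jmp_rel32(func, hook)
    mem[func + JMP_LEN:func + stolen] = bytes([NOP]) * (stolen - JMP_LEN)
    return stolen

# Demo: push ebp / mov ebp,esp / sub esp,0x10 / ret at a made-up address 0x10.
mem = bytearray(0x100)
mem[0x10:0x17] = bytes.fromhex("558bec83ec10c3")
stolen = install_detour(mem, func=0x10, hook=0x80, tramp=0x40, insn_lengths=[1, 2, 3, 1])
print(stolen, mem[0x10:0x17].hex(), mem[0x40:0x4b].hex())
```

The hook calls the trampoline address when it wants the original behaviour, which is why the stolen instructions have to be copied whole rather than cut at byte 5.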
An anecdote: I've got a Sony VAIO Z series laptop, one of the 2010 models with "Switchable Hybrid" graphics -- that is, there's a switch marked "Auto"/"Speed"/"Stamina" which you can use to switch between the embedded Intel GPU and the discrete nVidia GPU. The laptop itself is great -- probably the best developer's laptop I've ever seen/used -- but drivers have always been a real pain. Anyway, as it turns out, I was updating the drivers last week and just happened to notice the Detours DLL within the driver installer files; so it seems that the graphics driver actually just checks the position of the switch and uses Detours to direct any calls to the "real" driver for whatever hardware is selected.
That was the most entertaining/frustrating (depending on your world view; it made the journey way more interesting, but also a lot longer) part of hacking classic Mac OS.
You would have a zillion extensions (Apple and third party) each patch tens of OS calls, both at startup and, in cases where the Finder reset patches, after the Finder launched, or even after every application launched (to get your code running at such times, you would have to patch another OS call). On PowerPC machines you would have the added fun of patching PPC code with 68k code and vice versa.
That that _sometimes_ worked was really mind boggling. Relative to that, patching your own libc seems easy.
I believe Detours patches the target function in memory inside a single process; it does not patch the call on a system-wide basis, so there really is no 'two different programs each patching the same function'. In theory you could have two different pieces of code running in the same process doing that (i.e. each patching a given function), but Detours gives you a 'trampoline' function to invoke the original thing you are patching. So I believe the second to patch, when invoking its trampoline, would simply invoke the first patch, which in turn would invoke the original through its own trampoline. I haven't tried that, though, so it may not work that way :)
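The chaining guess above can be modelled with a toy sketch in Python. This only illustrates the reasoning, under the assumption that each new patch receives whatever the entry point currently reaches as its trampoline; the entry table and attach helper are hypothetical, and nothing here confirms how Detours itself behaves.

```python
# Toy model: `entry` maps a function name to whatever its entry point reaches.
entry = {"target": lambda x: f"orig({x})"}

def attach(name, make_hook):
    """Install a hook; its trampoline is whatever the entry reached before."""
    trampoline = entry[name]
    entry[name] = make_hook(trampoline)

# Two independent pieces of code patch the same function, one after the other.
attach("target", lambda tramp: lambda x: f"first[{tramp(x)}]")
attach("target", lambda tramp: lambda x: f"second[{tramp(x)}]")

result = entry["target"]("x")
print(result)  # second[first[orig(x)]]
```

In this model the second patch runs first, its trampoline lands in the first patch, and only the first patch's trampoline reaches the original code.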
VS uses this mechanism to allow you to run as non-admin even though there is TONS of VS and third party code that expects to do things like write to HKLM, which is a no-no if you are not an admin.
http://software.intel.com/en-us/articles/intercepting-system...
Now that I think about it I'm even more surprised it worked in the first place. I thought Linux had w^x protection?
1) In the comments Raymond says, "Hot-patching is not an application feature. It's an OS internal feature for servicing." Then why does the compiler put hot-patch points in my code? Why not use a special compiler flag when building Windows DLLs?
2) Why do we need a special hot-patch point at all? What's wrong with just overwriting the first few bytes of the function you want to hot-patch?
(It also needs to be possible to overwrite the patch target with a single instruction, which isn't possible for a JMP with a 32-bit displacement, as that is 5 bytes long.)
2) I think he addressed this - someone might be executing the function while you are trying to patch it. Having a 2-byte, one clock cycle NOP at the front means that you can replace it "atomically" from the perspective that nobody can walk into the middle of you updating the memory.
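A minimal sketch of that two-step patch, again modelling memory as a Python bytearray. The encodings are the standard x86 ones (8B FF for mov edi, edi, EB F9 for a short jump back by 7 bytes, E9 for a jump with a 32-bit offset); hot_patch and the addresses are hypothetical, and Python cannot show the atomicity itself, only the byte layout that makes a single 2-byte store sufficient.

```python
import struct

PADDING = b"\x90" * 5          # five one-byte NOPs just before the function
MOV_EDI_EDI = b"\x8b\xff"      # the two-byte no-op at the function entry
SHORT_JMP_BACK = b"\xeb\xf9"   # jmp short -7: from entry+2 back to entry-5

def hot_patch(mem, entry, new_func):
    """Redirect the function at `entry` to `new_func` using its hot-patch point."""
    if bytes(mem[entry:entry + 2]) != MOV_EDI_EDI:
        raise ValueError("no hot-patch point at this address")
    pad = entry - 5
    # Step 1: write the 5-byte jump into the padding. No thread executes
    # these bytes, so it does not matter that the write is not atomic.
    mem[pad:entry] = b"\xe9" + struct.pack("<i", new_func - entry)
    # Step 2: swap the 2-byte no-op for a 2-byte short jump into the padding.
    # On real hardware this is the one write that must land in a single store.
    mem[entry:entry + 2] = SHORT_JMP_BACK

# Demo with made-up addresses: function at 0x10, replacement at 0x30.
mem = bytearray(64)
mem[0x0b:0x10] = PADDING
mem[0x10:0x12] = MOV_EDI_EDI
hot_patch(mem, entry=0x10, new_func=0x30)
print(mem[0x0b:0x12].hex())
```

A thread that already fetched the old no-op simply runs the original function once more; a thread arriving after the swap takes the short jump and then the long one.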
This is part of the reason that JIT is not allowed, and will not even _work_, on iOS (with the exception of Mobile Safari's Javascript engine, which has special privileges). Applications aren't allowed to set the ARM NX equivalent.
For example, from the V8 sources: http://www.google.com/codesearch#W9JxUuHYyMg/trunk/src/platf...
You can use xxd and objdump to see what each of these translates into. For example, here's an 8-byte nop for x86-64:
$ echo '0f 1f 84 00 00 00 00 00' | xxd -r -p > /tmp/bincode
$ objdump -M intel -D -b binary -mi386 -Mx86-64 /tmp/bincode
/tmp/bincode: file format binary
Disassembly of section .data:
00000000 <.data>:
0: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
7: 00
https://lwn.net/Articles/264029/
The mcount feature piggybacks on the profiling instruction added into every function when you use the gcc -pg option.
Edit: better link is probably this one: http://www.mjmwired.net/kernel/Documentation/trace/ftrace.tx...