X86 MMU fault handling is turing complete

(github.com)

417 points · mman · 13y ago · 39 comments

tptacek · 13y ago

This is more or less the greatest thing I've learned about in the last couple years.

What's happening here is that they're getting computation without executing any instructions, simply through the process of using the MMU hardware to "resolve addresses". The page directory system has been set up in such a way that address resolution effects a virtual machine that they can code to.

This works because when you attempt to resolve an invalid address, the CPU generates a trap (#PF), and the handling of that trap pushes information on the "stack". Each time you push data to the stack, you decrement the stack pointer. Eventually, the stack pointer underflows; when that happens, a different trap (#DF) fires. This mechanism put together gives you:

    if x < 4 { goto b } else { x = x - 4 ; goto a }

also known as "subtract and branch if less than or equal to zero", also known as "an instruction adequate to construct a one-instruction computer".

The virtual machine "runs" by generating an unending series of traps: in the "goto a" case, the result of translation is another address generating a trap. And so on.
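For anyone who wants to poke at the control flow, here is a rough toy model of that decrement-and-branch primitive in Python. It only mirrors the pseudocode above and does not touch any real MMU or TSS state. The names `step`, `run` and `PUSH_SIZE` are made up for illustration, and the 4-byte push is an assumption taken from the constant 4 in the pseudocode.

```python
# Toy model of "if x < 4 { goto b } else { x = x - 4 ; goto a }".
# Assumption: each fault pushes 4 bytes, so the stack pointer drops by 4;
# when fewer than 4 bytes remain, the push fails and the other branch is taken.
# All names here are illustrative, not taken from trapcc.

PUSH_SIZE = 4

def step(x):
    """One 'instruction': returns (new_x, next_label)."""
    if x < PUSH_SIZE:
        return x, "b"              # push would underflow: the #DF-style branch
    return x - PUSH_SIZE, "a"      # push succeeded: the #PF-style branch

def run(x):
    """Keep taking branch 'a' until branch 'b' is reached; count the steps."""
    faults = 0
    label = "a"
    while label == "a":
        x, label = step(x)
        faults += 1
    return x, faults

if __name__ == "__main__":
    print(run(10))  # (2, 3): two successful pushes, then the underflow branch
```

In this model the "program counter" is just the label and the only data is the counter `x`, which is roughly the sense in which a single instruction of this shape is enough to build on.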

The details of how this computer has "memory" and addresses instructions are even headachier. They're using the x86 TSS as "memory", and for technical reasons they get 16 slots (and thus instructions) to work with, but they have a compiler that builds arbitrary programs into 16-colored graphs so that those slots can express generic programs. Every emulator they could find crashes when they abuse the hardware task switching system this way.

Here's it running Conway's Life:

http://youtubedoubler.com/?video1=E2VCwBzGdPM&start1=0&#...

Here's their talk from a few months back:

http://www.youtube.com/watch?v=NGXvJ1GKBKM

The talk is great, but if you're not super interested in X86/X64 memory corruption countermeasures, you might want to skip the first 30 minutes.

dbarlett · 13y ago

Reminds me of "ICMP delay-line memory": http://stackoverflow.com/questions/12748246/sorting-1-millio...

0x0 · 13y ago

The slides in the github repo ( https://github.com/jbangert/trapcc/blob/master/slides/PFLA-s... ) also have a few interesting points, like "No publicly available simulator implements this correctly" (how did they record the youtube video?) and a few vague hints about exploiting this for doing VM escapes.

jey · 13y ago

> "No publicly available simulator implements this correctly" (how did they record the youtube video?)

Probably by patching the emulator.

calt · 13y ago

> how did they record the youtube video?

By running it on a physical machine. Unless there is a requirement that the processor not be multitasking that I am missing.

tptacek · 13y ago

By fixing bugs in Bochs?

jd007 · 13y ago

Isn't it a bit misleading to say that they are getting computation without executing any instructions? It's true that they are not executing any x86 instructions from the CPU, but the MMU is doing all the work by executing its instructions of address resolution.

Actually is it even true that they are not executing any x86 instructions on the CPU? From my understanding, the handling of page faults needs the CPU to execute some instructions. Maybe I'm wrong, if so could you enlighten me? Thanks.

simias · 13y ago

I'm not very familiar with the x86 architecture, but usually when there's a fault the CPU attempts to look up the address of a "callback" function in an interrupt/fault vector.

I suppose if you set up everything very carefully you can make it fault over and over again without giving it the time to execute any instruction.

Without looking into the specifics, I think it's very possible that the CPU is not actually executing any instructions, just waiting for the MMU to get a hold of itself. After all, in order to simply load the instructions you need the MMU to be responsive (or deactivated I suppose, if there's such a thing as no-MMU x86).

codex · 13y ago

Another place for root kits to hide.

tomrod · 13y ago

Could one write a preemptive rootkit that sniffs for other rootkits?

Would that slow things down incredibly?

simias · 13y ago

That's called an antivirus :)

iamrohitbanga · 13y ago

can you please elaborate?

jpollock · 13y ago

It's computation that you can't see with a debugger, or with any sort of tracing.

jbangert · 13y ago

Author here: While it is true that with the current implementation, memory access is extremely limited (essentially one DWORD per page, or about 0.1% of the available physical RAM) that limitation can certainly be avoided. For one, you could shift how the TSS is aligned (and align them differently for different instructions), multiplying your address space by a factor of 10 or so. Furthermore, you could also place another TSS somewhere in memory (only a few of the variables need to actually contain sane values) with an invalid EIP and use that as a 'load' instruction.
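As a quick sanity check on the "about 0.1%" figure, here is the arithmetic as a small Python sketch. It assumes standard 4 KiB pages and a 4-byte DWORD; the "factor of 10 or so" is taken directly from the comment above rather than derived.

```python
# Back-of-the-envelope check of "one DWORD per page is about 0.1% of RAM".
# Assumptions: 4 KiB pages and 4-byte DWORDs.

PAGE_SIZE = 4096   # bytes per page (assumed standard 4 KiB x86 page)
DWORD_SIZE = 4     # bytes per DWORD

usable_fraction = DWORD_SIZE / PAGE_SIZE
print(f"one DWORD per page: {usable_fraction:.4%}")        # 0.0977%

# The comment suggests re-aligning the TSS could gain a factor of about 10.
ALIGNMENT_GAIN = 10  # rough figure quoted in the comment, not measured
print(f"with ~10x from re-alignment: {usable_fraction * ALIGNMENT_GAIN:.4%}")
```

So even with the alignment trick, the addressable data would still be around 1% of physical memory under these assumptions.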

The easiest way however would be to use the TrapCC mechanism to transfer control between bits of normal assembler code (perhaps repurposed from other functions already in your kernel), doing something similar to ROP. Of course, for additional fun, feel free to throw in BX's Brainfuck interpreter in ELF and James Oakley's DWARF exception handler. We might drop a demo of this soon, i.e. implementing a self-decrypting binary via page faults.

sounds · 13y ago

"memory access is extremely limited (essentially one DWORD per page" – referring to non-code addresses, yes? In the current (simplest) implementation, each instruction (a TSS) must be aligned across a page boundary. You do comment below that altering alignment could increase the available code space.

I'm wondering what method PFLA uses to read/write non-code addresses. Only one address per page can be addressed? I'll take a look at the compiler.

By simply expanding the addressing capability, a very tiny program could emulate an instruction stream from memory, overcoming the limited code space (at the cost of execution speed).

Cheers!

ars · 13y ago

How fast (slow) is this relative to the host CPU?

simias · 13y ago

Probably incredibly slow given the reduced instruction set and that it relies on context switches/pushing stuff on the stack for functioning.

traxtech · 13y ago

That's the hardware version of the brainfuck philosophy.

switch33 · 13y ago

Best explanation ever. I second this. lol

networked · 13y ago

>Move, Branch if Zero, Decrement

This is basically the canonical instruction for OISCs (one instruction set computers). Wikipedia describes it pretty well: https://en.wikipedia.org/wiki/One_instruction_set_computer#S....
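To make the OISC idea concrete, here is a minimal sketch of an interpreter for the textbook "subleq" instruction (subtract and branch if less than or equal to zero). This is the generic form found in OISC write-ups, not necessarily the exact instruction trapcc implements. The function name `run_subleq`, the memory layout, and the "negative address halts" convention are assumptions chosen for the example.

```python
# Minimal subleq interpreter (illustrative sketch, not trapcc's instruction).
# Each instruction is three cells (a, b, c):
#     mem[b] -= mem[a]; if mem[b] <= 0: jump to c, else fall through.
# Assumed convention: jumping to a negative address halts the machine.

def run_subleq(mem, pc=0, max_steps=10_000):
    """Run a subleq program in place and return the memory list."""
    steps = 0
    while 0 <= pc <= len(mem) - 3 and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem

# Example: add cell 9 (value 7) into cell 10 (value 5) via scratch cell 11.
program = [
    9, 11, 3,     # Z -= A      (Z becomes -7)
    11, 10, 6,    # B -= Z      (B becomes 5 + 7)
    11, 11, -1,   # Z -= Z, then jump to -1 to halt
    7, 5, 0,      # data: A = 7, B = 5, Z = 0
]

if __name__ == "__main__":
    print(run_subleq(list(program))[10])  # 12
```

Addition falls out of two subtractions through a scratch cell, which is the usual way to show that one instruction of this kind can express ordinary arithmetic and control flow.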

majke · 13y ago

There was a talk on 29c3 about this. Abstract:

https://events.ccc.de/congress/2012/Fahrplan/events/5265.en....

video:

https://www.youtube.com/watch?v=NGXvJ1GKBKM

rocky1138 · 13y ago

This is really interesting. In a way, it's a form of computer self-replication. Could the virtual machine created by the computer be considered offspring?

Is there a way the virtual machine might spawn another virtual machine child of its own?

ithkuil · 13y ago

If you like this kind of thing, there is also:

http://www.cs.dartmouth.edu/~bx/elf-bf-tools/slides/ELF-berl...

conductor · 13y ago

Expect this technique in future malware and software-protection DRM systems, to make code analysis harder.

general_failure · 13y ago

somebody checked in vim backup files :-)
