Wasm3 – A high performance WebAssembly interpreter in C

(github.com)

186 points · sound_and_form · 6y ago · 76 comments

42 comments · 7 top-level

CharlesW · 6y ago · 11 in thread

Why is an interpreter desirable when JIT compilers create significantly faster code? Is this primarily about embedded use?

ridiculous_fish · 6y ago

1. Some platforms do not have wasm JITs available, because nobody has written one.

2. Some platforms prohibit creating new executable pages, which prevents JITing.

3. Memory savings!

saagarjha · 6y ago

Plus some platforms allow all of the above, but need something that can run while the JIT warms up.

w0utert · 6y ago

Any JIT needs an interpreter to run code sections that are not compiled (yet), and large sections (I don't have any statistics, but I would guess a very sizable majority) of the code are 'cold' and will never get compiled at all. Compiling bytecode to machine code is a relatively expensive operation in itself, so it only makes sense to do it for 'hot' sections (loops, functions that get called many times, etc.).
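
A minimal sketch of the hot/cold split described above, assuming a simple per-function call counter. The names `FuncProfile`, `note_call` and the threshold of 1000 are hypothetical and not taken from any particular JIT; real engines typically also count loop iterations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: calls before a function counts as "hot". */
enum { HOT_THRESHOLD = 1000 };

typedef struct {
    uint32_t calls;    /* calls seen while still interpreted */
    bool     compiled; /* set once the function has been handed to the JIT */
} FuncProfile;

/* Called by the interpreter on every function entry.
   Returns true exactly once: on the call that makes the function hot.
   Cold functions never reach the threshold and stay interpreted. */
bool note_call(FuncProfile *f) {
    assert(f != NULL);
    if (f->compiled) return false;
    if (++f->calls < HOT_THRESHOLD) return false;
    f->compiled = true;
    return true;
}
```

The interpreter would request compilation only when `note_call` returns true, so the compile cost is paid once per hot function and never for cold ones.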

pjmlp · 6y ago

There are JIT implementations without an interpreter phase, .NET being one of them.

Matthias247 · 6y ago

Besides that, it can help with startup time. If the script that needs to be executed is a one-off thing, skipping compilation might save more time than a JIT could save during execution.

vijaybritto · 6y ago

From what I know, it's because a JIT uses a lot more memory than an interpreter. Also, LuaJIT uses very little memory, and that's why people have their minds blown all the time.

kbumsik · 6y ago

iOS prohibits JIT for example.

saagarjha · 6y ago

More specifically, apps submitted to the App Store may not utilize JIT.

vshymanskyy · 6y ago

Right. Wasm3 targets iOS as well.

Koshkin · 6y ago

On the embedded side, I'd rather see a hardware interpreter.

leetrout · 6y ago

I'm hardware ignorant... would that be a system on a chip?

ridiculous_fish · 6y ago · 10 in thread

This is pretty exciting if real:

> Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code

WASM compiled to a novel bytecode format aimed at efficient interpretation.

> Commonly occurring sequences of operations can also be optimized into a "fused" operation.

Peephole optimizations producing fused opcodes, makes sense.

> In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach

WASM translated to register-based bytecode. That's awesome!

> Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers.

This is some black magic, if it works!
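
To make the "fused operation" idea concrete, here is a minimal peephole sketch. It assumes a hypothetical three-opcode stack bytecode invented for illustration (`OP_PUSH`, `OP_ADD`, and the fused `OP_ADD_IMM`); it is not M3's actual operation set or compilation pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, invented for illustration. */
typedef enum { OP_PUSH, OP_ADD, OP_ADD_IMM } Opcode;
typedef struct { Opcode op; int imm; } Insn;

/* Copies `in` to `out`, replacing each adjacent "PUSH k; ADD" pair with a
   single fused "ADD_IMM k". Assumes straight-line code: no branch may
   target the ADD of a fused pair. `out` needs room for n instructions.
   Returns the number of instructions written. */
size_t fuse_push_add(const Insn *in, size_t n, Insn *out) {
    assert(in != NULL && out != NULL);
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n && in[i].op == OP_PUSH && in[i + 1].op == OP_ADD) {
            out[m++] = (Insn){ OP_ADD_IMM, in[i].imm };
            i++; /* skip the ADD that was folded into the fused op */
        } else {
            out[m++] = in[i];
        }
    }
    return m;
}
```

Each fused pair saves one dispatch and one push/pop round trip, which is where an interpreter tends to spend much of its time.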

vnorilo · 6y ago

Sure this is neat stuff, but I don't think any of it is novel. Bochs is a good source for some bytecode vm performance wizardry [1], even if the bytecode in question is the x86 ISA.

Regardless, kudos to the authors and nice to see a fast wasm interpreter done well.

1: http://www.emulators.com/docs/nx25_nostradamus.htm

saagarjha · 6y ago

Yeah, threaded interpretation is nothing new. Notably, Forth compilers have often used it; the iSH x86 emulator (https://github.com/tbodt/ish) is a more recent example of this technique.

vshymanskyy · 6y ago

It's not only the "threaded code" approach that makes Wasm3 so fast. In fact, Intel's WAMR also utilizes this method, yet is 30 times slower.

kevingadd · 6y ago

IR getting converted into an interpreter-oriented bytecode is pretty common. Mono does it for its interpreter and, IIRC, SpiderMonkey has historically done that as well. I'm not sure if V8 has ever interpreted from an IR (maybe now?) but you could view their original 'baseline JIT' model as converting into an interpreter-focused IR, where the IR just happened to be extremely unoptimized x86 assembly.

Translating the stack machine into registers was always a core part of the model, but it's interesting to me that even interpreters are doing it. The necessity of doing coloring to assign registers efficiently is kind of unfortunate; I feel like the WASM compiler would have been the right place to do this offline.

jashmatthews · 6y ago

> The necessity of doing coloring to assign registers efficiently is kind of unfortunate

Register based VMs like Lua don't do this. The register allocation is incredibly simple https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_parse.c#L3...

saagarjha · 6y ago

> The necessity of doing coloring to assign registers efficiently is kind of unfortunate

You don't have to do register allocation with coloring; it's just that most implementations do.
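
One coloring-free scheme, sketched under assumptions: because the operand stack depth is known statically at every instruction, the slot at depth d can simply become register d. The tiny stack and register instruction sets below (`S_PUSH`, `R_LOADI`, and so on) are hypothetical, and this is not the actual allocator of Lua, LuaJIT, or Wasm3.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_REGS = 8 }; /* hypothetical register file size */

typedef enum { S_PUSH, S_ADD, S_MUL } StackOp;
typedef struct { StackOp op; int imm; } StackInsn;

typedef enum { R_LOADI, R_ADD, R_MUL } RegOp;
/* R_LOADI: r[dst] = a (immediate); R_ADD/R_MUL: r[dst] = r[a] op r[b] */
typedef struct { RegOp op; int dst, a, b; } RegInsn;

/* One pass, no graph coloring: the stack depth is the register number.
   Emits exactly one register instruction per stack instruction.
   Returns 0 on success, -1 on malformed input or too-deep stack. */
int translate(const StackInsn *in, size_t n, RegInsn *out) {
    assert(in != NULL && out != NULL);
    int depth = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].op == S_PUSH) {
            if (depth >= MAX_REGS) return -1;
            out[i] = (RegInsn){ R_LOADI, depth, in[i].imm, 0 };
            depth++;
        } else {
            if (depth < 2) return -1;
            RegOp op = (in[i].op == S_ADD) ? R_ADD : R_MUL;
            out[i] = (RegInsn){ op, depth - 2, depth - 2, depth - 1 };
            depth--;
        }
    }
    return depth == 1 ? 0 : -1; /* the result ends up in r[0] */
}

/* Reference evaluator for the register code. */
int eval_regs(const RegInsn *code, size_t n) {
    int r[MAX_REGS] = { 0 };
    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case R_LOADI: r[code[i].dst] = code[i].a; break;
        case R_ADD:   r[code[i].dst] = r[code[i].a] + r[code[i].b]; break;
        case R_MUL:   r[code[i].dst] = r[code[i].a] * r[code[i].b]; break;
        }
    }
    return r[0];
}
```

For `push 2; push 3; add; push 4; mul` this yields `r0 = 2; r1 = 3; r0 = r0 + r1; r1 = 4; r0 = r0 * r1`, reusing r1 as soon as its value has been consumed.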

uasm · 6y ago

> "WASM translated to register-based bytecode. That's awesome!"

If the hardware executing this code is "stack-based" (or does not offer enough general-purpose registers to accommodate the function call), this will need to be converted back to a stack-based function call (either at runtime or beforehand). Wouldn't this intermediate WASM-to-register-based-bytecode translation be redundant then?

tom_mellior · 6y ago

You seem to be asking something like "if we always hit the algorithm's slow path, isn't the algorithm slow?". The answer is "yes, but we will (hopefully) almost never hit the slow path". On x86-64 you will typically be able to pass 6 integer/pointer values and 8 floating-point values in registers. That should be enough for most function calls in real-world code.

tyingq · 6y ago

I don't know of any current physical stack machine CPUs.

AlEinstein · 6y ago

What proportion of machines running WebAssembly are stack-based, would you say?

I would guess negligible.

29athrowaway · 6y ago · 7 in thread

The motivation for WebAssembly rather than plain ARM or x86 assembly is portability and security.

It would be interesting to see how this is designed with security in mind.

DarthGhandi · 6y ago

Pardon my wasm illiteracy here, what exactly makes it more secure?

Struggling to see it.

ridiculous_fish · 6y ago

WASM has a sandboxing model. The idea is:

1. Control flow is always checked. You can't jump to an arbitrary address, you jump to index N in a control flow table.

2. Calls out of the sandbox are also table based.

3. Indexed accesses are bounds checked. On 64 bit platforms, this is achieved by demoting the wasm to 32 bit and using big guard pages. On 32 bit platforms, it's explicit compares.

The result is something which may become internally inconsistent (Heartbleed-style bugs are still possible) but cannot perform arbitrary accesses to host memory.
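
A minimal sketch of the checks in points 1 to 3, using the explicit-compare variant rather than guard pages. The `Sandbox` layout and the function names are hypothetical and do not come from any real wasm runtime; a failed check returns false where a real engine would trap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t (*GuestFn)(int32_t);

/* Hypothetical sandbox state: one linear memory and one function table. */
typedef struct {
    uint8_t  *mem;
    uint32_t  mem_size;
    GuestFn  *table;
    uint32_t  table_len;
} Sandbox;

/* Bounds-checked 32-bit load from linear memory. */
bool sb_load_i32(const Sandbox *sb, uint32_t addr, int32_t *out) {
    assert(sb != NULL && out != NULL);
    /* Widen to 64 bits so addr + 4 cannot wrap around. */
    if ((uint64_t)addr + sizeof *out > sb->mem_size) return false;
    memcpy(out, sb->mem + addr, sizeof *out);
    return true;
}

/* Indirect call: the guest supplies a table index, never a raw address. */
bool sb_call_indirect(const Sandbox *sb, uint32_t index, int32_t arg,
                      int32_t *result) {
    assert(sb != NULL && result != NULL);
    if (index >= sb->table_len || sb->table[index] == NULL) return false;
    *result = sb->table[index](arg);
    return true;
}

/* Example table entry, standing in for a guest function. */
static int32_t guest_double(int32_t x) { return x * 2; }
```

The guest can still corrupt its own linear memory through in-bounds accesses, but it cannot name an address or a call target outside what the host put in `mem` and `table`.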

sanxiyn · 6y ago

Unlike in x86 or ARM assembly, you can't overwrite the return address in WebAssembly.

earenndil · 6y ago

Sandboxing is trivial, because you can cover all paths to the outside world.

29athrowaway · 6y ago

All web APIs are designed to be secure in the context of the web.

It needs to be memory safe; otherwise a wasm program can execute arbitrary code, access memory that it should not, etc.

kick · 6y ago

This is designed for microcontrollers. If you're running untrusted code on microcontrollers, you've got bigger problems.

saagarjha · 6y ago

I wouldn't go as far as to say it's designed for microcontrollers; as others have mentioned, there's a number of potential applications for this in other contexts as well.

setheron · 6y ago · 3 in thread

The neater article seems to be the one about the M3 interpreter: https://github.com/soundandform/m3#m3-massey-meta-machine

Tbh, I couldn't get the eureka moment though. Might try to read in the AM ;)

thermals · 6y ago

Yeah, this is a good way to design a fast interpreter! It's traditionally called a "threaded interpreter", or (somewhat confusingly) "threaded code":

https://en.wikipedia.org/wiki/Threaded_code

http://www.complang.tuwien.ac.at/forth/threaded-code.html

You can see an example of this particular implementation style (where each operation is a tail call to a C function, passing the registers as arguments) at the second link above, under "continuation-passing style".

One of the big advantages of a threaded interpreter is relatively good branch prediction. A simple switch-based dispatch loop has a single indirect jump at its core, which is almost entirely unpredictable -- whereas threaded dispatch puts a copy of that indirect jump at the end of each opcode's implementation, giving the branch predictor way more data to work with. Effectively, you're letting it use the current opcode to help predict the next opcode!
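
The continuation-passing style described above can be sketched in C roughly as follows. This is a hypothetical four-opcode VM with a single virtual register, not Wasm3's code; the names `Op`, `NEXT` and `run` are invented here. It relies on the compiler turning the final call of each operation into a jump, which is typical at optimization levels such as `-O2` but not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Op Op;
/* Every operation has the same signature: the code pointer plus the
   virtual register r0, which the compiler can keep in a CPU register. */
typedef int64_t (*OpFn)(const Op *pc, int64_t r0);
struct Op { OpFn fn; int64_t imm; };

/* Dispatch: each operation ends with its own indirect call in tail
   position, so the branch predictor sees one call site per opcode. */
#define NEXT(pc, r0) return (pc)[1].fn((pc) + 1, (r0))

static int64_t op_const(const Op *pc, int64_t r0) {
    (void)r0;                 /* previous value is overwritten */
    NEXT(pc, pc->imm);
}

static int64_t op_add_imm(const Op *pc, int64_t r0) {
    NEXT(pc, r0 + pc->imm);
}

static int64_t op_mul_imm(const Op *pc, int64_t r0) {
    NEXT(pc, r0 * pc->imm);
}

static int64_t op_halt(const Op *pc, int64_t r0) {
    (void)pc;
    return r0;                /* ends the chain of tail calls */
}

/* Runs a program that must end with op_halt. */
int64_t run(const Op *program) {
    assert(program != NULL);
    return program[0].fn(program, 0);
}
```

A program is just an array of `Op` values, for example `{op_const, 5}, {op_add_imm, 3}, {op_mul_imm, 2}, {op_halt, 0}`, which computes (5 + 3) * 2.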

vshymanskyy · 6y ago

Yeah, but... It's not only the "threaded code" approach that makes Wasm3 so fast. In fact, Intel's WAMR also utilizes this method, yet is 30 times slower.

sound_and_form (OP) · 6y ago

Thanks for the links! I've long searched Google trying to find a similar "tail-call" interpreter. No wonder I couldn't hit anything -- it was so poorly named! :)

haberman · 6y ago · 2 in thread

These are impressive performance numbers.

> Because operations end with a call to the next function, the C compiler will tail-call optimize most operations.

It appears that this relies on tail-call optimization to avoid overflowing the stack. Unfortunately this means you probably can't run it in debug mode.

vshymanskyy · 6y ago

It's not that bad even in debug mode (or without TCO). Just not optimal. Also, there is a way to rework this part, so it does not rely on compiler TCO.
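
The comment does not say which rework is meant. One common option, shown here purely as an assumption, is a trampoline: each handler returns the next instruction to a driver loop instead of calling it, so stack depth stays constant with or without TCO. All names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Insn Insn;
typedef struct { int64_t acc; int64_t counter; } Vm;

/* Each handler does its work and returns the next instruction to run,
   or NULL to halt. No handler ever calls another handler. */
typedef const Insn *(*Handler)(const Insn *pc, Vm *vm);
struct Insn { Handler fn; int64_t imm; };

static const Insn *h_add(const Insn *pc, Vm *vm) {
    vm->acc += pc->imm;
    return pc + 1;
}

/* Decrements the loop counter and jumps back `imm` instructions while
   the counter is still positive. */
static const Insn *h_loop(const Insn *pc, Vm *vm) {
    if (--vm->counter > 0) return pc - pc->imm;
    return pc + 1;
}

static const Insn *h_halt(const Insn *pc, Vm *vm) {
    (void)pc; (void)vm;
    return NULL;
}

/* Driver loop: one stack frame per handler at a time, regardless of how
   many instructions execute. */
int64_t run_trampoline(const Insn *pc, int64_t iterations) {
    assert(pc != NULL);
    Vm vm = { 0, iterations };
    while (pc != NULL) pc = pc->fn(pc, &vm);
    return vm.acc;
}
```

The cost is that dispatch goes back through a single shared call site in the driver loop, giving up some of the branch-prediction benefit of threaded dispatch.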

haberman · 6y ago

If the jump to the next opcode is a tail call, wouldn't an arbitrarily long sequence of instructions take arbitrarily much stack space?

MuffinFlavored · 6y ago · 1 in thread

> Node v13.0.1 (interpreter) 28 59.5x

https://github.com/wasm3/wasm3/blob/master/test/benchmark/co...

59.5x faster than Node.js at what? Executing WebAssembly?

vshymanskyy · 6y ago

V8 has a built-in (pure, no JIT etc.) interpreter for WASM, which is quite slow according to this test.

jononor · 6y ago · 1 in thread

Impressive list of constrained targets for embedded. The ATmega1284 microcontroller, for example, has only 16 KB of RAM, which is a lot for an 8-bit micro, but pretty standard for a modern application processor.

vshymanskyy · 6y ago

Yup. TinyBLE is an nRF51 SoC with 16 KB of SRAM as well.
