Tooling for WebAssembly is held mostly by the browser vendors. It is such a nice format to work with once you remove all the fluff. WebAssembly tooling should not take seconds to do what should take milliseconds, and it should be usable as a library, not just a command-line program.
I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly. Web49 started as a way to compile WebAssembly to MiniVM, but soon pivoted into its own interpreter and tooling. I could not be happier with it in its current form and am excited to see what else it can do with more work.
I'd be very interested to read more about this. It looks like you are using "one big function" with computed goto (https://github.com/FastVM/Web49/blob/main/src/interp/interp....). My experience working on this problem led me to the same conclusion as Mike Pall, which is that compilers do not do well with this pattern (particularly when it comes to register allocation): http://lua-users.org/lists/lua-l/2011-02/msg00742.html
I'm curious how you worked around the problem of poor register allocation in the compiler. I've come to the conclusion that tail calls are the best solution to this problem: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
As compared to hand-written assembly or the tail-call technique you describe, yes. But (for the benefit of onlookers) a threaded switch, especially one using computed gotos, is still more performant than a traditional function dispatch table.
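For onlookers, here is a minimal sketch of what "threaded" dispatch with computed gotos looks like (hypothetical opcodes, not Web49's actual code). Each handler ends by jumping directly to the next opcode's handler, so the CPU sees one indirect branch per handler site rather than a single shared, hard-to-predict branch at the top of a loop:

```c
#include <stdint.h>
#include <stddef.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Threaded dispatch via computed goto (a GNU C extension supported by
   GCC and Clang). The dispatch `goto *labels[*ip++]` is replicated at
   the end of every handler instead of returning to a central switch. */
static int64_t run(const uint8_t *code) {
    static void *labels[] = { &&do_push1, &&do_add, &&do_halt };
    int64_t stack[64];
    size_t sp = 0;
    const uint8_t *ip = code;

    goto *labels[*ip++];          /* initial dispatch */

do_push1:
    stack[sp++] = 1;
    goto *labels[*ip++];          /* dispatch replicated per handler */
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto *labels[*ip++];
do_halt:
    return stack[sp - 1];
}
```

A plain `switch` in a loop funnels every handler back through one branch; replicating the dispatch gives the branch predictor per-handler history, which is where most of the speedup over a function-pointer dispatch table comes from.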
Has there been any movement in GCC wrt the tailcalls feature?
One of the limitations of computed gotos is the inability to derive the address of a label from outside the function. You always end up with some amount of superfluous conditional code for selecting the address inside the function, or indexing through a table.

Several years ago, when exploring this space, I discovered a hack, albeit one that only works with GCC (IIRC), at least as of ~10 years ago. GCC supports inline function definitions, inline functions have visibility into goto labels (notwithstanding that you're not supposed to make use of them), and, most surprisingly, GCC also supports attaching __attribute__((constructor)) to inline function definitions. This means you can export a map of goto labels that can be used to initialize VM data structures, permitting (in theory) more efficient direct threading.
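To illustrate the problem being worked around: since `&&label` is only legal inside the function that owns the label, the usual portable approach is the "superfluous conditional code" described above, i.e. a one-time export call where the interpreter fills in its own label table before any real execution. A hedged sketch with hypothetical names:

```c
#include <stddef.h>

enum { OP_NOP, OP_HALT, OP_COUNT };

/* Common workaround: calling interp(NULL, table) makes the function
   export its label addresses; a loader can then pre-translate opcodes
   into direct label addresses ("direct threading"). The `if` branch is
   the superfluous check that the GCC label-export hack avoids. */
static int interp(void **code, void **out_labels) {
    static void *labels[OP_COUNT] = { &&do_nop, &&do_halt };
    if (out_labels) {                      /* export mode, runs once */
        for (int i = 0; i < OP_COUNT; i++) out_labels[i] = labels[i];
        return 0;
    }
    void **ip = code;
    goto **ip;
do_nop:
    ip++;
    goto **ip;
do_halt:
    return 1;
}
```

Usage: the host calls `interp(NULL, labels)` once at startup, builds a program directly out of label addresses, and then runs `interp(prog, NULL)` with no per-instruction table lookup.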
The tailcall technique is a much more sane and profitable approach, of course.
Note that that message is from twelve years ago. A lot's changed since then, not just in compilers but in CPUs. Branch prediction is a lot better now.
And as to why the rest are faster: I spent a lot of time optimizing the interpreter and learning the best ways to write interpreters. It's mostly jump threading and mixed data.
Nice work!
This doesn't match my experience. After working on this problem a lot, I came to the conclusion that musttail with opaque function calls is one of the best ways of getting good code out of the compiler: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
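A minimal sketch of that style, with hypothetical opcodes (the `MUSTTAIL` macro expands to Clang's `__attribute__((musttail))` where available, and to nothing elsewhere, in which case the calls still work but may grow the stack). Each handler is its own small function, so the compiler does register allocation per handler, and the guaranteed tail call keeps dispatch as a single jump:

```c
#include <stdint.h>

#if defined(__clang__)
#define MUSTTAIL __attribute__((musttail))
#else
#define MUSTTAIL /* plain call: correct, but not a guaranteed tail call */
#endif

enum { OP_INC, OP_HALT };

typedef int64_t (*op_fn)(const uint8_t *ip, int64_t acc);

static int64_t op_inc(const uint8_t *ip, int64_t acc);
static int64_t op_halt(const uint8_t *ip, int64_t acc);

/* Dispatch table of per-opcode handler functions. */
static op_fn const dispatch[] = { op_inc, op_halt };

static int64_t op_inc(const uint8_t *ip, int64_t acc) {
    acc += 1;
    ip += 1;
    /* Tail-call the next handler: dispatch stays one indirect jump. */
    MUSTTAIL return dispatch[*ip](ip, acc);
}

static int64_t op_halt(const uint8_t *ip, int64_t acc) {
    (void)ip;
    return acc;
}

static int64_t run(const uint8_t *code) {
    return dispatch[*code](code, 0);
}
```

Keeping the interpreter state (`ip`, `acc`) in the handler arguments is the key trick: the calling convention pins those values to registers across every handler, which is exactly what one-big-function interpreters struggle to get from the register allocator.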
I'm really interested in a fast interpreter-only Wasm VM that can allow the host to share some of its memory with the VM.
> git clone ...
> make -j
> ./bin/wasm2wat ./test/core/address.wast
I get:

>>> ./bin/wat2wasm test/core/address.wast
unexpected word: `` byte=256

Will work on MiniVM stop as a result of Web49?