Writing a minimal Lua implementation with a virtual machine from scratch in Rust

(notes.eatonphil.com)

133 points · finite_jest · 4y ago · 29 comments

23 comments · 7 top-level

jstimpfle · 4y ago · 5 in thread

Being that tokens are the leaves of the AST, there are a lot of them and they can take a lot of space. To save memory it is a good idea to store only a file location instead of a full token. Whenever token information is needed, just lex again to get the full token, starting at the file location. This works only for languages with a context-free lexical syntax, of course (and not entirely sure "context-free" is the right term here but you get what I mean).

Storing row/column in file location data is wasteful - just a file offset should be enough. Whenever the row/column coordinates are needed (normally only in user messages) they can be quickly recomputed.

In effect, parsed tokens can be stored as just an offset - a 4 or 8 byte integer.
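
A minimal sketch of the "offset only" idea, assuming tokens carry a byte offset and that line/column pairs are only rebuilt when a message has to be printed. The helper names `line_starts` and `line_col` are hypothetical, and columns are counted in bytes for simplicity:

```rust
/// Hypothetical helper: byte offsets at which each line of `src` starts.
fn line_starts(src: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in src.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

/// Converts a byte offset into a 1-based (line, column) pair.
/// Columns are counted in bytes, which is an assumption of this sketch.
fn line_col(starts: &[usize], offset: usize) -> (usize, usize) {
    // Number of line starts that are <= offset, i.e. the 1-based line number.
    let line = starts.partition_point(|&s| s <= offset);
    let col = offset - starts[line - 1] + 1;
    (line, col)
}
```

For the source `"local a = 1\nprint(a)\n"`, offset 12 (the `p` of `print`) maps to line 2, column 1. The line-start table can be built once per file, or even on demand when the first diagnostic is emitted.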

eatonphil · 4y ago

This sounds like premature optimization to me (I know, that phrase gets used a lot lately). As with all optimizations it probably makes sense to not do it and keep your code highly readable until the point where you've profiled issues at this level.

And definitely my focus in writing a post like this is to be as explicit as reasonable to serve as the best educational reference.

I also haven't seen real world compilers do what you say (thinking V8, CPython, other mainstream ones) but I'll keep an eye out for it now.

fsloth · 4y ago

Is this really the right space to optimize? Having the relevant data on hand in a tight array is generally much faster than having to fetch and compute something from somewhere else?

In general it is of course better to store an index to data where it's needed rather than a copy of the data.

jstimpfle · 4y ago

I've seen other experienced people do this, like Jonathan Blow (Jai) and Per Vognsen (bitwise) I believe. I recall measuring a large space overhead from tokens myself when working on my toy compiler.

Token data will not typically be needed a lot: The token kind (integer-literal, plus-operator, open-paren etc.) at exactly one point in the parsing phase and possibly in the type checking phase but you could have a separate "literal expression" kind for that. Binary payloads (string bytes, floating point value etc.) will be needed in the constant phase for literal tokens. I wouldn't expect that a little optimization here is a difficult tradeoff to make - neither with regards to speed nor code complexity.

Update, here is bitwise's code. It looks very good to me from a glance: https://github.com/pervognsen/bitwise/blob/master/ion/lex.c . I think thinking about this gives interesting insights, and the token struct approach is an illustration of how "object-oriented" approaches can go wrong. The concept of a token exists merely at the syntactical level, and the token kind controls the code flow of a recursive descent parser, but I don't see any good reasons to store a token represented as a tagged union. Punctuation tokens are completely ephemeral, and constant data, literal kinds, operator kinds etc. all should better go to very distinct places in the data store that have no resemblance to "token objects".

CJefferson · 4y ago

It might be worth storing this data separately, to improve cache usage, but my experience is that on any modern machine the amount of space taken by parsed code / ASTs is small enough to not care about, unless you do something like C++ where you "reparse the world" for every tiny file (due to header includes).

jstimpfle · 4y ago

Maybe it's only necessary for serious compilers. Let's say 10 tokens / line avg., 40 bytes / token (but a naive token struct could easily be 100s of bytes), then we're in the region of 100s to 1000s of bytes per line. Now let's say compiling 100K lines would not be unheard of, and for benchmarks you want to push the millions. We're getting into regions where tokens alone can fill a computer's memory. If there is an easy to make and effective optimization, I'm all for it :-)
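
To make that estimate concrete, here is a back-of-the-envelope helper. The function name is hypothetical and the figures plugged in are the illustrative ones from the comment above, not measurements:

```rust
/// Rough token-storage estimate in bytes. All three parameters are
/// illustrative inputs, not measured values.
fn token_bytes(lines: u64, tokens_per_line: u64, bytes_per_token: u64) -> u64 {
    lines * tokens_per_line * bytes_per_token
}
```

With 1,000,000 lines, 10 tokens per line and 40 bytes per token this gives 400,000,000 bytes (about 400 MB); storing a 4-byte offset per token instead gives 40,000,000 bytes (about 40 MB).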

da39a3ee · 4y ago · 3 in thread

The article looks great and I’m looking forward to reading it; this comment is not a criticism of the article.

This API is the only bad thing about Rust!

  .expect("Could not read file")

It’s so unfortunate to have an API that reads

  .expect("thing we don’t expect")

I think we should all just forget it’s there and use

  .unwrap_or_else(|_| panic!("thing we don’t expect"))

vaylian · 4y ago

"expect" is like the word "assert". You express which invariant should always be true and therefore there should be no failure. The use of `.expect("Could not read file")` is actually wrong, because it can fail.

hvdijk · 4y ago

That is not how Rust documents expect() should be used. Rust's own documentation (https://doc.rust-lang.org/std/result/) contains:

  let mut file = File::create("valuable_data.txt").unwrap();
  file.write_all(b"important message").expect("failed to write message");

The use of .expect("Could not read file") is consistent with Rust's documentation.

And I agree with OP that .expect() is poorly named. I wouldn't raise it as an issue on the work of others who just use Rust and are stuck with it though.
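
For comparison, a small sketch of the two spellings discussed in this subthread, applied to reading a file. The function names are made up for illustration; `fs::read_to_string`, `Result::expect` and `Result::unwrap_or_else` are standard library calls. Note that on a `Result` the closure passed to `unwrap_or_else` receives the error as its argument:

```rust
use std::fs;

/// Panics with a fixed message (plus the error's Debug output) on failure.
fn read_with_expect(path: &str) -> String {
    fs::read_to_string(path).expect("Could not read file")
}

/// Same behaviour spelled with an explicit panic, which also makes it easy
/// to include the path and the error in the message.
fn read_with_unwrap_or_else(path: &str) -> String {
    fs::read_to_string(path)
        .unwrap_or_else(|err| panic!("Could not read file {}: {}", path, err))
}
```

Both functions panic when the file is missing; the difference is only in how the call site reads and what ends up in the panic message.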

rastignack · 4y ago

I also wish I could write:

.or_panic("unexpected fatal error")

.unwrap_or_panic

eatonphil · 4y ago · 3 in thread

Hey folks just saw this, author here. Happy to answer questions!

FullyFunctional · 4y ago

Always happy to read about compilers, Rust, or VMs. One quick and cheap trick that can help your VM, especially when you have a lot of arithmetic: explicitly represent the top of stack. E.g.

            Instruction::Add => {
                let left = data.pop().unwrap();
                top = left + top;
                pc += 1;
            }
            Instruction::Subtract => {
                let left = data.pop().unwrap();
                top = left - top;
                pc += 1;
            }
            Instruction::LessThan => {
                let left = data.pop().unwrap();
                top = if left < top { 1 } else { 0 };
                pc += 1;
            }
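
A self-contained sketch of the same trick, assuming a tiny integer-only instruction set. The `Instruction` enum and `run` function here are simplified stand-ins, not the article's actual definitions. The top of the stack lives in a local variable, so each binary operation does one `pop` instead of two pops and a push:

```rust
#[derive(Debug, Clone, Copy)]
enum Instruction {
    Push(i32),
    Add,
    Subtract,
    LessThan,
}

/// Runs a program and returns the final top-of-stack value.
fn run(program: &[Instruction]) -> i32 {
    let mut data: Vec<i32> = Vec::new(); // everything below the top
    let mut top: i32 = 0; // cached top of stack
    let mut pc = 0;
    while pc < program.len() {
        match program[pc] {
            Instruction::Push(value) => {
                // Spill the old top to memory. The very first push spills
                // a dummy 0 to the bottom of the stack, which is harmless.
                data.push(top);
                top = value;
            }
            Instruction::Add => {
                let left = data.pop().unwrap();
                top = left + top;
            }
            Instruction::Subtract => {
                let left = data.pop().unwrap();
                top = left - top;
            }
            Instruction::LessThan => {
                let left = data.pop().unwrap();
                top = if left < top { 1 } else { 0 };
            }
        }
        pc += 1;
    }
    top
}
```

For example, `run(&[Instruction::Push(5), Instruction::Push(3), Instruction::Subtract])` returns `2`.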

eatonphil · 4y ago

Yep, you're right. I think I made a note of it in the post at one or two points that I'm doing this stack pushing and popping inefficiently.

KineticLensman · 4y ago

No questions (yet) but I found this a really good post. I'm a Rust newbie and I've coincidentally just started working through 'Crafting Interpreters' using Rust as my implementation language. Some of your code has helped clarify things in my mind.

My actual long-term goal is to build a VM in Rust and then use this to re-do the Make A Lisp project [0]. I completed this a couple of years ago using C# but felt vaguely uneasy that I was using C# to do the heavy lifting associated with garbage collection, etc.

[0] https://github.com/kanaka/mal

cgoto89798 · 4y ago · 3 in thread

Does Rust have computed goto, which really helps interpreter speed?

It basically means you can do something like "goto opcode_table[*(++ip)];"

GCC offers it as a non-standard extension to C.

  https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

FORTRAN has had it since 1957. But Pascal and C purged "evil computed GOTO" and only offered non-computed goto. Then Java etc. purged non-computed goto.

chc4 · 4y ago

Not as a language feature. You can set up your code in a way that will cause LLVM to emit the same output as if you did have computed gotos, but there's no guarantee of it.

melony · 4y ago

What's the overhead of storing function pointers in a Rust array and calling it by index?

shadowofneptune · 4y ago

You have to call the function. This adds a new stack frame, which then is closed when the function concludes. This means you're constantly going in, going out, going in going out...

I published a benchmark of various techniques of VM implementation today. It was conducted in C++ compliant C, but the results should be applicable to Rust. Just calling the function will always be the slowest due to the work the processor has to do for each instruction.

https://github.com/shadowofneptune/threaded-code-benchmark

Basically, you're better off with a massive switch statement, even if it's ugly in terms of program organization.
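
To illustrate the two dispatch styles being compared, here is a small Rust sketch with a made-up three-opcode instruction set (`Op`, `Vm` and the handler functions are all hypothetical). It only shows the shape of each approach; which one is faster depends on the compiler and the workload, so measure before relying on either:

```rust
#[derive(Clone, Copy)]
enum Op {
    Inc = 0,
    Double = 1,
    Halt = 2,
}

struct Vm {
    acc: i64,
    running: bool,
}

// One handler function per opcode, for the table-based variant.
fn op_inc(vm: &mut Vm) { vm.acc += 1; }
fn op_double(vm: &mut Vm) { vm.acc *= 2; }
fn op_halt(vm: &mut Vm) { vm.running = false; }

// Indexed by the opcode's discriminant.
const TABLE: [fn(&mut Vm); 3] = [op_inc, op_double, op_halt];

/// Dispatch with one big `match` (the "massive switch" style).
fn run_match(code: &[Op]) -> i64 {
    let mut acc = 0i64;
    let mut pc = 0;
    while pc < code.len() {
        match code[pc] {
            Op::Inc => acc += 1,
            Op::Double => acc *= 2,
            Op::Halt => break,
        }
        pc += 1;
    }
    acc
}

/// Dispatch through an array of function pointers: one indirect call
/// per executed instruction.
fn run_table(code: &[Op]) -> i64 {
    let mut vm = Vm { acc: 0, running: true };
    let mut pc = 0;
    while vm.running && pc < code.len() {
        TABLE[code[pc] as usize](&mut vm);
        pc += 1;
    }
    vm.acc
}
```

Both functions return `4` for the program `[Inc, Inc, Double, Halt, Inc]`.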

1 more reply

duped · 4y ago · 2 in thread

Working on tokenization and parsing there have been two "lights clicking on" moments that I think every dev working on a PL implementation should have:

- Tokens are the leaves of your syntax trees

- File locations are relative, not absolute

It's easier to build a parser that doesn't buy into these things, but it's way harder to build tooling and good error messaging if you don't.

jstimpfle · 4y ago

Could you clarify what you mean by "relative" file locations? Relative to containing AST node?

duped · 4y ago

Relative to the preceding token in the stream, in number of lines and columns from the end of the last token (or beginning, but to me that's conceptually confusing). Take a look at the language server protocol specification for semantic tokens with relative positions as a reference.

Using an absolute line/column/index to indicate the location of a token in a file means you need to retokenize a large amount of the file whenever there's an edit. In a streaming parser that responds to live edits, storing the tokens with relative file locations allows you to only update neighbors on insertion or deletion (and also allows for bulk insert/delete). Absolute file locations can be recovered by scanning the token stream - which is why it's important for the tokens to form the leaves of the tree, it guarantees you can always recover the file locations.

Then you can pull some tricks like the Roslyn compiler does, which is to store things like comments, diagnostics, and white space as "trivia" associated with the tokens and very quickly recover the text that created the token stream/AST nodes. That's invaluable for tooling.
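
A rough sketch of relative token positions, assuming the encoding used by LSP semantic tokens: each token stores a line delta and a start delta measured from the previous token's start (or from column 0 when the token begins a new line). That differs slightly from the "end of the last token" variant described above. The type and function names are hypothetical:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct RelToken {
    delta_line: u32,  // lines since the previous token
    delta_start: u32, // columns since the previous token's start, or from
                      // column 0 if this token is on a new line
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct AbsPos {
    line: u32,
    col: u32,
}

/// Recovers absolute (0-based) positions by scanning the token stream once.
fn absolute_positions(tokens: &[RelToken]) -> Vec<AbsPos> {
    let mut line = 0;
    let mut col = 0;
    let mut out = Vec::with_capacity(tokens.len());
    for t in tokens {
        if t.delta_line > 0 {
            line += t.delta_line;
            col = t.delta_start; // new line: measured from column 0
        } else {
            col += t.delta_start;
        }
        out.push(AbsPos { line, col });
    }
    out
}
```

With this layout, inserting a blank line before a token only requires bumping that one token's `delta_line`; every later token keeps its stored deltas and still resolves to the right absolute position.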

1 more reply

xvilka · 4y ago

There's also Luster[1].

[1] https://github.com/kyren/luster

debdut · 4y ago

Thanks for sharing! A great learning
