Storing row/column in file location data is wasteful - just a file offset should be enough. Whenever the row/column coordinates are needed (normally only in user messages) they can be quickly recomputed.
In effect, parsed tokens can be stored as just an offset - a 4 or 8 byte integer.
And definitely my focus in writing a post like this is to be as explicit as reasonable to serve as the best educational reference.
I also haven't seen real world compilers do what you say (thinking V8, CPython, other mainstream ones) but I'll keep an eye out for it now.
In general it is of course better to store an index to a data where it's needed rather than a copy of the data.
Token data will not typically be needed a lot: The token kind (integer-literal, plus-operator, open-paren etc.) at exactly one point in the parsing phase and possibly in the type checking phase but you could have a separate "literal expression" kind for that. Binary payloads (string bytes, floating point value etc.) will be needed in the constant phase for literal tokens. I wouldn't expect that a little optimization here is a difficult tradeoff to make - neither with regards to speed nor code complexity.
Update, here is bitwise's code. It looks very good to me from a glance: https://github.com/pervognsen/bitwise/blob/master/ion/lex.c . I think thinking about this gives interesting insights, and the token struct approach is an illustration of how "object-oriented" approaches can go wrong. The concept of a token exists merely at the syntactical level, and the token kind controls the code flow of a recursive descent parser, but I don't see any good reasons to store a token represented as a tagged union. Punctuation tokens are completely ephemeral, and constant data, literal kinds, operator kinds etc. all should better go to very distinct places in the data store that have no resemblance to "token objects".
This API is the only bad thing about Rust!
.expect("Could not read file")
It’s so unfortunate to have an API that reads .expect("thing we don’t expect")
I think we should all just forget it’s there and use .unwrap_or_else(|| panic!(“thing we don’t expect”)) let mut file = File::create("valuable_data.txt").unwrap();
file.write_all(b"important message").expect("failed to write message");
The use of .expect("Could not read file") is consistent with Rust's documentation.And I agree with OP that .expect() is poorly named. I wouldn't raise it as an issue on the work of others who just use Rust and are stuck with it though.
.or_panic(“unexpected fatal error”)
Or
.unwrap_or_panic
Instruction::Add => {
let left = data.pop().unwrap();
top = left + top;
pc += 1;
}
Instruction::Subtract => {
let left = data.pop().unwrap();
top = left - top;
pc += 1;
}
Instruction::LessThan => {
let left = data.pop().unwrap();
top = if left < top { 1 } else { 0 });
pc += 1;
}My actual long-term goal is to build a VM in Rust and then use this to re-do the Make A Lisp project [0]. I completed this a couple of years ago using C# but felt vaguely uneasy that I was using C# to do the heavy lifting associated with garbage collection, etc.
It basically means you can do something like "goto opcode_table[*(++ip)];"
GCC offers it as a non-standard extension to C.
https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
FORTRAN has had it since 1957. But Pascal and C purged "evil computed GOTO" and only offered non-computed goto. Then Java etc. purged non-computed goto.I published a benchmark of various techniques of VM implementation today. It was conducted in C++ compliant C, but the results should be applicable to Rust. Just calling the function will always be the slowest due to the work the processor has to do for each instruction.
https://github.com/shadowofneptune/threaded-code-benchmark
Basically, you're better off with a massive switch statement, even if it's ugly in terms of program organization.
- Tokens are the leaves of your syntax trees
- File locations are relative, not absolute
It's easier to build a parser that doesn't buy into these things, but it's way harder to build tooling and good error messaging if you don't.
Using an absolute line/column/index to indicate the location of a token in a file means you need to retokenize a large amount of the file whenever there's an edit. In a streaming parser that responds to live edits, storing the tokens with relative file locations allows you to only update neighbors on insertion or deletion (and also allows for bulk insert/delete). Absolute file locations can be recovered by scanning the token stream - which is why it's important for the tokens to form the leaves of the tree, it guarantees you can always recover the file locations.
Then you can pull some tricks like the Roslyn compiler does, which is to store things like comments, diagnostics, and white space as "trivia" associated with the tokens and very quickly recover the text that created the token stream/AST nodes. That's invaluable for tooling.