Nevertheless, I think I'm learning more from this book than from most other books I've tried, which are far more theoretical or abstract. I'm still eager to reach the chapter on implementing C types. I think it's a good book, but it requires more effort than something like Crafting Interpreters or Writing a Compiler/Interpreter in Go, while also covering topics not in those books.
Plus, you get to become proficient in OCaml, which is a pretty good language.
It feels like a more advanced version of Crafting Interpreters.
I haven’t looked at the OCaml implementation at all. The text and unit tests are all you need.
Discussion on the Ada Forum: https://forum.ada-lang.io/t/writing-a-c-compiler/1024
Is duck typing the pseudo-unsafe alternative? (Not unsafe as in accessing unsafe memory, but as in throwing exceptions if the duck-typed function doesn’t exist on the current type)
Can C handle both?
Coming from a static type system like Rust and C#, I'm doing a lot of "if this is a duck, duck.quack()" and I'm looking for faster alternatives and less verbosity if possible.
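As a minimal Python sketch (mine, not from the thread) of the two usual alternatives to an explicit "if this is a duck" check: a capability test with hasattr ("look before you leap") versus just calling the method and catching the failure ("easier to ask forgiveness"). The class and method names are made up for illustration.

```python
class Duck:
    def quack(self):
        return "quack"

class Dog:
    pass  # no quack() method

def make_noise_lbyl(animal):
    # "Look before you leap": check for the capability explicitly.
    if hasattr(animal, "quack"):
        return animal.quack()
    return "..."

def make_noise_eafp(animal):
    # "Easier to ask forgiveness": call it and handle the failure.
    try:
        return animal.quack()
    except AttributeError:
        return "..."

print(make_noise_lbyl(Duck()))  # quack
print(make_noise_eafp(Dog()))   # ...
```

The EAFP style is usually considered more idiomatic in duck-typed languages, since it avoids a separate check that can fall out of sync with the call.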
data Tree a = Leaf a | Branch (Tree a) (Tree a)
fromList :: [a] -> Tree [a]
fromList = Leaf
toList :: Tree [a] -> [a]
toList (Leaf x) = x
toList (Branch (Leaf x) r) = x ++ toList r
toList (Branch (Branch l1 l2) r)
= toList (Branch l1 (Branch l2 r))
append :: Tree [a] -> Tree [a] -> Tree [a]
append = Branch
In a language that doesn't have tree pattern matching, the code wouldn't be this short and easy to understand, and I don't think it could be replicated just by having duck typing. Rust has pattern matching, but because it's primarily focused on lower-level concerns like pointers and memory ownership, pattern matching isn't this nice: you have to pattern match through the pointer first. Since a compiler is all about tree manipulation, support for tree pattern matching should be a boon.
[0]: http://h2.jaguarpaw.co.uk/posts/demystifying-dlist/
[1]: https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...
https://archive.org/details/byte-magazine-1978-09 (part 1)
All 3 parts of Tiny Pascal:
https://albillo.hpcalc.org/publications/Easter%20Egg%20-%20T...
The most obvious change you'll see is the use of SSA, which became the dominant IR representation 25-30 years ago.
There's also been an increase in the importance of compiler IRs, and especially the concept of code passing through multiple IRs before reaching machine code.
Formal semantics has become more of a thing in the past decade or so. It's now routine that even weak memory models have a detailed formal model of how they work. In LLVM, it's now a requirement that you demonstrate formal correctness of new InstCombine transformations (which are essentially peephole optimizations).
The use of parser generators has really fallen into disrepute; everything has transitioned to handrolled parsers these days. Language standards themselves are starting to rely on context-sensitive keywords, which are hard to implement in a generator-based lexer/parser setup.
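A toy illustration (mine, not from the comment) of why context-sensitive keywords favor handwritten parsers: the lexer stays context-free and emits the word as an ordinary identifier, and the parser decides at each use site whether it acts as a keyword. Here the contextual keyword is `final` after a class name, loosely modeled on C++; the grammar and token set are invented for the sketch.

```python
import re

def tokenize(src):
    # A context-free lexer: 'final' comes out as a plain identifier.
    return re.findall(r"[A-Za-z_]\w*|[{};]", src)

def parse(tokens):
    """Parse a toy declaration list where 'final' is a keyword only
    directly after a class name, and a plain identifier elsewhere."""
    decls, i = [], 0
    while i < len(tokens):
        if tokens[i] == "class":
            name = tokens[i + 1]
            i += 2
            is_final = False
            if i < len(tokens) and tokens[i] == "final":
                # Contextual: only here does 'final' mean anything special.
                is_final, i = True, i + 1
            assert tokens[i] == "{" and tokens[i + 1] == "}"
            i += 2
            decls.append(("class", name, is_final))
        else:
            # 'TYPE NAME ;' -- here 'final' may be an ordinary name.
            typ, name = tokens[i], tokens[i + 1]
            assert tokens[i + 2] == ";"
            i += 3
            decls.append(("var", typ, name))
    return decls

print(parse(tokenize("class Duck final { } int final ;")))
# [('class', 'Duck', True), ('var', 'int', 'final')]
```

In a classic yacc setup, the same word would have to be a single fixed token kind, which is exactly what makes this pattern awkward there.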
Optimizations have generally broadened in scope: whole-function analysis is now the default way to look at code, and a variety of techniques have been introduced to make whole-program optimization (aka LTO) much more tractable.
Another subtle but major shift is that compilers increasingly rely on analysis inferred from lower-level representations rather than on the originally declared intent of the code. For example, SROA in LLVM (which breaks up structs so that individual fields can be independently allocated to registers) looks not at how the struct is declared but at the offsets within the struct that the various load and store operations use.
A final major shift is the trend towards support for ancillary tooling in the programming language space, so that the compiler isn't merely a tool that goes from source to object code, but is something that can be actively queried for information about the source code. Things like the language server, or smart formatting, or automatic refactoring tooling.
That "middle-pass" approach that will let you address many targets is still valid; the trick is finding a sufficiently robust and flexible internal representation at the right level. You also have to be able to out-guess the chip vendors where before you could go to the architect or a complete "System" book and get the real scoop, including things you shouldn't do. Oddly enough, there is simultaneously useful and completely worthless documentation scattered about the internet.
You might want to take a look at Muchnick and Jones' _Program_Flow_Analysis_ (yes, it's from 1981); chapters 4-6 can be applied at code-generation time. How that fits modern Intel processors (for example) is unknown. Idealizing your processor as a RISC-V might be a reasonable way to proceed, but in the end you'll have to translate the code for the target; it will be reasonably straightforward if you drive it all from tables, but it's not trivial.
https://news.ycombinator.com/item?id=40940799
> So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago?
As far as I can tell, the main difference is that static single assignment (SSA) as an intermediate form was not the norm 30 years ago, but it is nowadays. Also, in newer books, it's more common to go over global register allocation now, whether that's graph coloring or linear scan register allocation. If you read old compiler books, the main optimizations they talk about are use-def chains, moving computations out of loops, and using the local and tree-based Sethi-Ullman register allocation algorithm.
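As a hypothetical sketch of what SSA renaming buys you: each assignment gets a fresh numbered version of its destination, and every use is rewritten to the latest version, so def-use relationships are explicit in the names themselves. This toy handles only straight-line code (no branches, so no phi nodes); the tuple IR format is invented for the example.

```python
def to_ssa(stmts):
    """Convert straight-line (dest, op, args) statements to SSA form by
    versioning each destination and renaming uses to the latest version."""
    version, out = {}, []
    for dest, op, args in stmts:
        # Rewrite uses to the current version; leave constants alone.
        new_args = [f"{a}{version[a]}" if a in version else a for a in args]
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}{version[dest]}", op, new_args))
    return out

prog = [("x", "const", ["1"]),
        ("x", "add",   ["x", "1"]),   # reads the old x, defines a new x
        ("y", "mul",   ["x", "x"])]
for line in to_ssa(prog):
    print(line)
# ('x1', 'const', ['1'])
# ('x2', 'add', ['x1', '1'])
# ('y1', 'mul', ['x2', 'x2'])
```

Once every name has exactly one definition, optimizations like constant propagation and dead-code elimination reduce to simple lookups, which is a large part of why SSA took over.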
Today most languages are front-ends for LLVM IR, but LLVM is very slow to optimize. Many new languages target x86/ARM directly with their own weakly optimizing backends and emit LLVM IR only for "release builds".
and, while we're talking about ocaml, ocaml does use ocamllex and ocamlyacc for its own parser
so, while you can certainly do without parser generators, they have very commonly been used for making real-world programming languages. almost every programming language anyone here has ever heard of was first implemented with a parser generator. the main exceptions are probably fortran, cobol, algol, lisps, c, and pascal
Nowadays, the difference between "big compiler, optimized" and "little compiler, not optimized" can be quite dramatic, but it's probably no more than 4x, which is within range of the distinction between a "systems programming language" and a "highly tuned JITted scripting language". I think most people are perfectly fine with the performance of highly tuned scripting languages. The result is that all of the overhead of the "big compiler" is just that: immaterial overhead. This is especially true for extremely well-tuned code, where the right algorithm and, as a last resort, assembly will easily beat the best optimizer by an order of magnitude or more.
Were they? GCC abandoned bison in favour of their own parser relatively recently.
On topic, though: wouldn't a simpler language (maybe even a pseudo-language) be a better target for a first learning compiler? I understand they don't build a full C compiler, but still. It looks to me like there's a lot of complexity added by choosing such a lofty target.
For writing an interpreter or transpiler, there are probably better options, but for a true compiler I can’t think of a better choice than C (or at least a subset of C).
SML is very dated and the standard library and ecosystem lack many things that are considered table stakes for a viable programming language nowadays. And F# and Scala are fine as enterprise languages, but being tied to .NET and Java respectively makes them less desirable for implementing a language that won't itself be coupled to one of those runtimes.
So OCaml was probably the most mainstream choice among the languages with appropriate tools, as funny as that sounds. And honestly, once you get over the syntax, it doesn’t actually have anything outrageous.
It's been covered on several threads here over the years [1].
[0]: https://craftinginterpreters.com/ [1]: https://hn.algolia.com/?q=crafting+interpreters
I’ve been bored with building line-of-business applications, despite designing for complex requirements in high-volume distributed systems.
In fact I took a break from CS learning entirely 9 months ago. Even reading HN. I’ve been studying electronics and analog signal processing instead.
But now that I’ve built about 50 guitar pedals of increasing complexity, I feel ready to switch back to CS studies again.
https://www.amazon.com/Writing-Compiler-Programming-Language...
https://onlinebooks.library.upenn.edu/webbin/book/lookupid?k...
Aside from that, I encourage everyone who cites Compiler Construction to actually work through the first 10% of the book and then count the number of errata.
While they teach similar content, they have a different approach.
There are literally thousands of compiler design books out there, I don't really see anything particularly comparable between this book and Wirth's
That last book was Allen Holub's "Compiler Design in C", which is from 1990. Here's how the blurb on the back describes it:
> Allen I. Holub's Compiler Design in C offers a comprehensive, new approach to compilers that proves to be more accessible to computer science students than the other strictly mathematical books.
> With this method in mind, the book features three major aspects:
> (1) The author develops fully functional versions of lex and yacc (tools available in the UNIX® operating system to write compilers), (2) he uses lex and yacc to develop a complete C compiler that includes parts of C that are normally left out of compiler design books (eg., the complete C "type" system, and structures), and (3) the version of yacc developed here improves on the UNIX version of yacc in two ways (error recovery and the parser, which automatically produces a window-oriented debugging environment in which the parse and value stacks are visible).
It's out of print, but the author has made a searchable PDF available on his website [1]. I found it quite useful.
Holub seems to like the "learn by doing" approach. He's got another book, "Holub on Patterns" that teaches all the design patterns from the gang of four book organically by developing two programs that together use all of those patterns. The two programs are an embedded SQL interpreter and a GUI application for Conway's Game of Life.
PS: Ooh. It occurred to me that No Starch Press books are often available on O'Reilly Learning. I checked and this one is there. So I guess it is going on my "don't need but am doing out of curiosity" pile after all.
Several "compiler light" style articles and books kind of walk over that part, and it can be non-trivial to do properly, especially with modern expectations.
I remember way back in the day, one of the early C compilers for the PDP, and, honestly, it could almost be argued that ed(1) had better error messages than what that thing produced.
A lot of simple compilers hit an error and just give up.
So, just curious what the approach was in this book.
[0] actually from the readme in the github repo[1] it seems to be a C subset, not all of C
(imagine a medieval accountant trying to learn to do long division in roman numerals. he'll be much better off learning the western arabic numerals fibonacci is so excited about)
Modern Compiler Implementation in ML: https://www.cs.princeton.edu/~appel/modern/ml/
As an undergrad student, I think the C version is kinda easier to understand, though.
A good engineer should be able to use the right tool for the job
https://www.amazon.com/Retargetable-Compiler-Design-Implemen...
Was featured here a couple of times.
Unfortunately, the timing of the release clashes with the summer holidays. I'll take a look at it next year.
https://news.ycombinator.com/item?id=40940799
So maybe you saw it then.
Each chapter of the book includes a test suite to run against the code you’ve written.
In some ways, the tests in this book feel very similar to the labs in the book Computer Systems: A Programmer's Perspective — which is high praise!
Alternatively, I would like to learn about not just how to make a compiler, but also simultaneously a debugger, hot-reloading, etc.
However, it's also boring.
Nevertheless, the contents of the book cover all the techniques required to write an assembler, if you'd really like to.
If there is some library that can help create machine code from assembly instructions on a line-by-line basis (as opposed to invoking a separate program that generates the entire binary from the assembly collectively), that could also work.
In my case, I already know enough of the lexer, parser, etc., parts. What's missing is going all the way to making a debugger, profiler, etc.
It's also "fun" if some instructions come in different sizes... and you may need stronger restrictions on allowed expressions in that case.
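A toy sketch (mine, not from the thread) of why variable-size instructions make assembly non-trivial: a short jump encoding only works if the target is within range, but widening one jump moves every later label, which can push other jumps out of range. So you iterate to a fixed point (branch relaxation). The 2-byte/5-byte sizes mimic x86 short vs. near jumps; the program format is invented for the example.

```python
def assemble(prog):
    """Toy branch relaxation: a jump is 2 bytes if its target is within
    +/-127 bytes of the next instruction, else 5 bytes. Returns the total
    size. 'prog' is a list of ('label', name), ('jmp', target), or
    ('bytes', n) entries."""
    # Optimistic initial guess: every jump is the short form.
    sizes = [2 if kind == "jmp" else 0 if kind == "label" else arg
             for kind, arg in prog]
    while True:
        # Compute label addresses under the current size guesses.
        labels, pc = {}, 0
        for (kind, arg), size in zip(prog, sizes):
            if kind == "label":
                labels[arg] = pc
            pc += size
        # Widen any short jump whose displacement no longer fits.
        changed, pc = False, 0
        for i, ((kind, arg), size) in enumerate(zip(prog, sizes)):
            if kind == "jmp":
                dist = labels[arg] - (pc + size)
                if not -128 <= dist <= 127 and size == 2:
                    sizes[i] = 5  # may push other jumps out of range
                    changed = True
            pc += size
        if not changed:
            return sum(sizes)

prog = [("jmp", "end"), ("bytes", 200), ("label", "end")]
print(assemble(prog))  # 205: jumping over 200 bytes needs the long form
```

Real assemblers also have to cope with expressions over labels, which is exactly where the "stronger restrictions on allowed expressions" mentioned above come from: an expression's value can change as sizes settle.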
https://www.linuxfromscratch.org/
"Linux From Scratch (LFS) is a project that provides you with step-by-step instructions for building your own custom Linux system, entirely from source code."