Stage0 – A set of minimal C compiler bootstrap binaries

(github.com)

200 points · z29LiTp5qUC30n · 7y ago · 68 comments

https://github.com/oriansj/stage0/blob/master/Linux%20Bootstrap/x86/cc_x86.M1 It also appears to be bootstrapped from hex

48 comments · 16 top-level

Dylan16807 · 7y ago · 5 in thread

Does it support all of C?

Is this actually smaller than TCC? It's kind of hard to tell since TCC is split into files and I don't know what's actually necessary. And that question includes earlier, less featureful versions of TCC. Note that the more-or-less-C compiler it's based off of is absolutely minuscule: https://bellard.org/otcc/

oriansj · 7y ago

Is it really fair to do a size comparison between a statically linked C compiler written in Assembly and designed for readability with a dynamically linked binary which has none of those things and was designed to win the International Obfuscated C Code Contest?

Also, once you add up the binaries pulled in via dynamic linking, you are looking at considerably more: /lib/x86_64-linux-gnu/libc.so.6 alone is 16MB in size.

Did I mention it works on all x86+amd64 POSIX systems too?

Dylan16807 · 7y ago

The un-obfuscated version is commented and readable and still only ~15KB of source code.

It uses calloc, fopen, and fgetc, which can trivially be forwarded to the same syscalls as stage0 with only a few bytes. It uses isalnum, isspace, isdigit, strcpy, and strstr, which are all trivial. Add all that up and maybe you need to penalize it by a kilobyte. I'm not sure exactly how it's using dlsym but it seems to also be worth a few bytes. It's not using the other 16MB of libc.
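To give a feel for how little logic those library helpers need, here is a rough sketch of the classification and search routines. It is written in Python purely for illustration (the real ones are C functions working on char values and pointers), and the names `is_digit`, `is_space`, `is_alnum` and `str_str` are made up for this example, not taken from OTCC or stage0.

```python
def is_digit(ch):
    """Rough equivalent of C's isdigit for a single ASCII character."""
    return "0" <= ch <= "9"


def is_space(ch):
    """Rough equivalent of C's isspace in the default "C" locale."""
    return ch in " \t\n\v\f\r"


def is_alnum(ch):
    """Rough equivalent of C's isalnum for a single ASCII character."""
    return is_digit(ch) or "a" <= ch <= "z" or "A" <= ch <= "Z"


def str_str(haystack, needle):
    """Index of the first occurrence of needle, or -1 (C returns a pointer or NULL)."""
    for start in range(len(haystack) - len(needle) + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1


print(is_alnum("Z"), is_space("\t"), str_str("bootstrap", "strap"))
```

Each of these is a comparison or a short loop, which is the sense in which they add only a small, fixed amount to the size estimate.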

derefr · 7y ago

I imagine that it doesn’t need to support all of C, so much as it needs to support enough of C to compile a more featureful C compiler.

Which makes for a pertinent question: can you compile GCC or Clang with this?

z29LiTp5qUC30n · OP · 7y ago

It is less than 15KB in size, is written in assembly, and supports structs, unions, inline assembly, gotos, breaks, and function pointers. Thus it supports more than otcc supported, in a fraction of the size.

Dylan16807 · 7y ago

The file you linked is 200KB.

OTCC is <4KB. It's admittedly using tiny names to squeeze under a size limit, but even when you correct for that it's a lot smaller.

(Edit: At the time of submission the link went to https://github.com/oriansj/stage0/blob/master/Linux%20Bootst... with a description of "World's Smallest C Compiler")

logicallee · 7y ago · 5 in thread

What special limits can this limbo under? 15 kb is so small.

It's even smaller than L1 cache. (In case you wanted to brute-force every possible text file you could feed this binary, and wanted to do it all in cache on the CPU.)

Maybe it's the size of payload you can put into a usb C cable or something.

I mean it's just so freaking small. Any ideas what limits this is "small enough" to fit under?

Dylan16807 · 7y ago

> Maybe it's the size of payload you can put into a usb C cable or something.

The part you grab on a typical usb C cable is just about the size of a microsd card. You might have to use a slightly narrower chip, but you can also go many times thicker. So if you're sneaking storage into a usb C cable think "terabyte".

(Unless you mean hijacking an existing chip, which might have 0 writable storage or might have a megabyte, who knows.)

logicallee · 7y ago

yes I meant the latter. maybe that's all a malicious agent has for usable payload for whatever reason. I really had to think hard to come up with that, it's not meant to be realistic.

roywiggins · 7y ago

It would fit inside the MERCIA relay computer's ROM. http://www.relaiscomputer.nl/index.php/memory

It could comfortably fit inside the Apollo lander computer's RAM three times over.

You can easily build a ROM this big in Minecraft. https://www.youtube.com/watch?v=e4TXjhZLHpw

You could encode it into a relatively short Twitter thread https://qntm.org/twitcodings

logicallee · 7y ago

but none of these come close to a real world use case. like, where does this limbo bar setting come from? who chose 15k and why?

zik · 7y ago

It'd fit in an STM32 microcontroller's 64K RAM comfortably, with room to spare for some actual programs. That's actually a pretty useful application. Except I think it targets x86 rather than ARM.

rain1 · 7y ago · 4 in thread

of course this isn't the smallest C compiler, it's 5000 lines of code. There is a 500 line C compiler written in C.

But perhaps it's the smallest if we count transitive dependencies! (by this I mean count the lines of code of the program, and every program you need to build the program and everything you need to build them and so on)
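Counting transitive dependencies in this sense amounts to a walk over the build graph. A minimal sketch follows; the tool names and line counts are invented placeholders, not measurements of stage0 or of any real toolchain.

```python
# Hypothetical build graph: tool -> (its own line count, tools needed to build it).
# Every name and number here is made up for illustration only.
BUILD_GRAPH = {
    "tiny_cc": (5000, ["assembler"]),
    "assembler": (800, ["hex_loader"]),
    "hex_loader": (100, []),
}


def transitive_lines(tool, graph):
    """Sum the lines of `tool` plus everything needed to build it, each counted once."""
    seen = set()
    stack = [tool]
    total = 0
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # a shared dependency is only counted the first time
        seen.add(name)
        lines, deps = graph[name]
        total += lines
        stack.extend(deps)
    return total


print(transitive_lines("tiny_cc", BUILD_GRAPH))  # 5000 + 800 + 100 = 5900
```

By this measure a 500-line compiler that needs an existing C compiler to build would be charged for that compiler (and whatever built it) as well.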

reidrac · 7y ago

"[..]only have the goal of creating a bootstrapping path to a C compiler capable of compiling GCC, with only the explicit requirement of a single 1 KByte binary or less."

I believe the "size" refers to the binary size of the compiler.

sctb · 7y ago

We've updated the title from the editorialized “World's Smallest C Compiler”.

rain1 · 7y ago

Why are people downvoting this comment?

kragen · 7y ago

Because Hacker News is largely populated by "finance-obsessed man-children and brogrammers", in JWZ's memorable phrase, whose votes it unfortunately weighs as highly as yours, mine, or oriansj's.

0xffff2 · 7y ago · 3 in thread

>Additionally, all code must be able to be understood by 70% of the population of programmers. If the code can not be understood by that volume, it needs to be altered until it satisfies the above requirement.

How is this measured/qualified? These days, I would doubt that 70% of people whose primary job is to write code have any knowledge of assembly whatsoever, so a naive reading of the above paragraph seems unlikely to succeed.

stcredzero · 7y ago

> These days, I would doubt that 70% of people whose primary job is to write code have any knowledge of assembly whatsoever

If pressed, I suspect most of the programmers in the world could read assembly. They might hate it, but they could do it, if given sufficient motivation. Simplified assembly used to be written as a game.

https://www.corewars.org/index.html

> so a naive reading of the above paragraph seems unlikely to succeed.

How naive are you going here? Turn it into a contest, where the versions of the code contain backdoors, and contestants are ranked by how quickly and accurately they can identify them. Arrange for cash prizes, and you'd have your determination.

0xffff2 · 7y ago

>If pressed, I suspect most of the programmers in the world could read assembly. They might hate it, but they could do it, if given sufficient motivation. Simplified assembly used to be written as a game.

I strongly disagree. I work with a lot of very smart people. These people are PhD researchers at the top of their field, working on cutting-edge algorithmic development. Their work is primarily mathematical, but they write enough MATLAB to prove that what they're developing really works. They write enough code that I think most people would consider them "programmers", yet they absolutely do not understand C++, much less assembly. As I said, these are very smart people; they could certainly learn, but they have no motivation to do so. It's not their job to understand all of the details of how a computer works (it's mine, more or less).

>How naive are you going here? Turn it into a contest, where the versions of the code contain backdoors, and contestants are ranked by how quickly and accurately they can identify them. Arrange for cash prizes, and you'd have your determination.

I didn't really mean naive in the sense of the people, but in the intent of the author. When I read "70% of the population of programmers", I think of myself and my ~6 coworkers who spend ~50+% of their time manipulating code. That's the simplest (i.e. naive) definition of "programmer" that I can come up with. If the author intended a different definition (like "people who claim to understand any assembly language"), then that definition might exclude my coworkers, making the goal a lot more achievable.

For the naive definition, only 14% (1/7) of my group have any chance of understanding this project. I think you could find a lot of front-end web focused groups where the percentage was much lower than that, and at this point I think those groups far outnumber the embedded systems groups where the number would be much closer to 100%.

BanazirGalbasi · 7y ago

I wouldn't be surprised if none of the people who learn to code through bootcamps or self-taught modules can read assembly. Most of that kind of stuff teaches web development and not much else, so any code lower on the stack is mostly unfamiliar territory.

kragen · 7y ago · 3 in thread

What oriansj (as well as rain1 and others) are doing is both very impressive and important. The objective is to get us to bootstrappability and, for example, escape Trusting-Trust attacks; one reason it's profoundly important is the long-term archival problem. Media longevity is one crucial part of archival, but as we were discussing today in https://news.ycombinator.com/item?id=20272557, there are plausible solutions to that problem.

Interpretability is another part of the problem: even if we recovered an executable copy of Ivan Sutherland's historically groundbreaking program SKETCHPAD, for example, we wouldn't be able to run it because we don't know the instruction set for the computer it was built for. Remember that the entire body of knowledge about Ancient Egyptian culture was lost in the 5th Century, when the Christian Dark Age closed the temples, and not regained for almost 1400 years — and then only due to the great good fortune of the Rosetta Stone.

A bootstrappable computing stack is a crucial part of the "Rosetta Stone" that will be needed to preserve 21st-century knowledge. One of the few papers tackling the interpretability problem in this form is http://www.vpri.org/pdf/tr2015004_cuneiform.pdf, "The Cuneiform Tablets of 2015", by Long Tien Nguyen and Alan Kay.

There's a more immediate necessity, though. As recent events make clear — Chrome's extension API kneecapping ad-blockers, the increasing effectiveness of Chinese censorship, and the shocking US$12 million award to Nintendo last November against ROM site operators for preserving classic video games, for example — the current political and economic system cannot be trusted to preserve our access to our cultural heritage, even during our lifetimes. That means that we need an autonomously-bootstrappable trustworthy free-software infrastructure that is viable without the massive economies of scale that fund mainstream platforms like Linux, Android, MacOS, Chrome, and even Firefox. If your personal archive of the Tank Man photo, the Arab Spring tweets, or the video of the murder of Philando Castile runs afoul of future malicious-content filters integrated into your operating system, there is no guarantee that it, or you, will survive.

So we're doing our best to get some green shoots established before the situation has any opportunity to get worse.

rain1 · 7y ago

If you asked me what the 4 best documents regarding bootstrapping are, I'd say:

* Egg of the Phoenix (Blog post) - http://canonical.org/~kragen/eotf/

* The Cuneiform Tablets of 2015 (Blue-sky academic research) - http://www.vpri.org/pdf/tr2015004_cuneiform.pdf

* Preventing The Collapse of Civilization (Video) - https://www.youtube.com/watch?v=pW-SOdj4Kkk

* Coding Machines (scifi story about trusting-trust attack) - https://www.teamten.com/lawrence/writings/coding-machines/

kragen · 7y ago

This is awesome! Thank you! And thank you for the flattering reference to my own thought experiment there.

ahazred8ta · 6y ago

5000 years after its introduction, cuneiform is still a wedge issue. ... :P

zachrose · 7y ago · 3 in thread

Does this have any implications for trusting trust?

sansnomme · 7y ago

It just means that your compiler is theoretically safe if you build an assembler manually and verify it. This can be done by printing the assembly of the assembler (an assembler is just a glorified find and replace if you leave out optimizations) and then translating it into machine zeros and ones by hand (see the Intel Reference Manual for more details on the translation table). To speed this up, distribute the typing input (e.g. Mechanical Turk, minimum wage clerks, etc.) and compare the results from multiple sources to ensure accuracy. Once you're confident that your machine code translation is an accurate representation of your assembler, run the assembler on Stage0 and the bootstrap process should take care of itself.
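The cross-checking step can be sketched as a per-position majority vote over independent transcriptions. This is only an illustration of the comparison logic: it assumes the typists entered hex digits rather than raw zeros and ones, the function names and sample digits are invented, and running it of course presumes a Python you already trust.

```python
from collections import Counter


def majority_transcription(transcriptions):
    """Pick the most common character at each position across independent transcriptions."""
    lengths = {len(t) for t in transcriptions}
    if len(lengths) != 1:
        raise ValueError("transcriptions differ in length; re-check them by hand")
    result = []
    for position, column in enumerate(zip(*transcriptions)):
        value, votes = Counter(column).most_common(1)[0]
        if votes * 2 <= len(column):
            raise ValueError(f"no strict majority at position {position}")
        result.append(value)
    return "".join(result)


def hex_to_bytes(hex_text):
    """Turn a string of hex digits into the raw bytes it denotes."""
    return bytes.fromhex(hex_text)


# Three hypothetical typists; the second one mistyped the final digit.
typed = ["b801000000cd80", "b801000000cd88", "b801000000cd80"]
agreed = majority_transcription(typed)
machine_code = hex_to_bytes(agreed)
print(agreed, len(machine_code))  # prints: b801000000cd80 7
```

Positions where no strict majority exists raise an error instead of guessing, so a human has to go back and look at exactly those digits.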

Gaelan · 7y ago

How do you get those zeroes and ones into a machine-readable format, without a trusted text editor or OS?

cestith · 7y ago

Yes, but only when considering above the level of a hidden machine monitor built into your processor. It could have huge implications for the compiler, libraries, and OS - even firmware. But you're still running on hardware.

bloak · 7y ago · 2 in thread

The trouble with bootstrapping GCC is that it requires flex/lex. Have you tried bootstrapping flex/lex?

It sounds stupid, I know, but when I investigated possible paths for getting from zero to GCC, flex looked like the biggest potential obstacle.

There are also C files in GCC/binutils that are generated by complex shell scripts. That was perhaps the second biggest obstacle.

(If I recall correctly, bison's not a problem: old versions of bison don't use bison.)

giomasce · 7y ago

My project https://gitlab.com/giomasce/nbs is currently able to more or less bootstrap flex, starting from only tcc and musl (within a running Linux kernel). Terms and conditions may apply: I haven't tested the generated binary much yet, and given the tricks I have to do to get there, miscompilations and introduced bugs are anything but impossible. But at least ideally it should be possible to iron them out without changing the whole path.

Bison is my next target, but I haven't been able to work much on it lately. It's true that it should be possible to bootstrap it by history, although I really hope that one does not need too many steps to get to the latest release.

bloak · 7y ago

That's very interesting! Could you summarise the relevant steps? It appears that you're using "The Heirloom Project", which I hadn't heard of. (Is it related to Illumos?) So the Heirloom lex can be built without lex/yacc, and then you can build which version of flex using that?

agumonkey · 7y ago · 2 in thread

Oh so this is similar to kragen 'basement experiment' .. kudos

ahazred8ta · 7y ago

https://github.com/kragen/stoneknifeforth is 114 non-comment lines of code, compiles to 4063 bytes. (he mentions it's about half the size of the otccelf tinyc)

kragen · 7y ago

Yeah, SKF could definitely be improved. I'm excited about stage0!

kaushalmodi · 7y ago · 2 in thread

So cool to see Org mode READMEs in projects unrelated to Emacs!

I hope Github improves Org mode support at some point.

svnpenn · 7y ago

I like Org mode, but the syntax highlighting support is not great:

https://github.com/wallyqs/org-ruby/issues/64

I have been looking at AsciiDoc:

https://asciidoctor.org/docs/asciidoc-syntax-quick-reference

kaushalmodi · 7y ago

> I like Org mode, but the syntax highlighting support is not great

I know. That's why I said "I hope Github improves Org mode support at some point." :)

> I have been looking at AsciiDoc

If you like Org mode, you can overcome the Github/etc. Org mode rendering limitations by hosting small static sites for your projects. You may choose to do so by simply exporting Org to HTML using ox-html[1], or even exporting Org to markdown for Hugo (disclaimer: my package -> ox-hugo)[2].

I mean, if you like using Org mode, and Github et al don't see the value in improving support for that, don't use them to render Org mode docs :)

[1]: https://eless.scripter.co/

[2]: https://ox-hugo.scripter.co/

rgoulter · 7y ago · 1 in thread

> This is a set of manually created hex programs in a Cthulhu Path to madness fashion. Which only have the goal of creating a bootstrapping path to a C compiler capable of compiling GCC, with only the explicit requirement of a single 1 KByte binary or less.

What a wonderfully mad, cool goal.

userbinator · 7y ago

I like to think of exercises like this as a "bootstrap pilgrimage", and I hope the phrase catches on.

edwintorok · 7y ago · 1 in thread

Does this have similar goals to https://www.gnu.org/software/mes/?

shakna · 7y ago

> Mes is inspired by The Maxwell Equations of Software: LISP-1.5 – John McCarthy page 13, GNU Guix's source/binary packaging transparency and Jeremiah Orians's stage0 ~500 byte self-hosting hex assembler.

stage0 is one of mes' inspirations, so I'd say there's a level of connection there.

Apparently mes is also working towards being able to be compiled by M2-Planet by the same author [0], at which point it might be possible to eventually build mes from stage0 as it can now bootstrap M2-Planet.

[0] https://github.com/oriansj/mes-m2

sittingnut · 7y ago · 1 in thread

why link to github when main repository is in savannah.gnu.org?

shakna · 7y ago

> pull requests can be made at https://github.com/oriansj/stage0 and https://gitlab.com/janneke/stage0 or patches/diffs can be sent via email to Jeremiah (at) pdp10 [dot] guru or join us on freenode’s #bootstrappable

The main contributing points seem to be github and gitlab, rather than savannah.

sly010 · 7y ago

Do the instructions in the README count as part of the program?

If not, then the smallest binary to bootstrap a C compiler is actually a single jump to a C compiler in memory, with a README containing the memory dump to be typed in :)

Seriously though, it reminds me of the Toaster Project [0] where an RCA student attempted to build a modern toaster without using the modern supply chain.

[0] http://www.thetoasterproject.org/

cylinder714 · 7y ago

See also Edmund Grimley Evans' bcompiler, mirrored at https://github.com/certik/bcompiler

On a related note, one can write arbitrary bytes to a file (on a Unix-like system) using GNU 'echo' or the 'printf' utilities. Chris Wellons' post "A Magnetized Needle and a Steady Hand" describes how to write a basic utility in this way.
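As a rough illustration of the `printf` approach, the helper below turns a byte string into a shell command built from three-digit octal escapes, which POSIX `printf` accepts in its format string. The function name `printf_command` is made up for this sketch, and the output path is assumed to be a simple filename that needs no shell quoting.

```python
def printf_command(data, path):
    """Build a shell `printf` command that writes `data` to `path` byte for byte."""
    # Every byte becomes a three-digit octal escape, so characters such as '%'
    # or a quote never reach printf unescaped.
    escapes = "".join(f"\\{byte:03o}" for byte in data)
    return f"printf '{escapes}' > {path}"


# The four-byte ELF magic number, used here only as a recognisable example.
print(printf_command(b"\x7fELF", "out.bin"))
# prints: printf '\177\105\114\106' > out.bin
```

Octal is used rather than `\x` escapes because hexadecimal escapes in `printf` are an extension and not available in every shell.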

tomcam · 7y ago

Read this code if you think you don’t understand assembly. A wonderful project, very clearly written.

kjhughes · 7y ago

Clickable link: https://github.com/oriansj/stage0/blob/master/Linux%20Bootst...
