"Because Pnut doesn't support certain C features used in TCC, Pnut features a native code backend that supports a larger subset of C99. We call this compiler `pnut-exe`, and it can be compiled using `pnut.sh`. This makes it possible to compile `pnut-exe.c` using `pnut.sh`, and then compile TCC, all from a POSIX shell."
Is there anywhere we can see a step-by-step demo of this process?
Curious if the authors tried NetBSD or OpenBSD, or using another small C compiler, e.g., pcc.
Historically, tcc was problematic for NetBSD and its forks. Not sure about today, but tcc is still in NetBSD pkgsrc WIP which suggests problems remain.
- a shell is required, which has to be built from sources, using a compiler which was also built from sources using a compiler binary. That's the real bootstrap.
- even if you could pick some shell and compile it with pnut-exe, the compiled code requires interpretation by an executable shell.
- there is no such thing as a "POSIX compliant shell"; that's an abstract category. All this amounts to is a promise that pnut.sh will not generate code that uses non-POSIX features.
open(..., O_RDWR | O_EXCL) -> runtime error, "echo "Unknow file mode" ; exit 1"
lseek(fd, 1, SEEK_HOLE); -> invalid code (uses undefined _lseek)
socket(AF_UNIX, SOCK_STREAM, 0); -> same (uses undefined _socket)
looking closer at "cp" and "cat" examples, write() call does not handle errors at all. Forget about partial writes, it does not even return -1 on failures.
"Compiler you can Trust", indeed... maybe you can trust it to get all the details wrong?
Otherwise the builtins seem to be here: https://github.com/udem-dlteam/pnut/blob/main/runtime.sh
FYI all your functions are not "C functions", but rather POSIX functions. I did not expect it to be complete, but it's still impressive for what it is.
I don't remember there being a way to keep a server listening on a /dev/tcp/$ip/$port port, for sockets from shell scripts with shellcheck at least
I think the pitch here is that it can compile TCC which can then compile GCC which makes it much more difficult for a backdoor to survive potentially, especially if the shell code is easier to read and verify than the corresponding assembly.
Within that context, an incomplete libc is irrelevant.
A C to shell compiler might seem impractical, but you know what is even more impractical? Having a separate language for a build system. And yet, here we are. Using Shell, Make or CMake to build a C program is only acceptable because it has always been so. It's a "perceived normality" in the C world.
There is no good reason, however, CMake isn't a C library. With build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the buildable. We already have includeOS, why not includeMake?
Nah, using shell, make or cmake is acceptable because C is obviously a terrible language for doing things. (Those languages are also all terrible, but not quite as terrible as C).
> There is no good reason, however, CMake isn't a C library.
Isn't it the other way round? There's no good reason people write programs in C rather than CMake.
> With build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the buildable.
Which is to say, with extreme difficulty?
Like, I agree with where you're coming from, it is absolutely a damning indictment of C that people don't want to express their builds in it. But writing a build in C really would be terrible.
What Pnut shows us is that the language itself is a very thin construct. C could be as low-level as you want, but it can also... compile to shell. Pnut shows that C is only a set of grammatical rules, and the source code in C doesn't necessarily reflect the binary program, it's only a script for the C compiler. A compiler then decides how to interpret the source and what to do with it.
Now back to builds. The difference between:
set(SOME_VARIABLE "SOME VALUE")
and set(SOME_VARIABLE, "SOME VALUE");
is purely grammatical. The underlying functionality is the same. When I say CMake could be a C library, I'm not saying we should ditch CMake and everything it brings to the table and start writing build scripts in pure C. I'm saying we can use both the C language and CMake functionality with very small, skin-deep adjustments.
The only thing that keeps us down is the perception of C as a low-level language for low-level applications. C is for drivers and shell is for moving files around. And that's when Pnut comes up and asks us: "hold on, are they?"
I disagree. For a very simple example it really makes life easier to not have to care about quoting filenames in build systems and just list a.c b.cpp etc., while you really want strings to be quoted in normal programming languages. Build systems that tried to be based on syntax of existing PLs (for instance Meson, QBS) are a real PITA for me when compared to CMake due to a lot of such affordances.
Why is it you think that?
https://25thandClement.com/~william/2023/base64.sh
If this project had existed I might have opted to compile my C-based base-64 encoder and decoder routines, suitably tweaked for pnut's limitations.
I say base64.sh is mostly pure not because it relies on shell extensions, but because the only non-builtins it depends on are od(1) or, alternatively, dd(1) to assist with binary I/O. And preferably od(1), as reading certain control characters, like NUL, into a shell variable is especially dubious. The encoder is designed to operate on a stream of decimal encoded bytes. (See decimals_fast for using od to encode stdin to decimals, and decimals_slow for using dd for the same.)
It looks like pnut uses `read -r` for reading input. In addition to NULs and related raw byte issues, I was worried about chunking issues (e.g. truncation or errors) on binary data, e.g. no newlines within LINE_BUF bytes. Have you tested binary I/O much? Relatedly, how many different shell implementations have you tested your core scheme with? In addition to bash, dash, and various incarnations of /bin/sh on the BSDs, I also tested base64.sh with Solaris' system shells (ksh88 and ksh93 derivatives), as well as AIX's (ksh88 derivative). AIX had some odd quirks with pipelines even with plain text I/O. (Unfortunately Polar Home is gone, now, so I have no easy way to play with AIX; maybe that's for the better.)
https://github.com/udem-dlteam/pnut/blob/main/examples/compiled/base64.sh
It doesn't support NULs as you pointed out, but it's interesting to see similarities between your implementation and the one generated by Pnut.
Because we use `read -r`, we haven't tested reading binary files. Fortunately, the shell's `printf` function can emit all 256 characters so Pnut can at least output binary files. This makes it possible for Pnut to have an x86 backend for the use of reproducible builds.
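To illustrate the `printf` point, here is a minimal sketch of writing an arbitrary byte from POSIX shell through an octal escape. The helper name `put_byte` is made up for this example, and Pnut's actual runtime may do this differently.

```shell
#!/bin/sh
# put_byte N: write the single byte with decimal value N (0-255) to stdout.
# Hypothetical helper for illustration; not taken from Pnut's runtime.
put_byte() {
  # Build a three-digit octal escape such as \110 and let printf decode it.
  printf "\\$(($1 / 64))$(($1 / 8 % 8))$(($1 % 8))"
}

put_byte 72; put_byte 105; put_byte 10   # writes "Hi" followed by a newline
```

Only the `printf` builtin and arithmetic expansion are used, so no external utility is needed to produce binary output.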
Regarding the use of `read`, one constraint we set ourselves when writing Pnut is to not use any external utilities, including those that are specified by the POSIX standard (other than `read` and `printf`). This maximizes portability of the code generated by Pnut and is enough for the reproducible build use case.
We're still looking for ways to integrate existing shell code with C. One way this can be done is through the use of the `#include_shell` directive which includes existing shell code in the generated shell script. This makes it possible to call the necessary utilities to read raw bytes without having Pnut itself depend on less portable utilities.
I'd choose a different example to showcase pnut.
That's correct! Unlike Bash and other modern shells, the POSIX standard doesn't include arrays or any other data structures. The way we found around this limitation is to use arithmetic expansion and indexed shell variables (that start with `_` as you noted) to get random memory access.
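A rough sketch of that idea, for readers who haven't seen it: each "memory cell" is a separate shell variable named `_<address>`, and the address is spliced into the variable name before the arithmetic expansion is evaluated. The helper names `poke` and `peek` and the base address are assumptions for this example, not Pnut's runtime API.

```shell
#!/bin/sh
# Emulate an int array with one shell variable per cell, named _<address>.
# poke/peek are hypothetical names used only for this illustration.

poke() { # poke ADDR VALUE: store VALUE at "address" ADDR
  : $((_$1 = $2))
}

peek() { # peek ADDR: print the value stored at ADDR
  echo $((_$1))
}

base=100                     # arbitrary start address of the "array"
i=0
while [ $i -lt 5 ]; do
  poke $((base + i)) $((i * i))   # array[i] = i * i
  i=$((i + 1))
done

peek $((base + 3))           # prints 9
```

Because the values are always integers, the expansions are left unquoted here, matching the style of the generated code quoted elsewhere in this thread.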
Maybe then I can also interest you in an exception handler for DOS batch scripts:
Amber: Programming language compiled to Bash https://news.ycombinator.com/item?id=40431835 (318 comments)
---
Pnut doesn't seem to differentiate between `int' and `int*' function parameters. That's weird, and doesn't come across as trustworthy at all! Shouldn't the use of pointers be disallowed instead?
  int test1(int a, int len) {
    return a;
  }
  int test2(int* a, int len) {
    return a;
  }
Both compile to the exact same thing:
  : $((len = a = 0))
  _test1() { let a $2; let len $3
    : $(($1 = a))
    endlet $1 len a
  }
  : $((len = a = 0))
  _test2() { let a $2; let len $3
    : $(($1 = a))
    endlet $1 len a
  }
The "runtime library" portion at the bottom of every script is nigh unreadable.
Even still, it's a cool concept.
Is there a plan to remove such limitations?
edit: For reference, someone's take on building out better bash-like array functionality in posix shell: https://github.com/friendly-bits/POSIX-arrays (there's only very rudimentary array support built-in to posix sh, basically working with stuff in $@ using set -- arg1 arg2..)
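For anyone unfamiliar with the `set --` trick mentioned above, here is a small sketch of using the positional parameters as the one list POSIX sh gives you. The `nth` helper is hypothetical and assumes its index is a trusted positive integer, since it goes through `eval`.

```shell
#!/bin/sh
# nth N ITEM...: print the N-th item (1-based) of the remaining arguments.
# Hypothetical helper; N is assumed to be a trusted positive integer.
nth() {
  n=$1
  shift
  eval "printf '%s\n' \"\${$n}\""
}

set -- apple "banana split" cherry   # "$@" is now a three-element list
set -- "$@" date                      # append an element
echo "$#"                             # element count: 4
nth 2 "$@"                            # prints: banana split
```

There is only one such list per function or script scope, which is part of why projects like the one linked above, or Pnut's indexed variables, exist.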
ksh93: 31s
dash: 1m06s
bash: 1m19s
zsh: >15m
[0]: https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/...
EDIT: ksh93, not ksh
Autoconf is a perl program that turns (heavily customized) m4 files into shell scripts. How does a C compiler help there?
Oof, did not realize.
We would benefit from steering away from auto-generated scripts. Autoconf included.
Even POSIX standard ones. Chokes on:
  #include <glob.h>
  int main() // must be (); (void) results in syntax error.
  {
    glob_t gb; // syntax error here
    glob("abc", 0, NULL, &gb);
    return 0;
  }
Nobody needs entirely self-contained C programs with no libraries to be turned into shell scripts; Unix people switch to C when there is a library function they need to call for which there is no command in /bin or /usr/bin.
If I reduce it to:
  #include <glob.h>
  int main()
  {
    glob("abc", 0, NULL, 0);
    return 0;
  }
it "compiles" into something with a main function like:
  _main() {
    defstr __str_0 "abc"
    _glob __ $__str_0 0 $_NULL 0
    : $(($1 = 0))
  }
but what good is that without a definition of _glob?
Quite frankly I think Bash scripting is awful and frequently wish shell scripts were written in a real and debuggable language. For anything non-trivial that is.
I feel like I’d rather write C and compile it with Cosmopolitan C to give me a cross-platform binary than this.
Neat project. Definitely clever. But it’s headed in the opposite direction from what I’d prefer...
The programmer, who was very proud of his mastery of C, said: “How can this be? C is the language in which the very kernel of Unix is implemented!”
Master Foo replied: “That is so. Nevertheless, there is more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer grew distressed. “But through the C language we experience the enlightenment of the Patriarch Ritchie! We become as one with the operating system and the machine, reaping matchless performance!”
Master Foo replied: “All that you say is true. But there is still more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer scoffed at Master Foo and rose to depart. But Master Foo nodded to his student Nubi, who wrote a line of shell script on a nearby whiteboard, and said: “Master programmer, consider this pipeline. Implemented in pure C, would it not span ten thousand lines?”
The programmer muttered through his beard, contemplating what Nubi had written. Finally he agreed that it was so.
“And how many hours would you require to implement and debug that C program?” asked Nubi.
“Many,” admitted the visiting programmer. “But only a fool would spend the time to do that when so many more worthy tasks await him.”
“And who better understands the Unix-nature?” Master Foo asked. “Is it he who writes the ten thousand lines, or he who, perceiving the emptiness of the task, gains merit by not coding?”
Upon hearing this, the programmer was enlightened.
Master Foo is shorthand for Fool.
GNU Mes: https://www.gnu.org/software/mes/
Stage0: https://bootstrapping.miraheze.org/wiki/Stage0
Ribbit (same authors): https://github.com/udem-dlteam/ribbit
stage0-posix: https://github.com/oriansj/stage0-posix
Bootstrappable Builds: https://bootstrappable.org/
See also this LWN article about bootstrappable and reproducible builds: https://lwn.net/Articles/841797/ It contains a plethora of interesting links.
I.e., you can take your compiled.sh and run it on an obscure processor with an obscure OS; as long as it's POSIX, it should work...
I suppose the trust moves to the shell executable then, but at least you could run the bootstrapping with multiple shells and expect identical output.
because Bash goes brrrr
You can:
> After nearly one year of development, I'm pleased to announce our version 3.0 release of the Cosmopolitan library. [...] we invented a new linker that lets you build fat binaries which can run on these platforms: AMD ... ARM64
https://github.com/jart/cosmopolitan/releases/tag/3.5.3
> This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. See 78d3b86 for further details. Thank you @aj47 (techfren.net) for bug reports and and testing efforts.
In examples/compiled/cat.sh line 7:
: $((_$__ALLOC = $2)) # Track object size
^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
^-----------------^ SC2046 (warning): Quote this to prevent word splitting.
^--------------^ SC2205 (warning): (..) is a subshell. Did you mean [ .. ], a test expression?
^-- SC2283 (error): Remove spaces around = to assign (or use [ ] to compare, or quote '=' if literal).
^-- SC2086 (info): Double quote to prevent globbing and word splitting.
It seems to be parsing the arithmetic expansion as a command substitution, which then causes the analyzer to produce errors that aren't relevant. ShellCheck's own documentation[0] mentions this in the exceptions section, and the code is generated such that quoting and word splitting are not an issue (because variables never contain whitespace or special characters).
It also warns about `let` being undefined in POSIX shell, but `let` is defined in the shell script so it's a false positive that's caused by the use of the `let` keyword specifically.
If you think there are other issues or ways to improve Pnut's compatibility with Shellcheck, please let us know!
The C to shell transpiler I'm aware of will output unreadable code (elvm using 8cc with sh backend)
Trying to compile with this tool fails with "comp_glo_decl: unexpected declaration"
The `sum` example doesn't seem to do wrapping, but signed int overflow is technically UB so I guess they're fine not to.
Switching it to `unsigned int` gives me:
code.c:1:1 syntax error: unsupported type
  int why(int unused) {
    wat_why_does_this_compile;
    no_error_checking();
  }
Because all shell variables in code generated by pnut are numbers, variables never contain whitespace or special characters and don't need to be quoted. We considered quoting all variable expansions as this is generally seen as best practice in shell programming, but thought it hurt readability and decided not to.
If you think there are other issues, please let me know!
Super neat project, btw!