They haven't reached inter-procedural static analysis yet, which means they can't solve the big problem: how big is an array? Most of the troubles in C come from that. Whoever creates the array knows how big it is. Everybody else is guessing.
A bit of machine learning might help here. If you see
void dosomethingwitharray(int arr[], size_t n) {}
a good conjecture is that n is the length of arr. So, the question is, if
this is translated to fn dosomethingwitharray(arr: &[i64]) {}
does it break anything? Both caller and callee have to be analyzed. The C caller
has the constraint assert_eq!(arr.len(), n);
That's a proof goal. If a simple SMT-type prover can show it holds, the call can be simplified to use an ordinary Rust slice. If not, conversion to Rust has to drop to those ugly C pointer forms, preferably with a comment inserted. So you need something that makes good guesses, which is a large language model kind of thing, and something that checks them, which is a formalism kind of thing. The process can be assisted by putting asserts in the original C, as checks on the C and hints to the conversion process. That's probably the cleanest way to provide human assistance.
I've wanted this for conversion of OpenJPEG code to Rust. That's a tangle of code doing wavelet transforms, with long blocks of touchy subscripting and arithmetic, plus encoders and decoders for an overly complex binary format containing offsets and lengths. Someone recently ran it through c2rust. The unsafe Rust code works. It's compatible with the original C - it segfaults for the same test cases which cause the C code to segfault. This is why a naive transpiler isn't too helpful.
(The date at the bottom of the article is 2022-06-13. Has there been further progress?)
As an old osdev currently enjoying using rust instead, I would say I wish.
n might be the length of arr. It might also be the number of elements of (implicit) type t that would fit in the unsigned char array arr. It might be the length of the array minus space for a trailing char (either minus one, or minus sizeof(char) bytes). Or it could be the size plus one, because why not.
Using something like GPT-4 on this problem is promising. It's probably going to be right most of the time, and its errors can be caught by the next phase of the analysis. That's about what you'd get if you put junior programmers on language conversion.
The date was wrong; sorry, my mistake. The article reflects progress as of early January 2023. We're actively working on the lifting feature and will post a follow-up once the tooling is sufficiently mature to be tested by the community.
The article links to their github repo:
https://github.com/immunant/c2rust
There are commits within the last hour, so at least some sign of life.
probably the abi if nothing else?
It was easy to get the Rust code compiled and working as a drop-in replacement for the C library. This has been a big help with refactoring the unsafe Rust code into safe Rust (manual work). OpenJpeg has a great test suite that has allowed checking that each refactor step doesn't add new bugs (which has happened at least 3 times).
The original run of c2rust generated 96,842 lines of Rust code (about 1 year ago); now it is down to 46,873 lines of code. A lot of the extra 50k lines were from C macros that got expanded and from constant lookup tables (the C code had 10-30 values per line, the Rust code one value per line).
For anyone looking to use c2rust to port C code to Rust, I recommend the following:
1. Setup some automated testing if it doesn't exist already.
2. Do refactoring in small amounts, run the tests and commit the changes before doing more refactoring.
3. Use "search/replace" tools (`sed`) to help with rewriting common patterns. Make sure to follow #2 when doing this.
4. Don't re-organize the code until after most of the unsafe code has been rewritten. This will allow easier side-by-side comparison with the original C code.
5. c2rust expands macros and constants from `#define`. Being able to do a side-by-side comparison with the C code will help with adding constants back in and replacing expanded code with Rust macros or just normal Rust functions.
[0] https://github.com/Neopallium/openjpeg/tree/master/openjp2-r...

pub fn insertion_sort(n: i32, p: &mut [i32]) {
    for i in 1..n as usize {
        let tmp = p[i];
        let mut j = i;
        while j > 0 && p[j - 1] > tmp {
            p[j] = p[j - 1];
            j -= 1;
        }
        p[j] = tmp;
    }
}

fn main() {
    let mut arr1: [i32; 3] = [1, 3, 2];
    insertion_sort(3, &mut arr1);
    // …
}
I guess if this actually works, we can translate massive amounts of internal C libraries into human-readable Rust... good stuff. (Funnily enough, passing in the "original" code without the `unsafe extern "C"` part makes it produce the exact same output as the above.)
Who says the idiomatic translation here is not .sort()? It should use the stdlib.
In general translations should stay as close to the original as possible, while eliminating any possibility of segfaults.
Would love to see a technical write up of someone outside Immunant using this on a real world codebase for whatever purpose.
I think this is your problem; to my understanding, that's not really the point of the project. The resulting code is meant to be something you can gradually refactor, not something that's immediately better or more understandable. Even if a given piece of code is harder to refactor, on a large pre-existing project it's still important to be able to switch over to the new toolchain immediately, without having to manually refactor or rewrite all of the code at once.
And I wanted an excuse to play around with rust some more :D
1. there are three levels of refactoring: removing the extensive (unbearable, to be honest) boilerplate that C2Rust introduces; converting the design from "C with Rust syntax" to safe Rust; converting the design from unidiomatic Rust to idiomatic Rust
2. as another poster pointed out, for non-trivial projects, writing refactoring tooling is a must (to remove the C2Rust boilerplate), in order to perform step 1
3. design refactoring (step 3) difficulty depends on the source code design; the code I worked with was relatively hard to refactor, as it was old (school), in particular, lots of globals; the difficulty was caused by the typical freedoms that C gives and Rust doesn't (in other words, the very obvious design differences between C and Rust); somebody did a C to Rust port of (I think) Zstd, which is a modern codebase, and I think much easier to work with (also because of less, or possibly no, external dependencies)
4. regarding the code understanding, if one performs the translation in the three-steps mentioned in point 1, at the end of step 2, one has effectively a safe Rust codebase, "just" unidiomatic
5. in terms of quantity of changes (but not time spent), it's possible to perform the bulk of step 3 with rather local thinking (understanding), but of course, most of the time spent is on major design changes
6. beside a few steps, I was able to perform a conversion in self-contained steps, which is very good news for this type of work. Even better, it's possible (but that's a niche case) to port an SDL project by using at the same time the C library and the Rust one!
7. however, I can imagine projects like Wolfenstein 3d to be very hard to port, since it's hard to port memory allocators and similar
99. most important of all: just converting to Rust will quickly (even immediately) find bugs in the source; I've found approximately four bugs in the source code, including one by Carmack!
All in all, I find this tool great, but somebody needs to work on refactoring tools, and C2Rust's output must be improved in order to be found usable by the public.
Definitely will thumb through the git history to get an idea of the refactoring efforts.
Thanks a bunch!
> this provides a starting point for manual refactoring into idiomatic and safe Rust
- the base unit is the individual C file, which causes structs and symbols to be duplicated across Rust modules
- for loops are translated to while loops with overflowing additions, which is ugly and unnecessary in pretty much every case (this makes sense, semantically, but it could be used only when necessary, not as general strategy)
- variables are declared at the top of the functions (AFAIR)
C2Rust generates code that requires significant refactoring _before_ the semantic (C->Rust) translation work can begin - as a matter of fact, they had a refactoring tool, but it's been temporarily deprecated.
It's a fantastic tool, but as of now, it requires developers to write their own refactoring tools.
(For anyone else who found it slightly difficult to read, you can remove the added 0.06em `letter-spacing` using your browser's developer tools.)
I could see a really compelling use case in cross-compilation where you compile your C code to Rust, then use a Rust toolchain to cross compile. Or avoiding interop as well.
Calling C directly is already possible in Rust.
The C2Rust project is being developed by Galois and Immunant. This tool is able to translate most C modules into semantically equivalent Rust code. These modules are intended to be compiled in isolation in order to produce compatible object files. We are developing several tools that help transform the initial Rust sources into idiomatic Rust.
The translator focuses on supporting the C99 standard. C source code is parsed and typechecked using clang before being translated by our tool.
You can debate the merits of doing so, of course, but some people do want to do that, and a tool to generate safe, somewhat idiomatic Rust from C code would seem to be useful.
This gains us machine-proved memory safety. This is huge.
C2rust is a springboard; if you move c2rust-ed code to production, you're doing it very wrong.
Not that it’s a good idea, but I could see a scenario where it would be worthwhile.
When it goes wrong, the advice will include "write better comments, so the transpiler knows what you're doing". Proponents will liken this to type-linting comments. Critics will liken this to INTERCAL / p-hacking / tax fraud, and will claim that the transpiler can be misled by confusing comments. Proponents will show you that GPT-4 can identify misleading comments in the critics' examples. Critics will say "real code won't contain comments like that, so this ability is useless". Proponents will say "oh, yeah, that too I guess". Critics will promptly vanish in a puff of logic.
The manually-written tool will get better: more slowly at first, but more steadily, and with only a few (predictable, fixable) correctness bugs. Eventually, it will be able to correctly process more programs than the leading GPT-4 approach can. It will be months before anyone notices this, since the two camps (manual approach, GPT-4 approach) will not really be talking to each other enough.
Eventually, somebody will write a blog post about a semi-obscure but representative benchmark (perhaps the Linux kernel), pointing out that the manual tool works better now. There will be a brief wave of hype about the "new tool" and the "death of AI". Then some people will fine-tune the model on tricksy cases using the manually-written tool's output, some other people will call that utterly cheating, and the hype will give way to bickering.
Realistic? Well… GPT-4 is proprietary, and we've got more efficient LLM architectures now – but I think the sorts of people to make a tool like this will probably stick with OpenAI's APIs. (It's In The Cloud.™)