> Very briefly:
> The rev.ng framework is fully open source. You can decompile anything you want from the CLI.
> The UI will be available in the following forms:
> free to use in the cloud for public projects;
> available through a subscription in the cloud for private projects;
> available at a cost as a fully standalone, fully offline application.
In comparison, Hopper costs 100 USD with one year of updates [1]. Ghidra and Radare2 are FOSS and completely free to use; IDA Pro costs a fortune.
Binja is $300 (or $1500 for commercial, both cheaper for students).
Our view is: the engine is 100% open source. The UI is available for free in the cloud for anyone experimenting, which we define as "I'm OK with leaving the project public".
Basically, the decompiler engine is Free Software, extensible and available for automation/scripting, while the UI is available for free for students/researchers and we can make a living out of professionals (i.e., when your company is paying for it).
This is something everyone who uses decompilers says, and it sort of shows how low trust in decompilers is. Expectations have always been rather low.
I've been there, but this does not have to be the case. The whole reason we started rev.ng is to prove that expectations can be raised.
Apart from accuracy, which is difficult but engineering work, why don't decompilers emit syntactically valid C? Have you ever tried to re-compile code from any decompiler? It's a terrible experience.
rev.ng only emits valid C code, and we test it with a bunch of -Wall -Wextra:
https://github.com/revng/revng-c/blob/develop/share/revng-c/...
Other key topic: data structures. When reversing I spend half of the time renaming things and half of the time detecting data structures. The help I get from decompilers with the latter is basically none.
rev.ng, by default, detects data structures on the whole binary, interprocedurally, including arrays. See the linked list example in the blog post. We also have plans to detect enums and other stuff.
Clearly we're not there yet and we still need to work on robustness, but our goal is to increase the confidence in decompilers and actually offer features that save time. Certain tools have made progress in improving the UI and the scripting experience, but there are other things to do beyond that.
I see this a bit like the transition from the phase in which C developers were using macros to ensure things were being inlined/unrolled to the phase where they stopped doing that because compilers got smart enough to do the right thing and to do it much more effectively.
Regarding reliability, I would say that Hex-Rays is pretty reliable (at least for x86) if you know its limitations, like throwing away all code in catch blocks. Wrong decompilation is usually caused by either wrong section permissions or a wrong function signature, both of which can be fixed. It can have a bad time when the stack frame size goes "negative" or some complex dynamic stack array logic is involved, which are usually signs of obfuscation anyway.
It was less reliable 10 years ago, though. Also, even now Hex-Rays weirdly does not support some simple instructions like movbe.
E.g., the next pointer in a linked list should be easy to identify as 'next'.
That would be done by downloading all of GitHub, then seeing what variables in GitHub code have the most similar layouts and interactions, and then if the confidence is high enough, using those names.
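To make the idea concrete, here is a toy sketch of "use the name only if the confidence is high enough". Everything in it is an assumption for illustration: the tiny `CORPUS`, the feature labels and the 0.7 threshold are made up, and a real system would mine features from a large body of source code rather than hardcode them.

```python
from typing import Optional

# Hypothetical corpus: feature sets observed for named fields in existing code.
CORPUS = [
    ({"pointer", "self_referential", "offset_last", "null_checked"}, "next"),
    ({"pointer", "self_referential", "offset_first"}, "prev"),
    ({"integer", "loop_bound", "compared_with_index"}, "length"),
]


def jaccard(a: set, b: set) -> float:
    """Similarity of two feature sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_name(features: set, threshold: float = 0.7) -> Optional[str]:
    """Return the best-matching corpus name, or None when confidence is too low."""
    score, name = max((jaccard(features, known), n) for known, n in CORPUS)
    return name if score >= threshold else None
```

A field that looks exactly like the corpus entry for `next` gets that name, while a field with only a weak match is left unnamed instead of getting a misleading guess.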
However, nowadays, it seems pretty obvious that the right way to do these things is using LLMs.
This said, at this stage, we see ourselves as people building robust infrastructure. Once the infrastructure is there, using some off the shelf model to rename things or add comments is relatively easy.
Basically: we do the hard decompilation work that needs 100% accuracy, and then we can adopt LLMs for things that are OK to be approximate such as names, comments and the like.
Anyway, writing a script that renames stuff is pretty easy. Check out the docs: https://docs.rev.ng/user-manual/model-tutorial/
One could try to train one's own LLM from scratch, using an encoder-decoder (translation, aka seq2seq) architecture trying to predict the correct variable name given the decompiled output.
One could try to use something like GPT-4 with a carefully designed prompt "Given this datastructure, what might be the name for this field?"
One could try to use something pretrained like llama, but then finetune it based on hundreds of thousands of compiled and decompiled programs.
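For the prompt-based option, a minimal sketch of how such a prompt might be assembled is below. The function name, the wording of the prompt and the example struct are all assumptions; no model API is called, since which model or client to use is left open here.

```python
def build_field_naming_prompt(struct_source: str, field: str) -> str:
    """Assemble a prompt asking a model to name one field of a recovered struct."""
    return (
        "You are helping a reverse engineer.\n"
        "Given this data structure recovered by a decompiler:\n\n"
        f"{struct_source}\n\n"
        f"What might be a good name for the field `{field}`? "
        "Answer with a single C identifier."
    )


# Hypothetical decompiler output with placeholder field names.
EXAMPLE_STRUCT = """struct struct_1 {
    int64_t field_0;
    struct struct_1 *field_8;
};"""

prompt = build_field_naming_prompt(EXAMPLE_STRUCT, "field_8")
```

The answer would still need validating (is it a legal identifier, does it clash with an existing name) before being applied.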
Paper that describes what JSNice is doing behind the scenes: https://files.sri.inf.ethz.ch/website/papers/jsnice15.pdf
And looking at the code contributions: https://github.com/revng/revng/graphs/contributors
Isn't it a bit weird that the CEO (aleclearmind) has most commits, even much more than the CTO (pfez)? I often hear the complaints from other CEOs that they don't really find any time anymore to code... Even the CTO usually is more on the managing side and less active in actual coding.
Anyway, if this works, then I guess it's a lot of fun for them.
Edit: Ah right, I didn't check the timeline.
https://github.com/revng/revng-c/commits/develop/
Eventually we'll merge the two repos.
Also, I develop stuff every day. For some reason GitHub is not picking up my user correctly.
> Anyway, if this works, then I guess it's a lot of fun for them.
It is!
So the downvotes are because this is not interesting or not unusual?
[orchestra] [darkstar@shiina revng]$ ./revng artifact --analyze --progress decompile-to-single-file ../maytag.ko
[=======================================] 100% 0.57s Analysis list revng-initial-auto-analysis (5): import-binary
[===================> ] 50% 0.57s Run analyses lists (2): revng-initial-auto-analysis
[=========> ] 25% 0.57s revng-artifact (2): Run analyses
Only ELF executables and ELF dynamic libraries are supported
[orchestra] [darkstar@shiina revng]$ file ../maytag.ko
../maytag.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (FreeBSD), not stripped
Does it not support FreeBSD binaries?
Edit: Ah, I missed that it doesn't support kernel modules. It probably has nothing to do with FreeBSD but with the fact that this is not a simple executable.
I opened issue #366 for it already
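For anyone wanting to check this up front: the "relocatable" that `file` prints comes from the `e_type` field of the ELF header. The sketch below reads it with the standard library only. It is an illustration of the ELF format, not of how revng performs its own check, and the function name is made up.

```python
import struct

# e_type values defined by the ELF specification.
ET_NAMES = {1: "relocatable", 2: "executable", 3: "shared object"}


def elf_type(header: bytes) -> str:
    """Classify an ELF file from its first 18 bytes (e_ident plus e_type)."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ET_NAMES.get(e_type, f"other ({e_type})")
```

Usage would be something like `elf_type(open("../maytag.ko", "rb").read(18))`. A kernel module such as the one above reports "relocatable", while regular programs and dynamic libraries report "executable" or "shared object" (position-independent executables also show up as the latter).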
> `source ./environment`
That's a bad omen. I downloaded the tar to find it does indeed set a bunch of environment variables including PATH, though thankfully not LD_LIBRARY_PATH. Mostly prefixed "HARD_" which is maybe unique (REVNG would be a more obvious choice, colliding with existing environment variables is a bad thing).
It sets `AWS_EC2_METADATA_DISABLED="true"` which won't break me (I don't use AWS) but in general seems dubious.
export RPATH_PLACEHOLDER="////////////////////////////////////////////////$ORCHESTRA_ROOT"
export HARD_FLAGS_CXX_CLANG="-stdlib=libc++"
... "-Wl,-rpath,$RPATH_PLACEHOLDER/lib ...
This is suboptimal. The very long PATH setting with mingw32 and gentoo and mips strings in it also looks very fragile.
I usually bail when the running instructions include "now mangle your environment variables" because that step is really strongly correlated with programs that don't work properly on my non-Ubuntu system. Wiring your application control flow through the launching environment introduces a lot of failure modes - it's not as convenient as it first appears. Very like global variables.
Clang will burn a lot of this stuff in as defaults when you build it if you ask, e.g. `-DCLANG_DEFAULT_CXX_STDLIB=libc++` would remove the stdlib setting environment variable. DEFAULT_SYSROOT is useful too.
Using rpath means you're vulnerable to someone running this script with LD_LIBRARY_PATH set, as the environment variable will override the DT_RUNPATH setting in your binaries. The background on this is aggravating. Abbreviating here, `-Wl,-rpath` no longer means rpath, it means 'runpath', which is a similar but much less useful construct. The badly documented invocation you probably want is `-Wl,-rpath,<dir> -Wl,--disable-new-dtags` to set rpath instead of runpath, at which point the loader will search your rpath before LD_LIBRARY_PATH when looking for libraries.
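The difference is easier to see as a lookup order. This is a simplified model of what the glibc loader documents in ld.so(8) for a direct dependency. The function name and the example directories are made up, and the trailing default directories vary by distribution.

```python
def search_order(rpath=(), runpath=(), ld_library_path=()):
    """Directories a glibc-style loader tries, in order, for a direct dependency."""
    order = []
    # DT_RPATH is honoured only when the object has no DT_RUNPATH.
    if rpath and not runpath:
        order += list(rpath)
    # The environment comes next, so it beats DT_RUNPATH but not DT_RPATH.
    order += list(ld_library_path)
    order += list(runpath)
    # Simplified tail: the real list of default directories differs per system.
    order += ["<ld.so.cache>", "/lib", "/usr/lib"]
    return order


# Hypothetical paths: a bundled toolchain versus a stray module-system setting.
with_runpath = search_order(runpath=["/opt/tool/lib"], ld_library_path=["/hpc/lib"])
with_rpath = search_order(rpath=["/opt/tool/lib"], ld_library_path=["/hpc/lib"])
```

With runpath the stray `/hpc/lib` is searched first; with rpath the bundled directory is.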
There's a good chance you can completely remove the environment mangling through a combination of setting different flags when building clang, static linking and embedding binaries in other binaries.
Related, your clang-16 binary is dynamically linked. As in it goes looking for things like libLLVMAArch64CodeGen.so.16 at runtime. A lot of failure modes can be removed by LLVM_BUILD_STATIC=ON. E.g. if I run your dynamically linked clang with a module based HPC toolchain active, your compiler will pick up the libraries from the HPC toolchain and it'll have a bad time. The tools are all linked against glibc as well, pros and cons to that.
Tools are also linked against libc++.so, which is linked against libc++abi.so and so forth. Worth considering static libc++, but even if you decline that, libc++abi and libunwind can and probably should be statically linked into the libc++. The above rpath rant? Runpath isn't transitive, so dynamic libraries finding other dynamic libraries using runpath (the one you get when you ask for rpath) works really poorly.
Context for there being so many suggestions above - I am completely out of patience with distributing dynamically linked programs on Linux. I don't want a stray environment variable from some program that had `source ourhack` in the readme or a "module system" to reach into my application and rewire what libraries it calls at runtime as the user experience and subsequent bug report overhead is terrible. Static linking is really good in comparison.
Thanks again for shipping, and I hope some of the above feedback is helpful!
In truth, we suggest doing that only so you use the GCC we distribute for the demo binary. The actual way this is intended to be used is through the `./revng` script. In that way, the environment changes only affect the invocation of `revng`.
This is documented here: https://docs.rev.ng/user-manual/working-environment/ We should probably add a warning about `source ./environment`.
Now, let's get to each of your comments :D
> though thankfully not LD_LIBRARY_PATH
We spent a lot of time to have a completely self-contained set of binaries where each ELF refers to its dependencies through relative paths. LD_LIBRARY_PATH is evil.
> Mostly prefixed "HARD_"
Those are just used by our compiler wrappers, I don't think those environment variables collide with anything in practice.
> It sets `AWS_EC2_METADATA_DISABLED="true"`
Original discussion: https://github.com/revng/revng/pull/309#discussion_r12805759...
I guess we could patch the AWS SDK to avoid this. Anyway, it only matters when rev.ng is running in the cloud.
> export RPATH_PLACEHOLDER=... > export HARD_FLAGS_CXX_CLANG=...
Those are used when linking binaries translated by revng. If you're not interested in end-to-end binary translation, they don't matter.
> it means 'runpath' which is a similar but much less useful construct
We specifically want DT_RUNPATH. DT_RPATH is deprecated and there might be a use case for overriding our libraries with LD_LIBRARY_PATH.
> There's a good chance you can completely remove the environment mangling
I think your observations concerning "mangling the environment" are only valid for non-private environment variables. The following variables are private: RPATH_PLACEHOLDER, HARD_*, REVNG_*. Also, they are all only for binary translation purposes. We could push them down into some smaller-scoped compiler wrappers, but that makes sense only if we can get rid of the environment entirely, which we can't because we ship Python.
> a combination of setting different flags when building clang
No, the flags also affect the linker and there are some features of our wrappers that cannot simply be burned in. We can push them into more private places, though.
> a lot of failure modes can be removed > libc++abi and libunwind can and probably should be statically linked into the libc++
We no longer have issues with that; our build system is pretty reliable in that regard. LLVM is just one of the components: these things need to work robustly in general, and they do (with quite some effort).
You seem to be wary of dynamic linking. We put some effort into it, and now it works pretty well: it always looks things up in the right place, without ever hardcoding absolute paths anywhere and without any install phase that "patches" the binaries. The unpacked directory can be moved wherever you want.
> I am completely out of patience with distributing dynamically linked programs on Linux
You're thinking of some other solution. Ours does not use LD_LIBRARY_PATH, and all the binaries reference each other in a robust way using `$ORIGIN`. Try:
./root/bin/python ./root/bin/revng artifact --help
It works.
But again, doing `source environment` is mostly for demo purposes. In the actual use case, you just do `./revng` and your environment is untouched.
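For readers who haven't met `$ORIGIN`: the loader replaces it with the directory containing the object being loaded, which is why the unpacked tree can be moved around. A rough sketch of that expansion follows. The function name, the `$ORIGIN/../lib` entry and the install paths are assumptions for illustration, and symlink resolution is ignored.

```python
import os


def expand_origin(entry: str, binary_path: str) -> str:
    """Expand $ORIGIN in a runpath/rpath entry relative to the given object."""
    origin = os.path.dirname(os.path.abspath(binary_path))
    # Both spellings are accepted by the loader.
    expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    return os.path.normpath(expanded)


# The same relative entry resolves correctly wherever the tree is unpacked.
first = expand_origin("$ORIGIN/../lib", "/opt/revng/root/bin/revng")
moved = expand_origin("$ORIGIN/../lib", "/home/me/tools/root/bin/revng")
```

Here `first` is `/opt/revng/root/lib` and `moved` is `/home/me/tools/root/lib`, with no absolute path stored in the binary itself.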
We ship our Python, but you don't have to use it: you're supposed to just do ./revng (or interact over the network in daemon mode).
Our approach is: use whatever tool you like for scripting as long as it can parse our YAML project file, make changes to it, and then invoke `./revng artifact` (or interact with the daemon): https://docs.rev.ng/user-manual/model-tutorial/
Result: we get to use our Python version (the latest) and you get to use whatever language you like. Then we'll provide wrappers on PyPI that help you with that and are compatible with a large set of Python versions.
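As a rough picture of what the "make changes to it" step could look like from an external script, here is a sketch that renames functions in an already-parsed project file. The keys `Functions`, `Entry` and `CustomName` and the address format are hypothetical placeholders, not the documented schema (the tutorial linked above has the real one). Loading and saving the YAML and handing the result back to `./revng artifact` are left out.

```python
def rename_functions(model: dict, new_names: dict) -> int:
    """Set a name on every function whose entry address appears in new_names.

    The keys "Functions", "Entry" and "CustomName" are hypothetical.
    Returns how many functions were renamed.
    """
    renamed = 0
    for function in model.get("Functions", []):
        name = new_names.get(function.get("Entry"))
        if name is not None:
            function["CustomName"] = name
            renamed += 1
    return renamed


# A stand-in for a project file parsed with any YAML library.
model = {"Functions": [{"Entry": "0x401000"}, {"Entry": "0x401020"}]}
count = rename_functions(model, {"0x401000": "parse_header"})
```

The point of the design is that the script only touches plain data, so it can run under any Python version (or any other language with a YAML parser).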
tl;dr Don't `source ./environment`, use `./revng`.
> Thanks again for shipping, and I hope some of the above feedback is helpful!
I'm happy there's someone that cares about this :D
Our next big iteration of this might involve simplifying things a lot by adopting nix + mount namespace to make /nix/store available without root.
Maybe this is not the right place for discussing this, we can chat on our discord server if you'd like :)
You haven't set LD_LIBRARY_PATH but other people will. Also LIBRARY_PATH, and they will put other stuff on PATH and so forth. Module systems are especially prone to this, but ending up with .bashrc doing it happens too.
You have granted the user the ability to override parts of the toolchain with environment variables and moving files to various different directories. That's nice. Some compiler devs will appreciate it. Also it's doing the thing Linux recommends for things installed globally so that's defensible.
In exchange, you will get bug reports saying "your product does not work", where the root cause eventually turns out to be "my linker chose a different library to my loader for some internal component". You also lose however many people try the product once, see it immediately fall over and don't take the time to tell you about the experience.
I think that's a bad trade-off. Static linking is my preferred fix, but generally anything that stops forgotten environment variables breaking your software in confusing ways is worth considering.
"He also met a partner in crime, Pietro. Romantically enough, he met him thanks to a book which will turn out to be foundational for company."
Congrats on the launch.
Then I find this book, which seems very dense, but clear. So I ask my advisor if I could buy it and he goes like "well, first check out the university library". I check it out and there's a copy, but... it's taken.
Working in the only group that was doing research on compilers I'm like "who dares do compilers stuff out of our group!?".
I go to the library:
Me: who has the book?
Library guy: can't tell you, privacy reasons.
Me: what's the third letter of its surname?
Library guy: Z
Me: what's the second letter of its name?
Library guy: I
Me: thanks.
I go here: https://www.deib.polimi.it/ita/personale-lista-alfabetica I found him.
Fast forward, we become friends and we start the company together.
> Congrats on the launch.
Thanks! It was a lot of work.
Then we switched to VSCode, which happens to be able to run in the browser. So we added some magic kubernetes sauce and voilà, you got the cloud decompiler with exactly the same user experience as the fully standalone one.
We still need to perform some QA on collaboration, but it basically works. One daemon, many clients. Very simple architecture.
I think we got inspiration to do this from a CTF where we were doing "collaboration" using IDA with multiple windows on an X session on a server with multiple cursors. Very cursed, but effective.
Roadmap item: https://rev.ng/roadmap#feature-798
Design pad: https://pad.rev.ng/s/eDHi2PUoP#
It has been working very well. Two regrets:
1. Not rebasing our fork of QEMU for years has put us in a bad spot. But just today a member of our team managed to lift stuff with the latest QEMU. And he has also been able to lift Qualcomm Hexagon code, for which we helped to add support in QEMU. Eventually we'll be the first proper Hexagon decompiler :)
2. Focusing too much on QEMU led our frontend to be tightly coupled with QEMU. It will now take some effort to enable support for additional, non-QEMU-based frontends. But it's not impossible: our idea is to let users add support for a new architecture by defining, in C, a struct for the CPU state and a bunch of functions acting on it. That's it. No need to learn any internal representation.
tl;dr QEMU was a great choice. It worked so well that we didn't touch that part of the codebase for too long, and now there's some technical debt there. But we're addressing it.