https://huggingface.co/LLM4Binary/llm4decompile-22b-v2
There's also a dataset floating around HF which is... I think a popular N64 decomp to pseudo-C? Maybe the Mario one?
Even for the game I was a developer on, which was published by Eidos in ~1998, the source is probably lost. I doubt anyone still has the Visual SourceSafe database backup CDs lying around, but I could be wrong.
Anyway, for those old titles I don't think the lack of source is much of a problem. I participated in two reimplementations of the 1994 XCOM (UFO2000 and OpenXcom) and helped the 1oom project (the first Master of Orion), and I don't think having the original source would have helped much.
Coincidentally, this seems like an easy win for generated training data. Take all your code and have a compiler spit out the assembly as well as the binary. Trained on those pairs, your LLM would not only be able to act as a compiler but also turn compiler output back into something useful and understandable by humans.
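A minimal sketch of that data-generation idea, assuming `gcc` is on your PATH (any compiler that accepts `-S` should work the same way). The function names and the JSONL field names (`input_asm`, `target_c`) are made up for illustration, and it only covers the assembly half; pairing with the binary would need a disassembly step on top.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def asm_command(compiler: str, src: Path, out: Path, opt: str = "-O2") -> List[str]:
    """Build the compiler invocation; -S stops after emitting assembly."""
    return [compiler, "-S", opt, str(src), "-o", str(out)]


def compile_to_asm(c_source: str, opt: str = "-O2", compiler: str = "gcc") -> Optional[str]:
    """Return assembly text for c_source, or None if it cannot be compiled here."""
    if shutil.which(compiler) is None:
        return None  # compiler not installed; skip instead of crashing
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.c"
        out = Path(tmp) / "unit.s"
        src.write_text(c_source)
        result = subprocess.run(
            asm_command(compiler, src, out, opt),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            return None  # source did not compile; drop it from the dataset
        return out.read_text()


def make_record(c_source: str, asm: str, opt: str) -> str:
    """One JSONL line: assembly is the model input, original source is the target.

    The field names are hypothetical, not any existing dataset's schema.
    """
    return json.dumps({"opt": opt, "input_asm": asm, "target_c": c_source})


if __name__ == "__main__":
    source = "int add(int a, int b) { return a + b; }\n"
    # Compiling the same source at several optimization levels gives
    # several different assembly inputs for one target.
    for opt in ("-O0", "-O2"):
        asm = compile_to_asm(source, opt)
        if asm is not None:
            print(make_record(source, asm, opt))
```

Swapping input and target in `make_record` gives you the "LLM as compiler" direction instead; the pairs are the same either way.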
https://reorchestrate.com/posts/your-binary-is-no-longer-saf...
I am able to translate multi-thousand-line C functions and reproduce the implementation bug-for-bug.
I'm part of the effort to decompile Super Smash Bros. Melee, and a fellow contributor recently wrote about how we're doing agent-based decompilation: https://stephenjayakar.com/posts/magic-decomp/
what about: see cool app, decompile it, launch competing app.
(repeat)
The initial motivation is to run benchmarks, though the foundation is flexible and can support many other use cases over time.
It's already proving useful. For example, I can run a benchmark, view the results in a dashboard, and even feed the report into Claude Code to answer questions like: "How did changing X affect the results?" or "What could be improved in the next run?"