https://huggingface.co/LLM4Binary/llm4decompile-22b-v2
There's also a dataset floating around HF which is... I think a popular N64 decomp to pseudo-C? Maybe the Mario one?
Even the source for the game I worked on as a developer, published by Eidos in ~1998, is probably lost. I doubt anyone has the Visual SourceSafe database backup CDs lying around, but I could be wrong.
Anyway, for those old titles I don't think not having source is that much of a problem. I participated in two reimplementations of the 1994 XCOM (UFO2000 and OpenXcom) and helped the 1oom project (first Master of Orion), and I don't think having the original source would have helped much.
I'm part of the effort to decompile Super Smash Bros. Melee, and a fellow contributor recently wrote about how we're doing agent-based decompilation: https://stephenjayakar.com/posts/magic-decomp/
what about: see cool app, decompile it, launch competing app.
(repeat)
https://reorchestrate.com/posts/your-binary-is-no-longer-saf...
I am able to translate multi-thousand-line C functions and reproduce the implementation bug for bug.
The initial motivation is to run benchmarks, though the foundation is flexible and can support many other use cases over time.
It's already proving useful. For example, I can run a benchmark, view the results in a dashboard, and even feed the report into Claude Code to answer questions like: "How did changing X affect the results?" or "What could be improved in the next run?"
But coincidentally this seems like an easy win for generated training data. Take all your code and have a compiler spit out assembly as well as a binary. Now your LLM will not only be able to act as a compiler but also make that output useful and understandable to humans.
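To make the idea concrete, here is a rough sketch of what generating those pairs could look like. It assumes `gcc` is on the PATH and that your inputs are self-contained C translation units; the names `compile_to_asm` and `make_pairs` and the record layout are made up for illustration, not taken from any existing dataset or tool.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def compile_to_asm(c_source: str, opt_level: str = "-O2", compiler: str = "gcc") -> str:
    """Compile one C translation unit to assembly text via `<compiler> -S`.

    Assumes the compiler is installed and the source needs no extra include paths.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.c"
        out = Path(tmp) / "unit.s"
        src.write_text(c_source)
        subprocess.run(
            [compiler, "-S", opt_level, "-o", str(out), str(src)],
            check=True,            # raise if the compiler reports an error
            capture_output=True,   # keep compiler chatter out of our output
        )
        return out.read_text()


def make_pairs(sources, opt_levels=("-O0", "-O2"), compile_fn=compile_to_asm):
    """Yield one (hypothetical) training record per source and optimization level."""
    for c_source in sources:
        for opt in opt_levels:
            yield {"opt": opt, "asm": compile_fn(c_source, opt), "c": c_source}


if __name__ == "__main__":
    sample = "int add(int a, int b) { return a + b; }\n"
    if shutil.which("gcc"):
        for record in make_pairs([sample]):
            print(record["opt"], len(record["asm"].splitlines()), "lines of assembly")
    else:
        print("gcc not found; skipping demo")
```

Compiling the same source at several optimization levels gives multiple assembly variants per function, which is presumably where most of the difficulty in going back from assembly to readable C comes from.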