JSIR: A High-Level IR for JavaScript (opens in new tab)

(discourse.llvm.org)

84 pointsnnx2mo ago32 comments

32 comments

26 comments · 9 top-level

catapart2mo ago· 4 in thread

asking as someone who is writing a game engine in javascript with the intention to 'transpile' the games' source into a C# project for a native runtime: this provides a map that allows automated translation from javascript source to C# source, right?

gobdovan2mo ago

No. JSIR is primarily for JS -> IR -> JS for analysis and source-to-source transformation. It's not a ready-made bridge for emitting other languages

You could use it as an intermediate form in a JS->C# pipeline, but you still have to define a subset of JavaScript that lowers cleanly to your target C# runtime and implementing the IR->C# lowering yourself.

I'd imagine the hard part is not the IR, but aligning the JavaScript semantics (object model, closures, prototypes etc.) with C# (static type system, different execution model..).

catapart2mo ago

Right on. That makes sense. Thanks for spelling it out!

I do think aligning the semantics will be the easier part, honestly, because I'm only trying to transpile the supported source for the game engine. Since that's all written in typescript and I'm not guaranteeing full parity if you are trying to transpile arbitrary ts/js (only the source that can be parsed the same way the game engine is parsed), I'm expecting it to be a near 1-to-1 conversion. I started writing everything in C# and copied the structure to JS, knowing that this was the eventual plan, so the JS can actually be re-written as C# with a pretty simple regex tokenizer.

My hope, here, is that by having the code morphed into an IR, that the IR would be some kind of well-known IR that - for instance - C# could also be morphed into and - therefore - would allow automatic parsing back and forth. From what you're saying, though, it sounds like IRs don't use a common structure for describing code (I'm guessing because of the semantic misalignment you mention between a wide variety of different paradigms?), so this would only work if I made the map from IR to C# which would be just as complex (or more so) than just regexing my JS into C#. If I've got that right, that's a bummer, but understandable. If I'm wrong, though, happy to learn more!

1 more reply

gavinray2mo ago

It's probably easier to add a JS Guest Language to the CLR than transpile to C#

CLR already has multiple Language front-ends (C#, F#, VB, IronPython)

catapart2mo ago

A solid suggestion, but a big point of porting it to C# is the performance gains, which the CLR would mitigate. I know it'll be faster than running in a browser - where the game will also run - but if you're offering something for "performance", I don't think the time is best spent on making my job of composing the package easier. I think I'd rather try to figure out how to go whole-hog and compile as much of the game into an AOT package as possible. But, for what it's worth, the entire game engine was written in C# and ported into JS for the express purpose of being able to back-port the packaged code into C#. So I'm hoping it's not too onerous to do the native transpilation, either.

croes2mo ago· 3 in thread

IR = Intermediate Representation

https://en.wikipedia.org/wiki/Intermediate_representation

hootz2mo ago

Writers really should remember to write acronyms by their full name at least once at the beginning of an article.

tamimio2mo ago

Thank you, half way through the article and I am thinking infrared.

giorgioz2mo ago

I also didn't know the acronym IR. A good solution is passing the URL to ChatGPT and asking "what does IR mean in this url: "

conartist62mo ago· 3 in thread

They're presenting this under the banner "the need for source to source transformations."

That seems a bit disingenuous given this is not a source-preserving IR! All comments and nonstandard spacing would be completely removed from your code if you gave it a round trip through this format. That doesn't sound like 99.9% source recovery to me...

gobdovan2mo ago

The 99.9% is less impressive than you'd think, currently they're not even keeping the same program behaviour 0.1% of the time. They also mention AST in the pipeline, not CST, so I wouldn't expect source-preservation to be a direct goal.

Also, if you use a nonstandard spacing, I'd say that's on you to preserve a mechanical Source-AST mapping if you want to use any tool that mentions dataflow analysis & transforms.

As a side note, comments are much trickier than non-standard spacing if their positioning is semantic.

fg1372mo ago

Especially as JSDoc is used for type annotations and even type checking in some projects. It's not in ECMA standard but widely used and well supported. The IR as it is implemented today will run into serious trouble.

throawayonthe2mo ago

the comments are preserved afaict from replies in that thread

jhavera2mo ago· 2 in thread

Interesting timing. We have been working on something that takes the opposite design philosophy. JSIR is designed for high-fidelity round-trips back to source, preserving all information a human author put in. That makes sense when the consumer is a human-facing tool like a deobfuscator or transpiler.

We have been exploring what an IR looks like when the author is an AI and the consumer is a compiler, and no human needs to read the output at all. ARIA (aria-ir.org) goes the other direction from JSIR. No source round-trip, no ergonomic abstractions, but first-class intent annotations, declared effects verified at compile time, and compile-time memory safety.

The use cases are orthogonal. JSIR is the right tool when you need to understand and transform code humans wrote. ARIA is the right tool when you want the AI to skip the human-readable layer entirely.

The JSIR paper on combining Gemini and JSIR for deobfuscation is a good example of where the two worlds might intersect. Curious whether you have thought about what properties an IR should have to make that LLM reasoning more reliable.

oldmanhorton2mo ago

> when the author is an AI and the consumer is a compiler, and no human needs to read the output at all.

This seems like a big bet on the assumption that fully autonomous codegen without humans in the loop is imminent if not already present - frankly, I hope you are wrong.

Even if that comes to pass in some cases, I also find it hard to believe that an LLM will ever be able to generate code in any new language at the same level with which it can generate stack overflow-shaped JavaScript and python, because it’ll never have as robust of a training set for new languages.

measurablefunc2mo ago

I recently wrote a simple interpreter for a stack based virtual machine for a Firefox extension to do some basic runtime programming b/c extensions can't generate & evaluate JavaScript at runtime. None of the consumer AIs could generate any code for the stack VM of any moderate complexity even though the language specification could fit on a single page.

We don't have real AI & no one is anywhere near anything that can consistently generate code of moderate complexity w/o bugs or accidental issues like deleting files during basic data processing (something I ran into recently while writing a local semantic search engine for some of my PDFs using open source neural networks).

1 more reply

pizlonator2mo ago· 2 in thread

> Industry trend of building high-level language-specific IRs

"Trend"?

This was always the best practice. It's not a "trend".

sjrd2mo ago

It seems to me that there's a certain "blindness" between two compiler worlds.

Compiler engineers for mostly linear-memory languages tend to only think in terms of SSA, and assume it's the only reasonable way to perform optimizations. That transpires in this particular article: the only difference between an AST and what they call IR is that the latter is SSA-based. So it's like for them something that's not SSA is not a "serious" data structure in which you can perform optimizations, i.e., it can't be an IR.

On the other side, you actually have a bunch of languages, typically GC-based for some reason, whose compilers use expression-based structures. Either in the form of an AST or stack-based IR. These compilers don't lack any optimization opportunities compared to SSA-based ones. However it often happens that compiler authors for those (I am one of them) don't always realize all the optimization set that SSA compilers do, although they could very well be applied in their AST/stack-based IR as well.

gobdovan2mo ago

I think the WASM world is a clear example that bridges the gap you're describing.

You usually compile from SSA to WASM bytecode, and then immediately JIT (Cranelift) by reconstructing an SSA-like graph IR. If you look at the flow, it's basically:

Graph IR -> WASM (stack-based bytecode) -> Graph IR

So the stack-based IR is used as a kind of IR serialization layer. Then I realized that this works well because a stack-based IR is just a linearized encoding of a dataflow graph. The data dependencies are implicit in the stack discipline, but they can be recovered mechanically. Once you see that, the blindness mostly disappears, since the difference between SSA/graph IRs and expression/stack-based IRs is about how the dataflow (mostly around def-use chains) is represented rather than about what optimizations are possible.

Fom there it becomes fairly obvious that graph IR techniques can be applied to expression-based structures as well, since the underlying information is the same, just represented differently.

Didn't look close enough to JSIR, but from looking around (and from building a restricted Source <-> Graph IR on JS for some code transforms), it basically shows you have at least a homomorphic mapping between expression-oriented JS and graph IR, if not even a proper isomorphism (at least in a structured and side-effect-constrained subsets).

1 more reply

fg1372mo ago· 2 in thread

It's funny they bother to bring up the half dead "Google Closure Compiler" as an example.

And my dumb brain still don't understand how IR is "better" than AST after reading this post. Current AST based JS tools working reasonably well, and it's not clear to me how introducing this JSIR helps tool authors or downstream users, when there are all those roadblocks mentioned at the end.

100ms2mo ago

Why is it half dead? They only jettisoned the ancient support library, the compiler itself AFAIK remains best in class and has commits on GitHub as of 15 hours ago

fg1372mo ago

Something as important as https://github.com/google/closure-compiler/issues/2731 that is in ES2022 standard (Yes, released near 4 years ago) is still not implemented. Lots of new code don't compile with it.

It has become a liability in the build process and people are getting rid of it.

sheepscreek2mo ago· 1 in thread

This is exciting stuff!

My interpretation: If the JSIR project can successfully prove bi-directional source to MLIR transformation, it could lead to a new crop of source to source compilers across different languages (as long as they can be lowered to MLIR and back).

Imagine transmorphing Rust to Swift and back. Of course you’d still need to implement or shim any libraries used in the source language. This might help a little bit with C++ to Rust conversions - as more optimizations and analysis would now be possible at the MLIR level. Though I won’t expect unsafe code to magically become safe without some manual intervention.

jeswin2mo ago

For tsonic (https://github.com/tsoniclang/tsonic) which is trying to convert TS to C# and then to native binary via NativeAOT, I took almost the opposite tradeoff from JSIR.

JSIR is optimizing for round-trips back to JavaScript source. But since in language to language conversion teh consumer is a backend emitter (C# in my case), instead of preserving source structure perfectly, my IR preserves resolved semantic facts: types, generic substitutions, overload decisions, package/binding resolution, and other lowering-critical decisions.

I could be wrong, but I suspect transpilers are easier to build if it's lowering oriented (for specific targets).

hajile2mo ago

I want them to finish the official TC39 binary AST proposal. Nearly twice as fast to parse and a bit smaller than minified code makes it a pretty much universally useful proposal.

https://github.com/tc39/proposal-binary-ast

jcuenod2mo ago

I came across this project in the last couple of days too. Being able to decompile from Hermes bytecode sounds awesome.

Here's the repo: https://github.com/google/jsir (it seems not everything is public).

Here's a presentation about it: https://www.youtube.com/watch?v=SY1ft5EXI3I (linked in from the repo)

j / k navigate · click thread line to collapse

32 comments

26 comments · 9 top-level

catapart2mo ago· 4 in thread

gobdovan2mo ago

No. JSIR is primarily for JS -> IR -> JS for analysis and source-to-source transformation. It's not a ready-made bridge for emitting other languages

I'd imagine the hard part is not the IR, but aligning the JavaScript semantics (object model, closures, prototypes etc.) with C# (static type system, different execution model..).

catapart2mo ago

Right on. That makes sense. Thanks for spelling it out!

1 more reply

gavinray2mo ago

It's probably easier to add a JS Guest Language to the CLR than transpile to C#

CLR already has multiple Language front-ends (C#, F#, VB, IronPython)

catapart2mo ago

croes2mo ago· 3 in thread

IR = Intermediate Representation

https://en.wikipedia.org/wiki/Intermediate_representation

hootz2mo ago

Writers really should remember to write acronyms by their full name at least once at the beginning of an article.

tamimio2mo ago

Thank you, half way through the article and I am thinking infrared.

giorgioz2mo ago

I also didn't know the acronym IR. A good solution is passing the URL to ChatGPT and asking "what does IR mean in this url: "

conartist62mo ago· 3 in thread

They're presenting this under the banner "the need for source to source transformations."

gobdovan2mo ago

Also, if you use a nonstandard spacing, I'd say that's on you to preserve a mechanical Source-AST mapping if you want to use any tool that mentions dataflow analysis & transforms.

As a side note, comments are much trickier than non-standard spacing if their positioning is semantic.

fg1372mo ago

throawayonthe2mo ago

the comments are preserved afaict from replies in that thread

jhavera2mo ago· 2 in thread

oldmanhorton2mo ago

> when the author is an AI and the consumer is a compiler, and no human needs to read the output at all.

This seems like a big bet on the assumption that fully autonomous codegen without humans in the loop is imminent if not already present - frankly, I hope you are wrong.

measurablefunc2mo ago

1 more reply

pizlonator2mo ago· 2 in thread

> Industry trend of building high-level language-specific IRs

"Trend"?

This was always the best practice. It's not a "trend".

sjrd2mo ago

It seems to me that there's a certain "blindness" between two compiler worlds.

gobdovan2mo ago

I think the WASM world is a clear example that bridges the gap you're describing.

You usually compile from SSA to WASM bytecode, and then immediately JIT (Cranelift) by reconstructing an SSA-like graph IR. If you look at the flow, it's basically:

Graph IR -> WASM (stack-based bytecode) -> Graph IR

Fom there it becomes fairly obvious that graph IR techniques can be applied to expression-based structures as well, since the underlying information is the same, just represented differently.

1 more reply

fg1372mo ago· 2 in thread

It's funny they bother to bring up the half dead "Google Closure Compiler" as an example.

100ms2mo ago

Why is it half dead? They only jettisoned the ancient support library, the compiler itself AFAIK remains best in class and has commits on GitHub as of 15 hours ago

fg1372mo ago

It has become a liability in the build process and people are getting rid of it.

sheepscreek2mo ago· 1 in thread

This is exciting stuff!

jeswin2mo ago

For tsonic (https://github.com/tsoniclang/tsonic) which is trying to convert TS to C# and then to native binary via NativeAOT, I took almost the opposite tradeoff from JSIR.

I could be wrong, but I suspect transpilers are easier to build if it's lowering oriented (for specific targets).

hajile2mo ago

I want them to finish the official TC39 binary AST proposal. Nearly twice as fast to parse and a bit smaller than minified code makes it a pretty much universally useful proposal.

https://github.com/tc39/proposal-binary-ast

jcuenod2mo ago

I came across this project in the last couple of days too. Being able to decompile from Hermes bytecode sounds awesome.

Here's the repo: https://github.com/google/jsir (it seems not everything is public).

Here's a presentation about it: https://www.youtube.com/watch?v=SY1ft5EXI3I (linked in from the repo)

j / k navigate · click thread line to collapse