Related question: has anyone tried to create a high-level language for doing this kind of madness?
Perhaps most importantly, this matters for regexp compilation, as that is a fairly common use case [1].
Some other examples I can think of: JIT compiling an eval statement, constructing a filter for a database scan, a CPU-based pixel shader, etc.
A lot of code is JIT compiled, but executed only once. So this was a pretty interesting article with practical performance implications in some scenarios.
[1]: See for example https://www.pcre.org/original/doc/html/pcrejit.html
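To make the "filter for a database scan" case concrete, here is a minimal Python sketch that generates source for a specialized predicate and compiles it once with the built-in `compile` and `exec`. The name `make_filter` and the `(column, op, value)` condition format are hypothetical, made up for this example. It produces Python bytecode rather than machine code, so it only shows the shape of the idea, not the performance characteristics discussed in the article.

```python
# Hypothetical whitelist of comparison operators the generator accepts.
_ALLOWED_OPS = {"==", "!=", "<", "<=", ">", ">="}


def make_filter(conditions):
    """Generate and compile a predicate specialized for `conditions`.

    `conditions` is a list of (column, op, value) tuples; rows are dicts.
    """
    parts = []
    for column, op, value in conditions:
        if op not in _ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        # repr() embeds the column name and the constant as source literals.
        parts.append(f"row[{column!r}] {op} {value!r}")
    body = " and ".join(parts) if parts else "True"
    source = f"def predicate(row):\n    return {body}\n"
    namespace = {}
    exec(compile(source, "<generated filter>", "exec"), namespace)
    return namespace["predicate"]


rows = [
    {"id": 1, "age": 17, "country": "NZ"},
    {"id": 2, "age": 42, "country": "NZ"},
    {"id": 3, "age": 35, "country": "FI"},
]
# Compile once, then reuse the specialized predicate for the whole scan.
adult_nz = make_filter([("age", ">=", 18), ("country", "==", "NZ")])
matching_ids = [row["id"] for row in rows if adult_nz(row)]
```

The generated function has the constants and column names baked in, so the scan loop does no per-row interpretation of the condition list. Whether that pays off depends on how many rows the predicate sees before it is thrown away.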
The problem is that the example presented requires a memory page with write + execute permissions (at the same time). I suspect many JITs don't do this for security reasons (and to deal with OSes which don't allow it), as it may make it easier for an attacker to gain arbitrary code execution. It's likely that many JITs toggle between write and execute permissions, rather than have both enabled at the same time. Whilst this reduces attack surface, changing permissions on allocated memory requires syscalls, which are quite expensive in terms of performance.
The scenario presented in the article avoids the impact of syscalls, to maximize performance, leaving only the impact caused by the processor itself. If a JIT isn't overly concerned with this type of security, using write+execute memory could be a way to avoid syscall overhead. On the other hand, if a JIT does toggle permissions, the syscall overhead is likely much more significant than overheads caused by the processor (although the techniques shown might still help depending on how the JIT engine works).
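A rough sketch of the toggling approach, assuming Linux (or another Unix where `ctypes.CDLL(None)` exposes `mprotect`). The helper `set_protection` and the placeholder bytes are made up for illustration, and the page is never actually executed here; the point is only that every write phase costs two extra syscalls.

```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE

libc = ctypes.CDLL(None, use_errno=True)
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.mprotect.restype = ctypes.c_int

toggles = 0  # counts mprotect syscalls issued by this sketch


def set_protection(address, length, prot):
    """Change page permissions; each call is one mprotect syscall."""
    global toggles
    if libc.mprotect(address, length, prot) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    toggles += 1


# Anonymous private mapping, initially writable but not executable.
buf = mmap.mmap(-1, PAGE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
anchor = ctypes.c_char.from_buffer(buf)  # keeps the buffer exported
address = ctypes.addressof(anchor)

# Write phase: emit placeholder bytes (0xC3 is `ret` on x86).
buf[0:1] = b"\xc3"

# Execute phase: the page is now R+X and must not be written to.
set_protection(address, PAGE, mmap.PROT_READ | mmap.PROT_EXEC)
# ... a real JIT would call into the page here ...

# Patching requires flipping back to R+W first: a second syscall.
set_protection(address, PAGE, mmap.PROT_READ | mmap.PROT_WRITE)
buf[0:1] = b"\x90"
```

With a write+execute page both `set_protection` calls disappear, which is the trade-off described above.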
Would be nice, but historically it's been more like:
- all JITs do this
- OpenBSD creates W^X
- open source JITs and other OSes start to incorporate W^X
- things are now either W^X compatible or haven't been ported to a W^X OS yet.
Not that I'd recommend this, but alternatively you could also map the exact same memory twice, once with R+X and once with R+W. The attacker would then need to figure out the writable address. Unfortunately there are probably a lot of ways to accidentally leak this information to the attacker...
There are still plenty of use cases where inputs can be trusted.
I wasn't very clear but I was really thinking of 'self-modifying' code, in contrast to conventional JIT as with regex (once generated, the binary is presumably never modified).
Ideally not, but take a look at real life code bases... Just like with SQL prepared statements.
This kind of waste happens often due to reduced visibility. Abstraction layers are good at hiding details like construction cost. Repeated construction happens accidentally pretty easily.
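A small Python sketch of how that tends to get fixed: put a cache in front of the construction step so callers that rebuild the same thing over and over only pay once. The helper names are hypothetical, and `re.compile` is just a stand-in for any expensive compile step (Python's `re` module already keeps an internal cache of its own).

```python
import re
from functools import lru_cache

compile_calls = 0  # instrumentation for the example only


@lru_cache(maxsize=128)
def compiled(pattern):
    """Build the matcher once per distinct pattern."""
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)


def count_matches(pattern, lines):
    # Callers pass a pattern string; the construction cost stays hidden
    # behind this function, but is now paid only on the first call.
    matcher = compiled(pattern)
    return sum(1 for line in lines if matcher.search(line))


lines = ["error: disk full", "ok", "error: timeout"]
totals = [count_matches(r"^error:", lines) for _ in range(3)]
```

Three calls, one compilation. Without the cache the abstraction would quietly rebuild the matcher on every call.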
The more severe case of this in modern JITs is inline cache repatching. That’s super frequent.
This is the conventional wisdom, yes, but recently (for very long-running database queries at least) it seems to be less true.
In my case, I came up with an algorithm for doing error correction coding, but good performance can only be achieved by JIT'ing code. Trying to implement the algorithm without JIT results in many if/switch statements and memory lookups, which makes it much slower. Unfortunately, the JIT'd code can only be used once because a new function needs to be written every time the routine is called, which leads to a scenario like that in the article.
Otherwise, I do think this is somewhat niche, but there may be some interesting applications if the security of write+execute memory is not a concern.
This kind of technique was (is?) fairly common in the demoscene. Often it's just modifying constants in existing code, but specializing (AFAIK usually by block concatenation) isn't unheard of either.
(By the way, at least on x86, it might pay off to align inner-loop branch targets to 16 bytes to avoid penalties.)
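The alignment arithmetic is simple enough to sketch in Python. The function names are made up, the code buffer is assumed to start at a 16-byte-aligned address, and single-byte NOPs (0x90 on x86) are used for padding where a real code generator would more likely emit multi-byte NOPs.

```python
def padding_for_alignment(offset, alignment=16):
    """Bytes needed so that offset + padding is a multiple of alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return -offset % alignment


def align_loop_head(code, alignment=16):
    """Pad `code` with x86 NOPs so the next emitted byte is aligned.

    Assumes the buffer itself starts at an aligned address.
    """
    return code + b"\x90" * padding_for_alignment(len(code), alignment)


prologue = bytes(5)                 # 5 placeholder bytes before the loop
padded = align_loop_head(prologue)  # loop head now starts at offset 16
```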
Sometimes it's simpler to have all code be compiled rather than interpreted - one representation for everything. For example until fairly recently V8 always compiled everything - no interpreter.
But yeah, you have to check the CPU you're running on when doing these tricks unfortunately, as results vary greatly across micro-architectures.
Interestingly, there's a more efficient CLFLUSHOPT instruction on more recent processors, which often seems to be quite effective for this task.