Single-use JIT Performance on x86 Processors

(github.com)

92 points · Nyan · 5y ago · 25 comments

MaxBarraclough · 5y ago · 14 in thread

Looks fun, but impractical. Are there real uses for this kind of thing, on modern architectures?

Related question: has anyone tried to create a high-level language for doing this kind of madness?

vardump · 5y ago

There are practical applications for this.

Perhaps most importantly for regexp compilation, as it is a fairly common use case [1].

Some other examples I can think of: JIT compiling an eval statement, constructing a filter for a database scan, a CPU-based pixel shader, etc.

A lot of code is JIT compiled, but executed only once. So this was a pretty interesting article with practical performance implications in some scenarios.

[1]: See for example https://www.pcre.org/original/doc/html/pcrejit.html

Nyan (OP) · 5y ago

Thanks for the info. I'm not particularly familiar with common JIT applications, but I suspect that this use-case is actually more niche than one may think.

The problem is that the example presented requires a memory page with write + execute permissions (at the same time). I suspect many JITs don't do this for security reasons (and to deal with OSes which don't allow it), as it may make it easier for an attacker to gain arbitrary code execution. It's likely that many JITs toggle between write and execute permissions, rather than have both enabled at the same time. Whilst this reduces attack surface, changing permissions on allocated memory requires syscalls, which are quite expensive in terms of performance.

The scenario presented in the article avoids the impact of syscalls, to maximize performance, leaving only the impact caused by the processor itself. If a JIT isn't overly concerned with this type of security, using write+execute memory could be a way to avoid syscall overhead. On the other hand, if a JIT does toggle permissions, the syscall overhead is likely much more significant than overheads caused by the processor (although the techniques shown might still help depending on how the JIT engine works).
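To make the permission-toggling cost concrete, here is a minimal sketch of the write-then-seal pattern on a POSIX system. The helper name `jit_seal` is hypothetical and not taken from the article; the point is only that each freshly generated routine pays for at least one `mprotect` call on top of the `mmap`.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper, not from the article: copies finished machine code
   into fresh pages, then trades write permission for execute permission.
   Returns the executable address, or NULL on failure. */
void *jit_seal(const unsigned char *code, size_t len)
{
    assert(code != NULL && len > 0);

    /* Step 1: readable and writable, but not executable. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    memcpy(buf, code, len);

    /* Step 2: one extra syscall per generated routine. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf; /* release later with munmap(buf, len) */
}
```

On x86-64, passing the six bytes `B8 2A 00 00 00 C3` (`mov eax, 42; ret`) and casting the result to `int (*)(void)` would give a callable function. A single-use routine that is rewritten on every call would repeat the whole sequence each time, which is the overhead a write+execute page avoids.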

cat199 · 5y ago

> many JITs don't do this for security reasons

That would be nice, but historically it's been more like:

- all JITs do this

- OpenBSD creates W^X

- open source JITs and other OS's start to incorporate W^X

- things are now either W^X compatible or haven't been ported to a W^X OS yet.

vardump · 5y ago

Yeah, the security implications are obvious, R+W+X should not be used with untrusted inputs.

Not that I'd recommend this, but alternatively you could map the exact same memory twice, once with R+X and once with R+W. The attacker would then need to figure out the writable address. Unfortunately, there are probably a lot of ways to accidentally leak this information to the attacker...
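A rough sketch of the double mapping, assuming Linux and glibc's `memfd_create`; the struct and function names are made up for illustration. Both views are backed by the same pages, so a byte written through `rw` is immediately visible through `rx`.

```c
#define _GNU_SOURCE /* memfd_create is a Linux/glibc extension */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical pair of views onto the same physical pages. */
struct dual_map {
    unsigned char *rw; /* write generated code through this view */
    void *rx;          /* execute through this view */
    size_t len;
};

/* Returns 0 on success, -1 on failure. */
int dual_map_create(struct dual_map *m, size_t len)
{
    assert(m != NULL && len > 0);

    int fd = memfd_create("jit-code", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *rx = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd); /* the mappings keep the memory alive */

    if (rw == MAP_FAILED || rx == MAP_FAILED) {
        if (rw != MAP_FAILED) munmap(rw, len);
        if (rx != MAP_FAILED) munmap(rx, len);
        return -1;
    }
    m->rw = rw;
    m->rx = rx;
    m->len = len;
    return 0;
}
```

No page is ever writable and executable through the same address, and no `mprotect` call is needed per routine. The protection rests entirely on the `rw` address staying secret.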

There are still plenty of use cases where inputs can be trusted.

MaxBarraclough · 5y ago

Regex is a good example of where JIT can be a good idea for performance, but is it really 'single use'? A compiled regex object would presumably obey the lifetime rules of the host language. It might live as long as the host process, and be reused many times.

I wasn't very clear but I was really thinking of 'self-modifying' code, in contrast to conventional JIT as with regex (once generated, the binary is presumably never modified).

chrisseaton · 5y ago

Self-modifying code is sometimes used to update an inline cache for a method dispatch while the program is running, rather than having to recompile and produce new machine code just to update the cache.

vardump · 5y ago

> Regex is a good example of where JIT can be a good idea for performance, but is it really 'single use'?

Ideally not, but take a look at real life code bases... Just like with SQL prepared statements.

This kind of waste happens often due to reduced visibility. Abstraction layers are good at hiding details like construction cost. Repeated construction happens accidentally pretty easily.
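As an illustration of how this happens, here is a sketch using POSIX `regex.h`. It is not a JIT, only a stand-in for any "compile, then run" API; the function names are invented. The first helper looks harmless at the call site but recompiles the pattern on every call, while the second compiles once and reuses the result.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hidden cost: the pattern is recompiled on every call. */
int matches_once(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int hit = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}

/* Compile once, then reuse the compiled object for every input. */
int count_matches(const char *pattern, const char *const *texts, size_t n)
{
    assert(texts != NULL || n == 0);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (regexec(&re, texts[i], 0, NULL, 0) == 0);
    regfree(&re);
    return hits;
}
```

Calling `matches_once` in a loop gives the "compiled, executed once, thrown away" pattern even though nobody intended it.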

pizlonator · 5y ago

Usually if the code is really single use then the optimal solution is to interpret before JITing.

The more severe case of this in modern JITs is inline cache repatching. That’s super frequent.

Jweb_Guru · 5y ago

> Usually if the code is really single use then the optimal solution is to interpret before JITing.

This is the conventional wisdom, yes, but recently (for very long-running database queries at least) it seems to be less true.

Nyan (OP) · 5y ago

> Are there real uses for this kind of thing, on modern architectures?

For me, I came up with an algorithm for doing error correction coding, but good performance can only be achieved by JIT'ing code. Trying to implement the algorithm without JIT results in many if/switch statements and memory lookups, which makes it much slower. Unfortunately, the JIT'd code can only be used once because a new function needs to be written every time the routine is called, which leads to a scenario like that in the article.

Otherwise, I do think this is somewhat niche, but there may be some interesting applications if the security of write+execute memory is not a concern.

vardump · 5y ago

Yeah, I was actually thinking about this particular case for code specialization. In code where the inner loop is very branchy, you can get considerable gains from being able to remove unnecessary branches (and code).

This kind of technique was (is?) fairly common in the demoscene. Often it's just modifying constants in existing code, but specializing (AFAIK usually by block concatenation) isn't unheard of either.

(By the way, at least on x86, it might pay off to watch out for things like inner loop(s) branch target 16-byte alignment to avoid penalties.)
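For what it's worth, the alignment part is cheap to do in an emitter. A minimal sketch, assuming the code buffer itself starts on a 16-byte boundary (true for anything from `mmap`) and using a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical emitter helper: pads with single-byte NOPs (0x90) so that
   the next emitted instruction, e.g. a loop head, starts at a 16-byte
   boundary. Returns the new write offset. */
size_t emit_align16(unsigned char *buf, size_t off, size_t cap)
{
    size_t aligned = (off + 15) & ~(size_t)15;
    assert(buf != NULL && aligned <= cap);
    memset(buf + off, 0x90, aligned - off);
    return aligned;
}
```

A real emitter would probably prefer the multi-byte NOP encodings over runs of `0x90` when the padding is executed, and whether the alignment pays off at all depends on the micro-architecture.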

Nyan (OP) · 5y ago

Didn't know the practicality of this in the demoscene - thanks for the info!

saagarjha · 5y ago

Demos are generally more focused on size savings than performance ;)

chrisseaton · 5y ago

> Are there real uses for this kind of thing, on modern architectures?

Sometimes it's simpler to have all code be compiled rather than interpreted - one representation for everything. For example until fairly recently V8 always compiled everything - no interpreter.

andikleen2 · 5y ago · 2 in thread

You have to be a bit careful with the CLFLUSH method. I tried to use it in a widely used program years ago because Intel recommended it, but we found that it just hangs the CPU on some older VIA/Centaur CPUs. Presumably that's fixed these days, but the old CPUs are likely still around.

Nyan (OP) · 5y ago

Thanks for the info! Unfortunately I don't have access to any VIA/Centaur CPUs, so couldn't test on those (though test results welcome if anyone is willing/able to!).

But yeah, you have to check the CPU you're running on when doing these tricks unfortunately, as results vary greatly across micro-architectures.

Interestingly, there's a more optimal CLFLUSHOPT instruction on more recent processors, which often seems to be quite effective for this task.
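A sketch of what the run-time check could look like with GCC or Clang on x86-64. The function names are hypothetical, the 64-byte line size is an assumption (CPUID leaf 1 reports the real value), and this does nothing about the VIA/Centaur hang mentioned above.

```c
#include <assert.h>
#include <cpuid.h>     /* GCC/Clang wrapper around the CPUID instruction */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size, see note above */

/* CPUID.(EAX=7,ECX=0):EBX bit 23 advertises CLFLUSHOPT. */
int has_clflushopt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 23) & 1u);
}

/* Only called after the CPUID check, so the instruction is safe to use. */
__attribute__((target("clflushopt")))
static void flush_opt(uintptr_t p, uintptr_t end)
{
    for (; p < end; p += CACHE_LINE)
        _mm_clflushopt((void *)p);
    _mm_sfence(); /* CLFLUSHOPT is weakly ordered, so fence afterwards */
}

static void flush_legacy(uintptr_t p, uintptr_t end)
{
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
}

/* Flushes every cache line that overlaps [start, start + len). */
void flush_range(const void *start, size_t len)
{
    assert(start != NULL);
    uintptr_t p = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + len;
    if (has_clflushopt())
        flush_opt(p, end);
    else
        flush_legacy(p, end);
}
```

Flushing does not change the data, so `flush_range(code, code_len)` can be dropped in right after the code bytes are written and then measured per micro-architecture.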

ComputerGuru · 5y ago

Funny. My line of thought was “it’s probably still not fixed, but those old cpus won’t be running new code anytime soon.”
