Bringing Faster Exceptions to Rust

(purplesyringa.moe)

103 points · stpn · 1y ago · 34 comments

21 comments · 5 top-level

codeflo · 1y ago · 9 in thread

> Returning to an alternate address shouldn’t be significantly more expensive than returning to the default address, so this has to be cheap.

Modern CPUs add complications to arguments like this. Branches stall the execution pipeline, so branch prediction was invented to keep the pipeline flowing. Return instructions are perfectly predicted, which makes them literally free. At the very least, any alternate return scheme has to pay for a full misprediction. That can be expensive.

tsimionescu · 1y ago

This doesn't really make sense. The branch predictor relies on a history of previous executions of that branch, or on explicit hints, to decide if a branch will be taken or not. Based on this prediction, the speculative execution hardware then sees the jump (return/panic) and loads the code from that address into the icache. There is 0 difference between `if (condition) jump $panic_recover_address` and `if (condition) jump $function_return_address` in terms of how easy or hard it is to predict or speculatively load based on the prediction.

beng-nl · 1y ago

Not that you’re wrong, but returns aren’t predicted using the branch predictor; they use the RSB (return stack buffer), which stores the return addresses of the current call stack. The x86 optimization manual (starting quite a few years ago) explicitly mentions that calls and rets should match for best performance.

codeflo · 1y ago

On x86, ret and call are explicit instructions. Ret always predicts the address of the last call, which is (usually) 100% accurate. Your example of `if (condition) jump $panic_recover_address` contains two branches, either of which can be mispredicted.

amluto · 1y ago

It may well be possible to do an alternate return, skipping a frame, that is, itself, very reliably predicted correctly. But it still looks like:

    CALL
    jump-without-RET

and the calls and the rets don’t line up. This defeats the return prediction on the next return.

jnordwick · 1y ago

I thought the Branch Target Predictor on x64 was global, not local, and it has to kick in before decode, so even direct branches can be mispredicted. Branch prediction has two parts: the conditional predictor and the target predictor. The conditional predictor is actually per 64-byte instruction block (so if you have a few consecutive branches, they share branch predictor entries and can step on each other). The target predictor uses a global history and needs to happen very early to keep the front end fed.

IshKebab · 1y ago

Modern CPUs have a "return address stack" which basically mirrors the real stack and allows them to perfectly predict returns (for normal code anyway).

First explanation I found on Google. Haven't read it:

https://one2bla.me/cs6290/lesson4/return-address-stack.html

gpderetta · 1y ago

Returns, conditional jumps, and indirect jumps each have fairly different prediction profiles. In particular, paired call+ret are predicted perfectly, at least as long as the dedicated internal stack doesn't overflow; indirect jumps are, as a general rule, less predictable than conditional jumps, as more state (the jump target) needs to be stored.

kaba0 · 1y ago

But in the given context, returning a Result type will almost by necessity involve a conditional at the call site, so for an apples-to-apples comparison that is what should be compared, not a linear return and nothing else.
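
To make that comparison concrete, here is a minimal sketch of what the Result path looks like at a call site. The names `parse_positive`, `double_with_question_mark`, and `double_with_match` are made up for illustration. The only point is that `?` expands to roughly a `match`, so each caller on the error path carries a conditional branch on the Ok/Err discriminant.

```rust
// Hypothetical fallible function; the names here are illustrative only.
fn parse_positive(input: &str) -> Result<u32, String> {
    match input.trim().parse::<u32>() {
        Ok(0) => Err("zero is not positive".to_string()),
        Ok(n) => Ok(n),
        Err(e) => Err(format!("not a number: {e}")),
    }
}

// Using `?`: concise, but the caller still tests Ok vs Err.
fn double_with_question_mark(input: &str) -> Result<u32, String> {
    let n = parse_positive(input)?;
    Ok(n * 2)
}

// Roughly what `?` expands to (ignoring the `From` conversion of the error).
fn double_with_match(input: &str) -> Result<u32, String> {
    let n = match parse_positive(input) {
        Ok(n) => n,
        Err(e) => return Err(e),
    };
    Ok(n * 2)
}

fn main() {
    println!("{:?}", double_with_question_mark("21"));
    println!("{:?}", double_with_match("0"));
}
```

Whether that branch costs anything measurable depends on how well it is predicted, which is what the rest of this subthread is arguing about.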

meindnoch · 1y ago

>Return instructions are perfectly predicted

As long as you don't overwrite the return address on the stack.

Joker_vD · 1y ago · 3 in thread

Well, unwinding can be as simple (and as fast) as

    MOV  SP, [installed_handler]
    JMP  [installed_handler+WORD]

but it only works if you don't need to run the defers/Drop's/destructors/etc. for stuff that's on the stack between the current frame and the handler's frame. Which you do, most of the time.
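
As a rough illustration of why a bare jump is not enough in Rust, the sketch below counts how many `Drop` implementations run while a panic unwinds through two frames. `Guard`, `inner`, `outer`, and `run` are hypothetical names, and the code assumes the default `panic = "unwind"` setting; with `panic = "abort"` the process would terminate instead.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many guards have been dropped so far.
static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical resource that needs cleanup when its frame goes away.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn inner() {
    let _g = Guard;
    panic!("boom");
}

fn outer() {
    let _g = Guard;
    inner();
}

// Runs `outer` under a handler and reports how many guards were dropped
// while the panic travelled from `inner` up to `catch_unwind`.
fn run() -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    let result = panic::catch_unwind(outer);
    assert!(result.is_err());
    DROPS.load(Ordering::SeqCst) - before
}

fn main() {
    println!("guards dropped during unwind: {}", run());
}
```

Both guards are dropped before the handler regains control, which is exactly the per-frame work that the two-instruction version skips.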

gpderetta · 1y ago

> it only works if you don't need to run the defers/Drop's/destructors/etc

Indeed. And the per-frame cleanup is also language-agnostic, which adds overhead; it also must support both sjlj and dwarf frames[1]; and it is done in two phases: destructors are only run if an actual catch is found, so an unhandled exception doesn't run destructors, which preserves state in the core file. This two-phase unwinding again slows things down.

Another big bottleneck that might not be captured in OP's test is that the unwinder has to take (or used to, things got better recently) a global lock to prevent races with dlclose, which greatly limits the scalability of exception handling.

Still very nice improvements from OP.

[1] although I'm not sure you can mix them in the same program or it is a platform-wide decision.

Joker_vD · 1y ago

> the unwinder has to take (or used to, things got better recently) a global lock to prevent races with dlclose

If someone from another thread decided to unload a library whose code is still being executed in this thread then this thread would normally crash anyhow, and do so irrecoverably, right?

Yoric · 1y ago

Also, I don't think it's possible to perform any kind of backwards-compatible static analysis that would tell you when it's safe to just JMP. Unless you have full information, perhaps (at the very least everything already specialized).

mmaniac · 1y ago · 2 in thread

The most interesting part of this article for me is at the beginning.

> Now imagine that calls could specify alternate return points, letting the callee decide the statement to return to:

  // Dreamed-up syntax
  fn f() {
      g() alternate |x| {
          dbg!(x); // x = 123
      };
  }

  fn g() -> () alternate i32 {
      return_alternate 123;
  }

This sort of nonlocal control flow immediately calls to mind an implementation in terms of continuation passing style, where the return point is given as a function which is tail called. Nonlocal returns and multiple returns are easy to implement in this style.

Does there exist a language where some function

  fn foo() -> T throws U

is syntactic sugar for something more like?

  fn foo(ifOk: fn(T) -> !, ifExcept: fn(U) -> !) -> !
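
For readers who want to play with the idea in today's Rust, here is a small sketch of that desugaring. It is only an approximation: Rust does not guarantee tail-call elimination, and diverging `-> !` continuations are awkward to use, so the continuations below return a generic `R` instead. `parse_cps` and `describe` are hypothetical names.

```rust
// Hypothetical CPS-style function: instead of returning Result<u32, String>,
// it hands control to exactly one of the two continuations.
fn parse_cps<R>(
    input: &str,
    if_ok: impl FnOnce(u32) -> R,
    if_except: impl FnOnce(String) -> R,
) -> R {
    match input.parse::<u32>() {
        Ok(n) => if_ok(n),
        Err(e) => if_except(e.to_string()),
    }
}

// The caller supplies both "return points" up front.
fn describe(input: &str) -> String {
    parse_cps(
        input,
        |n| format!("ok: {n}"),
        |e| format!("error: {e}"),
    )
}

fn main() {
    println!("{}", describe("123"));
    println!("{}", describe("abc"));
}
```

Here the callee still returns normally after invoking a continuation, so this models the control flow rather than the cost of a true alternate return.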

compressedgas · 1y ago

See Multi-return Function Call by Olin Shivers and David Fisher https://www.khoury.northeastern.edu/home/shivers/papers/mrlc...

gpderetta · 1y ago

Scheme, I guess?

weinzierl · 1y ago · 1 in thread

Great analysis of unwinding overhead in Rust. The framing of exceptions as "alternate returns" is enlightening - they should be cheap in theory, which makes the current ~2.3μs overhead particularly interesting to dissect. The optimization journey from removing unnecessary type erasure to using thread-locals is well explained. While the 4.3x speedup is impressive, I think the bigger takeaway isn't about replacing Result with panics, but rather identifying where non-local control flow genuinely makes sense. Looking forward to the deep dive into unwinder implementations.

vlovich123 · 1y ago

It should be cheaper, but the expensive part that isn’t discussed has to be snapshotting the stack, which feels expensive, and that’s what the panic information is supposed to preserve. That’s why they got an extra 5x performance improvement but not 10x or 100x, and there’s no benchmark of how frequently you could snapshot the stack. Indeed, because it’s a simple microbenchmark, we don’t see how this improvement holds up when the stacks are ~20 frames deep: do the same hotspots show up, or does capturing the stack start to dominate?

And one thing that could be needed is the ability to throw within a catch, and if you do that you can corrupt the TLS (i.e. break memory safety) unless you’re careful and follow the guidelines. In other words, you personally can have written 100% safe code that is not memory-sound unless you follow the high-level rules. This is closer to a C API than anything that would be “allowed” as a traditional Rust API, where the guarantee of a safe API is that no unsoundness can happen no matter how you hold it. That’s a lot of safety to sacrifice for something tried and true. Use it if you really need it, but I think following the advice that error states should be rare in the first place is probably better: return an error for any fallible operation and panic on unwind. Trying to catch unwind panics is a landmine approach to getting things to work, and I know from experience, having tried that approach. It doesn’t play well with things like async either. And then you have to bubble them up across threads?

This approach would fail there. These aren’t unfixable design flaws, thankfully. You’d need a sum type whose underlying memory can be detached to the heap, and somehow guarantee it’s always detached safely and soundly before being overwritten (e.g. having a counter in the TLS header that is copied to the struct being unwound to guarantee that the TLS values you think you are accessing have indeed not been overwritten, or having a TLS pointer to the stack value containing the unwound value somehow be written through to detach whenever someone doesn’t call the right catch mechanism).

So I think this work is super valuable and the ideas should be refined and mainlined, because inefficiencies like this aren’t great. But simultaneously, no one should be writing error handling by catching unwraps except in very, very limited situations that you can clearly articulate as necessary for the goal you are trying to achieve. For example: I spawned a background thread, but if the computation fails I can report the failure gracefully to the human operator of the machine in a non-debug context. But in those cases you want a forked supervisor process that is responsible only for process death, rather than compiling it all into one binary.

I wish Rust made that part easier, i.e. start the process in a different mode but then switch to panic so that you keep the performance gains (this crate is built using optimized panic with unwind, but this other crate uses a different unwind mode, and you could spawn the other crate through a guaranteed fork to fix the potential unsoundness, with the panic information serialized across the wire to the parent process via a private pipe and unwound that way). That would provide an easier API that indicates more clearly the delegation of responsibility you should have when catching unwinds and how to structure your code operationally.

However, it can’t be the only way, because you might have something like an HTTP framework, and you want to “guarantee” that you deliver an HTTP response to a socket and log metrics before crashing, and you want the next request to be handled immediately with minimal additional CPU work. You can’t just do it across a fork barrier, because that’s an expensive thing to dispatch to a new thread in the happy path, plus you have a thread pool you need to keep healthy and alive in tokio: you can’t fork or spawn a thread on every new inbound connection. So there are cases where you want to catch unwind, which is to have consistent behavior in a framework even if the user’s code or our code has a bug (but in those cases you would probably use the panic method during debug builds to notice failures like that before your release to production, where you prevent bugs from manifesting as a mechanism/gadget for attackers to DoS your service).
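
The framework case can be sketched with the standard library alone. This is a simplified illustration, not how any particular HTTP framework does it: `Response` and `handle_request` are hypothetical, and a real server would also log the panic and record metrics before answering.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical response type for an imaginary framework.
#[derive(Debug, PartialEq)]
struct Response {
    status: u16,
    body: String,
}

// Runs user code and converts a panic into a 500 so the worker stays alive.
fn handle_request<F>(handler: F) -> Response
where
    F: FnOnce() -> String,
{
    // AssertUnwindSafe is a promise made by the caller, not a proof: state
    // the handler mutated before panicking may be left inconsistent.
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(body) => Response { status: 200, body },
        Err(_) => Response {
            status: 500,
            body: "internal error".to_string(),
        },
    }
}

fn main() {
    let ok = handle_request(|| "hello".to_string());
    let failed = handle_request(|| -> String { panic!("bug in user code") });
    println!("{ok:?}");
    println!("{failed:?}");
}
```

The worker thread survives the buggy handler and can pick up the next request, which is the "consistent behavior" property described above.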

lesuorac · 1y ago · 1 in thread

This isn't benchmarked against returning a Result, right?

Like, wouldn't bypassing any unwinding be faster than improving unwinding? You already seem to have control over the code, as the thrown and caught exception have to be the same, so you might as well just write the code as a `Result<T, Exception>` in the first place, no?

ctz · 1y ago

The topic is how to make non-local returns faster. I would suggest that normal returns are massively faster, and not returning at all is faster still (zero instructions needed!) -- but neither of these are non-local returns.
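
For reference, this is roughly what a typed non-local return looks like with the standard library's unwinding API, which is the baseline the article sets out to speed up. It is a sketch using only `std::panic`; `ParseFailure`, `deep`, and `try_deep` are hypothetical names, and it does not use the article's own crate.

```rust
use std::panic;

// Hypothetical error type used as the "exception" payload.
#[derive(Debug, PartialEq)]
struct ParseFailure(String);

// Nested helper that "throws" from the bottom instead of returning Result.
fn deep(depth: u32, fail: bool) -> u32 {
    if depth == 0 {
        if fail {
            panic::panic_any(ParseFailure("bottom reached".to_string()));
        }
        return 0;
    }
    deep(depth - 1, fail) + 1
}

// Catches the payload at the top and recovers the typed value.
fn try_deep(depth: u32, fail: bool) -> Result<u32, ParseFailure> {
    match panic::catch_unwind(move || deep(depth, fail)) {
        Ok(value) => Ok(value),
        Err(payload) => match payload.downcast::<ParseFailure>() {
            Ok(failure) => Err(*failure),
            // Not our payload: keep unwinding with the original one.
            Err(other) => panic::resume_unwind(other),
        },
    }
}

fn main() {
    println!("{:?}", try_deep(5, false));
    println!("{:?}", try_deep(5, true));
}
```

The intermediate frames contain no error-handling code at all, which is the trade being discussed: nothing to check on the happy path, paid for by a much slower error path than a `Result` return.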
