Hash-based bisect debugging in compilers and runtimes (opens in new tab)

(research.swtch.com)

254 pointsrsc1y ago45 comments

45 comments

36 comments · 12 top-level

Scaevolus1y ago· 9 in thread

Aside: bisecting flakes doesn't have to involve repeated runs. You can reformulate bisection as an information probing operation, expanding the scope to support noisy benchmarks or low-probability flakes. Bayesian inference narrows down the probable range of the failure for each new observation, and you can choose new probes to maximize information gain-- or even run them in parallel to minimize the total time.

You do have to provide flake rate probability to do the probability estimates, but even roughly correct rates work fine. Running bisects assuming a 5% chance of getting a false positive or negative barely adds more steps and greatly improves robustness.

The math is pretty simple too-- my old prototype might still be in Google's monorepo; I should reimplement it for the open source world.

rscOP1y ago

Indeed. Something like this is what I intend to do when I get some time.

eru1y ago

Could you briefly sketch out the math (or point to a sketch) so that other people can pick it up? I'm quite interested!

(I suspect you could get pretty far with a Monte Carlo simulation, and that would let you bypass most of the math anyway.)

jhfdbkofdchk1y ago

The Probabilistic Bisection Algorithm by Horstein, original paper is "Sequential transmission using noiseless feedback," IEEE Trans. Inform. Theory, 9(3):136–143. https://ieeexplore.ieee.org/document/1057832

Here's a nice presentation on an analysis of the method, references start on slide 47

https://people.orie.cornell.edu/pfrazier/Presentations/2014....

Paraphrasing the theorem from Jedynak, Frazier, Sznitman (2011):

Suppose [flake probability] is constant, known, and bounded away from 1/2, and we use the entropy loss function. ... The policy that chooses [next test point] at the median of [current posterior distribution] is optimal.

And some python code that implements a version of the analysis

https://github.com/choderalab/thresholds/blob/master/thresho...

1 more reply

ajb1y ago

Here is a doc explaining a Bayesian search algorithm; not sure if it's precisely what GP has in mind.

https://github.com/Ealdwulf/BBChop/blob/master/BBChop/doc/Ba...

(Edit- had wrong link originally)

eru1y ago

Thanks!

dwattttt1y ago

For Bayesian Inference, the example in the Wikipedia article is good (who doesn't like cookies?): https://en.m.wikipedia.org/wiki/Bayesian_inference#Examples

When gathering more evidence, you'd use your new belief about which cookie bowl Fred has as P(H1)=0.6 and P(H2)=0.4

eru1y ago

That just explains Bayesian inference in general, which is useful, but not what I'm after.

I was specifically interested in the application to binary search / bisection in the presence of flaky tests.

1 more reply

efdb1y ago

millipede1y ago· 5 in thread

Are there any non-compiler use cases for this technique?

rscOP1y ago

Anyone working on libraries that are used by large programs can use them, like in the sort and timer cases described toward the end of the paper. When you work on libraries used by other larger programs you inevitably break them accidentally. This technique pinpoints the exact context of the breakage.

Twirrim1y ago

All sorts of stuff, I git bisect reasonably regularly.

You can even do things like automatically git bisect the linux kernel, with a little care. I wrote this up a few years ago https://paulgraydon.co.uk/posts/2020-12-27-automated-kernel-....

I can't talk about the actual case that lead me to ever need to git bisect a kernel in the first place, but at the same time I also learned how to make a minimalist one-C-file initrd, because what I needed to catch was visible under a /sys or /dev mount, or something like that. I later realised it's possible to do the same thing with Go in slightly easier syntax, if you cajole it in to producing a statically compiled binary!

IshKebab1y ago

He means the runtime bisection that this article is about, not `git bisect`.

saagarjha1y ago

You can use this for any codebase of sufficient complexity where you want to identify which change is causing the effect you're observing.

IshKebab1y ago

Not really, it's only suitable where your code is processing some other large input that you don't understand, and when your changes are usually easily toggleable at runtime and can be applied to only part of the input.

2 more replies

kragen1y ago· 3 in thread

this is a wonderful post!

the algorithms for using binary search to efficiently reduce a set satisfying some predicate to a locally minimal satisfying subset* are new to me (though cox says zeller published a slightly buggy version in 01999! and meta's cinder a correct one in 02021), and seem brilliant; their applications are not limited to debugging. i wonder how it relates to hypothesis's test-case reduction algorithm; can one of them be considered an application of the other?

also, this idea of binary-search debugging of the program's call tree rather than your revision history (or program input, or a set of data instances) is also a brilliant one. and although they published it a decade ago, i hadn't heard about it until now

the examples of asynctimerchan=1, changing optimization settings, and changing sort algorithms have in common that in some sense they are behavior-preserving, so you can toggle them on and off at will during execution without breaking anything. i wonder how to apply this call-tree debugging if the change you're trying to narrow down is a change that has to be consistent throughout the program's execution. for example, suppose some code using your hash tables breaks when you switch to a new hash function, maybe because it inadvertently depended on enumeration order. if you change the hash function partway through the program, you won't be able to find things in your hash tables after that. you could change the algorithm per table, of course, and narrow it down to a particular table, but that won't give you the particular line of code

i need to think a bit more about this issue of 'hashing a list of program counters'. you could of course number the sequence of all subroutine invocations during a (deterministic! single-threaded!) execution, as gas does for macro invocations, and binary-search that dense numbering. (this is a variant of the technique carry_bit is calling 'optimization fuel', but one that requires support from a compiler or debugger.) but, since you're toggling options on and off that will change the number of subroutine calls, the numbering won't be stable; so this will tend to only reliably find single-culprit failures

you could possibly get a stable-enough numbering using pathnames like /3/5/1, meaning the the 1st subroutine called from the 5th subroutine called from the 3rd subroutine called from main(). that seems like it might in some sense be stabler than hashing the entire list of return addresses, and it would certainly permit a lower-overhead implementation using a debugger and breakpoints rather than a check in every leaf call. plausibly i'm overlooking a flaw in this form of 'sequential numbering'? does the hashed list get truncated at some point for stability?

often when you have a change that is in some sense behavior-preserving, which is to say, you have two ways to do the same thing, you can use generative testing systems like hypothesis to detect bugs in either of them: process the same input through both paths and verify that the results are equivalent in the appropriate sense. this doesn't require the instrumentation infrastructure russ is using here, but it does depend on you being able to identify the relevant 'input', which can be very hard

in itself that doesn't help with the kinds of bugs he's talking about here, though: bugs where both the old and new code is 'equivalent' by your lights, but some other client code that calls it doesn't find it equivalent. this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code. aslr and hash-table seed randomization are doing this for us for some purposes, but unlike generative-testing frameworks, they provoke outages in production, don't do test-case minimization, and don't record failing cases to make bisection easy and prevent later regressions. and they don't do things like shuffling the input to a non-stable sort subroutine

binary-search debugging does indeed feel magical. scaevolus seems to be saying there's a bayesian generalization of it for nondeterministic bugs that are effectively random? you can of course run the test 5 (or 1000) times on each revision you're binary-searching over, but it feels like, if the number of revisions you're searching over is several thousand, you ought to be able to get some additional advantage out of running the test once on each of 5 (or 1000) revisions. can you solve this just by maximizing the expected shannon information of each test?

on a side note, it's pretty appalling that 30 years ago the plan9 group had `yesterday -d -n 7 anyfilename` to see what changed in the last week, thanks to their optical jukebox, while in the mainstream we still struggle with accidental file deletion and overwrites despite routinely carrying around terabytes in our pockets

on an even more marginally relevant note, earlier this week i was perusing the 7th edition unix kernel, in which the subroutine that switches stacks (the one with the well-known comment in 6th edition) is called swtch(). and tonight i just realized why russ cox uses that domain name

______

* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer

rscOP1y ago

One small correction: we were doing bisect over compiler optimizations a decade ago, but bisect over call trees only happened in the past year or so.

For the hash table case, deciding the function per-table as you suggest is probably good enough. In a large program there are going to be tons of hash tables, and if bisect gets you to "it's this specific hash table allocated by this call stack that matters", that's a huge win.

The timer change is much like the hash table one, in that we toggle the "timer kind" at creation time, but it's only later operations that actually change behavior. Still, identifying the specific timer and call stack that created it turned out to be good enough in all cases.

kragen1y ago

thank you very much for the correction and the clarifications—and for the excellent post

does the list of callsite return counters being hashed get truncated at some point for stability? i'd think that hashing the call stack f→g→j→k→m→n→o→p to a hash unrelated to f→g→h→j→k→m→n→o→p would tend to interfere with the 'forced' set, but maybe this is only being used for things where the behavior is so closely equivalent that this problem doesn't arise

1 more reply

derdi1y ago

> this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code

There is a compiler fuzzing method called "equivalence modulo inputs" that does something like this. Take a fuzzed program with fixed inputs and profile it to find parts that are never executed. Change only code in those never executed parts. If the original and modified programs behave differently, you found a compiler bug: The compiler got confused by code that it can see but that (unknown to the compiler) cannot dynamically influence the program's behavior on the given inputs.

Like so many things in fuzzing, this sounds silly but actually finds bugs in practice.

camgunz1y ago· 2 in thread

This is pretty close to what I built for maintaining demo compatibility in Doom engines. Basically it runs a demo and dumps a save game to a file every frame. As soon as there's a divergence it says what the difference is (monster 17 is at (4, 22); should be (4, 21)) and bails. Not a ton of difference between that and diffing the stack.

https://github.com/camgunz/democomp

rscOP1y ago

What you are describing sounds like comparing traces of two supposed-to-be-identical programs to see where they first diverge. That's not what this is describing. There is no trace at all, nor any decision about what to trace or what not to trace.

There is only a decision about whether to use the old or new implementation based on the hash of the call stack at that moment. Then you binary search on hash values to identify the exact call stack hash for which the new implementation causes a problem. Then you run the program once more with the instructions "print the stack with this hash".

IshKebab1y ago

What you implemented is just checking output against a reference. That's an extremely common testing method.

This is significantly different because it actually changes the program that is being run. It's not common at all - probably because there aren't too many situations where it applies. You need to be processing some large input that you don't understand, and commonly change bits of your code in ways that you can turn on and off dynamically during a single run.

tekknolagi1y ago· 1 in thread

This is very cool and seems similar to what we do/did in Cinder: https://bernsteinbear.com/blog/cinder-jit-bisect/

EDIT: Oops, this tool is mentioned in the post already (but the post is not, so here it is if you want to read about it). Neat!

rscOP1y ago

Thanks for that link. I've changed the link beneath the Cinder mention to that page instead of the GitHub page.

carry_bit1y ago· 1 in thread

A closely related technique for debugging optimization passes is that of "optimization fuel". Each rewrite decreases the fuel by one, and when the fuel is gone no more rewrites happen. You can then perform binary search on the optimization fuel to find a specific rewrite instance that breaks things.

rscOP1y ago

Yes, I believe LLVM has a flag for "run only the first N optimization passes" that gets used with binary search this way.

A global optimization fuel that worked at finer granularity would be even more precise but you'd have to have the compiler run single-threaded to make sure the numbering always matches. At least in the Go compiler, we compile different functions in different threads, so that fouls up any global numbering.

saagarjha1y ago· 1 in thread

I'm running one of these now, interestingly enough. Apparently something's broken in OpenJDK if you build with the new Xcode. So I'm bisecting on all the files (choosing either the old or new compiler) trying to see which one is breaking things.

saagarjha1y ago

For those following along, I ended up doing four levels of bisects to find the actual problem. First the bisect I described above to identify the problematic file–stackMapTable.cpp. Then another on the functions inside of the file with optimizations and without to find the function that was broken: StackMapReader::next(StackMapFrame, bool, unsigned short, unsigned short, JavaThread). With that I manually inspected the code generated, found that it was wrong, and then did a bisect on optimization passes to see which one broke it (ConstraintEliminationPass). Finally one last bisect to see the history of where this came from (seems to be fixed in a change Xcode hasn't pulled yet).

mirrorlake1y ago· 1 in thread

In the event that this is added to the standard library, I'm going to be really curious to see what a "hello world" project/example would look like.

I went so far as to find the commit where David Chase added for loopvar on Mar 6, 2023 (github: golang/go/commit/c20d959) to try to design my own hello world with x/tools/cmd/bisect, but I'm out of my depth.

The hash tree is a great visualization. I wouldn't have grasped the importance of the hash suffix until I saw the tree. Awesome stuff.

rscOP1y ago

If we add it to the standard library it will be for runtime uses (identify the call stack for which swapping in the new implementation or behavior breaks the test). It is pretty simple. You call 'godebug.New' in a global init to get a lookup keyed to a particular setting, and then you look at that setting at runtime. For example take a look at https://go.dev/src/crypto/x509/x509.go#L881 and look at the uses of x509sha1 a few lines later.

You just call x509sha1.Value() and make a decision about old or new behavior based on that. Bisect can then toggle that result based on the hash of the call stack to narrow down the exact path that leads to the critical decision.

You can ignore the IncNonDefault call a few lines later. That's about providing monitoring for whether any non-default (old) behaviors are being executed in a program, so that people can hook it up to something like prometheus and see whether they would be affected by changing to "always default behavior".

MatzeBraun1y ago· 1 in thread

LLVM can bisect down to individual transformation steps within a pass (for passes that implement the interface): https://llvm.org/docs/OptBisect.html

And there is a script bisecting functions when given assembly produced by working baseline and “bad” compiler to compare: https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...

tekknolagi1y ago

This is pretty cool. Thanks for the pointer.

Also, hi Matthias!

tabbott1y ago

This is so cool!

It reminds me a bit of one my favorite debugging techniques: Running an individual test with coverage reporting enabled, and then clicking around the HTML coverage report to see exactly what code path the test followed, without needing to use either print statements or a specialized debugger.

Very helpful for answering questions of the form "Why does this test not exercise the code I thought it did?".

drivebycomment1y ago

In my former life, I used to maintain a script that can be given two sets of objects files, one compiled with optimization, and one without, and the script will effectively do a binary search by choosing which object files you link and run the executable to determine success/fail. Each iteration is quick, since linking step is usually fast. This was useful when troubleshooting a big binary, since optimized build back then was often quite slow for a large executable.

chc41y ago

When I read https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/ and saw the hash bisection trick for the first time I was super impressed, it really does sound incredibly slick :) I imagine that's how first coming across normal git bisect must feel to new engineers.

j / k navigate · click thread line to collapse

45 comments

36 comments · 12 top-level

Scaevolus1y ago· 9 in thread

The math is pretty simple too-- my old prototype might still be in Google's monorepo; I should reimplement it for the open source world.

rscOP1y ago

Indeed. Something like this is what I intend to do when I get some time.

eru1y ago

Could you briefly sketch out the math (or point to a sketch) so that other people can pick it up? I'm quite interested!

(I suspect you could get pretty far with a Monte Carlo simulation, and that would let you bypass most of the math anyway.)

jhfdbkofdchk1y ago

Here's a nice presentation on an analysis of the method, references start on slide 47

https://people.orie.cornell.edu/pfrazier/Presentations/2014....

Paraphrasing the theorem from Jedynak, Frazier, Sznitman (2011):

And some python code that implements a version of the analysis

https://github.com/choderalab/thresholds/blob/master/thresho...

1 more reply

ajb1y ago

Here is a doc explaining a Bayesian search algorithm; not sure if it's precisely what GP has in mind.

https://github.com/Ealdwulf/BBChop/blob/master/BBChop/doc/Ba...

(Edit- had wrong link originally)

eru1y ago

Thanks!

dwattttt1y ago

For Bayesian Inference, the example in the Wikipedia article is good (who doesn't like cookies?): https://en.m.wikipedia.org/wiki/Bayesian_inference#Examples

When gathering more evidence, you'd use your new belief about which cookie bowl Fred has as P(H1)=0.6 and P(H2)=0.4

eru1y ago

That just explains Bayesian inference in general, which is useful, but not what I'm after.

I was specifically interested in the application to binary search / bisection in the presence of flaky tests.

1 more reply

efdb1y ago

millipede1y ago· 5 in thread

Are there any non-compiler use cases for this technique?

rscOP1y ago

Twirrim1y ago

All sorts of stuff, I git bisect reasonably regularly.

You can even do things like automatically git bisect the linux kernel, with a little care. I wrote this up a few years ago https://paulgraydon.co.uk/posts/2020-12-27-automated-kernel-....

IshKebab1y ago

He means the runtime bisection that this article is about, not `git bisect`.

saagarjha1y ago

You can use this for any codebase of sufficient complexity where you want to identify which change is causing the effect you're observing.

IshKebab1y ago

2 more replies

kragen1y ago· 3 in thread

this is a wonderful post!

______

* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer

rscOP1y ago

One small correction: we were doing bisect over compiler optimizations a decade ago, but bisect over call trees only happened in the past year or so.

kragen1y ago

thank you very much for the correction and the clarifications—and for the excellent post

1 more reply

derdi1y ago

> this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code

Like so many things in fuzzing, this sounds silly but actually finds bugs in practice.

camgunz1y ago· 2 in thread

https://github.com/camgunz/democomp

rscOP1y ago

IshKebab1y ago

What you implemented is just checking output against a reference. That's an extremely common testing method.

tekknolagi1y ago· 1 in thread

This is very cool and seems similar to what we do/did in Cinder: https://bernsteinbear.com/blog/cinder-jit-bisect/

EDIT: Oops, this tool is mentioned in the post already (but the post is not, so here it is if you want to read about it). Neat!

rscOP1y ago

Thanks for that link. I've changed the link beneath the Cinder mention to that page instead of the GitHub page.

carry_bit1y ago· 1 in thread

rscOP1y ago

Yes, I believe LLVM has a flag for "run only the first N optimization passes" that gets used with binary search this way.

saagarjha1y ago· 1 in thread

saagarjha1y ago

mirrorlake1y ago· 1 in thread

In the event that this is added to the standard library, I'm going to be really curious to see what a "hello world" project/example would look like.

The hash tree is a great visualization. I wouldn't have grasped the importance of the hash suffix until I saw the tree. Awesome stuff.

rscOP1y ago

MatzeBraun1y ago· 1 in thread

LLVM can bisect down to individual transformation steps within a pass (for passes that implement the interface): https://llvm.org/docs/OptBisect.html

And there is a script bisecting functions when given assembly produced by working baseline and “bad” compiler to compare: https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...

tekknolagi1y ago

This is pretty cool. Thanks for the pointer.

Also, hi Matthias!

tabbott1y ago

This is so cool!

Very helpful for answering questions of the form "Why does this test not exercise the code I thought it did?".

drivebycomment1y ago

chc41y ago

j / k navigate · click thread line to collapse