You do have to provide flake rate probability to do the probability estimates, but even roughly correct rates work fine. Running bisects assuming a 5% chance of getting a false positive or negative barely adds more steps and greatly improves robustness.
The math is pretty simple too-- my old prototype might still be in Google's monorepo; I should reimplement it for the open source world.
(I suspect you could get pretty far with a Monte Carlo simulation, and that would let you bypass most of the math anyway.)
Here's a nice presentation on an analysis of the method, references start on slide 47
https://people.orie.cornell.edu/pfrazier/Presentations/2014....
Paraphrasing the theorem from Jedynak, Frazier, Sznitman (2011):
Suppose [flake probability] is constant, known, and bounded away from 1/2, and we use the entropy loss function. ... The policy that chooses [next test point] at the median of [current posterior distribution] is optimal.
And some python code that implements a version of the analysis
https://github.com/choderalab/thresholds/blob/master/thresho...
https://github.com/Ealdwulf/BBChop/blob/master/BBChop/doc/Ba...
(Edit- had wrong link originally)
When gathering more evidence, you'd use your new belief about which cookie bowl Fred has as P(H1)=0.6 and P(H2)=0.4
I was specifically interested in the application to binary search / bisection in the presence of flaky tests.
You can even do things like automatically git bisect the linux kernel, with a little care. I wrote this up a few years ago https://paulgraydon.co.uk/posts/2020-12-27-automated-kernel-....
I can't talk about the actual case that lead me to ever need to git bisect a kernel in the first place, but at the same time I also learned how to make a minimalist one-C-file initrd, because what I needed to catch was visible under a /sys or /dev mount, or something like that. I later realised it's possible to do the same thing with Go in slightly easier syntax, if you cajole it in to producing a statically compiled binary!
the algorithms for using binary search to efficiently reduce a set satisfying some predicate to a locally minimal satisfying subset* are new to me (though cox says zeller published a slightly buggy version in 01999! and meta's cinder a correct one in 02021), and seem brilliant; their applications are not limited to debugging. i wonder how it relates to hypothesis's test-case reduction algorithm; can one of them be considered an application of the other?
also, this idea of binary-search debugging of the program's call tree rather than your revision history (or program input, or a set of data instances) is also a brilliant one. and although they published it a decade ago, i hadn't heard about it until now
the examples of asynctimerchan=1, changing optimization settings, and changing sort algorithms have in common that in some sense they are behavior-preserving, so you can toggle them on and off at will during execution without breaking anything. i wonder how to apply this call-tree debugging if the change you're trying to narrow down is a change that has to be consistent throughout the program's execution. for example, suppose some code using your hash tables breaks when you switch to a new hash function, maybe because it inadvertently depended on enumeration order. if you change the hash function partway through the program, you won't be able to find things in your hash tables after that. you could change the algorithm per table, of course, and narrow it down to a particular table, but that won't give you the particular line of code
i need to think a bit more about this issue of 'hashing a list of program counters'. you could of course number the sequence of all subroutine invocations during a (deterministic! single-threaded!) execution, as gas does for macro invocations, and binary-search that dense numbering. (this is a variant of the technique carry_bit is calling 'optimization fuel', but one that requires support from a compiler or debugger.) but, since you're toggling options on and off that will change the number of subroutine calls, the numbering won't be stable; so this will tend to only reliably find single-culprit failures
you could possibly get a stable-enough numbering using pathnames like /3/5/1, meaning the the 1st subroutine called from the 5th subroutine called from the 3rd subroutine called from main(). that seems like it might in some sense be stabler than hashing the entire list of return addresses, and it would certainly permit a lower-overhead implementation using a debugger and breakpoints rather than a check in every leaf call. plausibly i'm overlooking a flaw in this form of 'sequential numbering'? does the hashed list get truncated at some point for stability?
often when you have a change that is in some sense behavior-preserving, which is to say, you have two ways to do the same thing, you can use generative testing systems like hypothesis to detect bugs in either of them: process the same input through both paths and verify that the results are equivalent in the appropriate sense. this doesn't require the instrumentation infrastructure russ is using here, but it does depend on you being able to identify the relevant 'input', which can be very hard
in itself that doesn't help with the kinds of bugs he's talking about here, though: bugs where both the old and new code is 'equivalent' by your lights, but some other client code that calls it doesn't find it equivalent. this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code. aslr and hash-table seed randomization are doing this for us for some purposes, but unlike generative-testing frameworks, they provoke outages in production, don't do test-case minimization, and don't record failing cases to make bisection easy and prevent later regressions. and they don't do things like shuffling the input to a non-stable sort subroutine
binary-search debugging does indeed feel magical. scaevolus seems to be saying there's a bayesian generalization of it for nondeterministic bugs that are effectively random? you can of course run the test 5 (or 1000) times on each revision you're binary-searching over, but it feels like, if the number of revisions you're searching over is several thousand, you ought to be able to get some additional advantage out of running the test once on each of 5 (or 1000) revisions. can you solve this just by maximizing the expected shannon information of each test?
on a side note, it's pretty appalling that 30 years ago the plan9 group had `yesterday -d -n 7 anyfilename` to see what changed in the last week, thanks to their optical jukebox, while in the mainstream we still struggle with accidental file deletion and overwrites despite routinely carrying around terabytes in our pockets
on an even more marginally relevant note, earlier this week i was perusing the 7th edition unix kernel, in which the subroutine that switches stacks (the one with the well-known comment in 6th edition) is called swtch(). and tonight i just realized why russ cox uses that domain name
______
* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer
For the hash table case, deciding the function per-table as you suggest is probably good enough. In a large program there are going to be tons of hash tables, and if bisect gets you to "it's this specific hash table allocated by this call stack that matters", that's a huge win.
The timer change is much like the hash table one, in that we toggle the "timer kind" at creation time, but it's only later operations that actually change behavior. Still, identifying the specific timer and call stack that created it turned out to be good enough in all cases.
does the list of callsite return counters being hashed get truncated at some point for stability? i'd think that hashing the call stack f→g→j→k→m→n→o→p to a hash unrelated to f→g→h→j→k→m→n→o→p would tend to interfere with the 'forced' set, but maybe this is only being used for things where the behavior is so closely equivalent that this problem doesn't arise
There is a compiler fuzzing method called "equivalence modulo inputs" that does something like this. Take a fuzzed program with fixed inputs and profile it to find parts that are never executed. Change only code in those never executed parts. If the original and modified programs behave differently, you found a compiler bug: The compiler got confused by code that it can see but that (unknown to the compiler) cannot dynamically influence the program's behavior on the given inputs.
Like so many things in fuzzing, this sounds silly but actually finds bugs in practice.
There is only a decision about whether to use the old or new implementation based on the hash of the call stack at that moment. Then you binary search on hash values to identify the exact call stack hash for which the new implementation causes a problem. Then you run the program once more with the instructions "print the stack with this hash".
This is significantly different because it actually changes the program that is being run. It's not common at all - probably because there aren't too many situations where it applies. You need to be processing some large input that you don't understand, and commonly change bits of your code in ways that you can turn on and off dynamically during a single run.
EDIT: Oops, this tool is mentioned in the post already (but the post is not, so here it is if you want to read about it). Neat!
A global optimization fuel that worked at finer granularity would be even more precise but you'd have to have the compiler run single-threaded to make sure the numbering always matches. At least in the Go compiler, we compile different functions in different threads, so that fouls up any global numbering.
I went so far as to find the commit where David Chase added for loopvar on Mar 6, 2023 (github: golang/go/commit/c20d959) to try to design my own hello world with x/tools/cmd/bisect, but I'm out of my depth.
The hash tree is a great visualization. I wouldn't have grasped the importance of the hash suffix until I saw the tree. Awesome stuff.
You just call x509sha1.Value() and make a decision about old or new behavior based on that. Bisect can then toggle that result based on the hash of the call stack to narrow down the exact path that leads to the critical decision.
You can ignore the IncNonDefault call a few lines later. That's about providing monitoring for whether any non-default (old) behaviors are being executed in a program, so that people can hook it up to something like prometheus and see whether they would be affected by changing to "always default behavior".
And there is a script bisecting functions when given assembly produced by working baseline and “bad” compiler to compare: https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...
Also, hi Matthias!
It reminds me a bit of one my favorite debugging techniques: Running an individual test with coverage reporting enabled, and then clicking around the HTML coverage report to see exactly what code path the test followed, without needing to use either print statements or a specialized debugger.
Very helpful for answering questions of the form "Why does this test not exercise the code I thought it did?".