We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10 (bad commits fail 90% of the time, good commits 10%), bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimizing commit selection is key here, since naive median splitting converges much slower.
One thing we found: you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free: it needs zero test runs.
Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.
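If it helps make the I_prior term concrete, here's a hypothetical sketch (the `dependents` input and the Zipf-shaped connectivity are assumptions for illustration, not measurements from sem): weight each commit by its connectivity, and I_prior is the entropy gap between the uniform prior and the structural one.

    import numpy as np

    # Hypothetical sketch: dependents[i] = number of transitive dependents of
    # the code touched by commit i (e.g. from an entity dependency graph).
    def prior_bits(dependents):
        w = 1.0 + np.asarray(dependents, dtype=float)  # smooth: no commit gets zero mass
        prior = w / w.sum()
        entropy = -(prior * np.log2(prior)).sum()
        return np.log2(len(prior)) - entropy           # I_prior, in bits

    rng = np.random.default_rng(0)
    deps = rng.zipf(2.0, size=1024) - 1  # heavy-tailed connectivity, purely illustrative
    print(f"I_prior ~ {prior_bits(deps):.2f} of the log2(1024) = 10 bits needed")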
We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.
I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.
This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you a) that the confidence level is calibrated, and b) how quickly you get to that confidence level (average, p50, and p95).
For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests.
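For anyone who wants the flavor without opening the repo, here's a toy version of that kind of check, with the 90/10 rates from upthread. This is not calibration.py, and it uses a posterior-median split as a stand-in for the real commit selection:

    import numpy as np

    def simulate(n=1024, p_good=0.1, p_bad=0.9, target=0.9, budget=300, seed=0):
        rng = np.random.default_rng(seed)
        true_b = rng.integers(n)                 # first bad commit
        post = np.full(n, 1.0 / n)               # uniform prior over candidates
        for t in range(budget):
            i = int(np.searchsorted(np.cumsum(post), 0.5))  # posterior-median split
            failed = rng.random() < (p_bad if i >= true_b else p_good)
            like = np.where(np.arange(n) <= i,              # commit i is bad iff b <= i
                            p_bad if failed else 1 - p_bad,
                            p_good if failed else 1 - p_good)
            post *= like
            post /= post.sum()
            if post.max() >= target:             # stop at the target confidence
                break
        return post.argmax() == true_b, t + 1

    runs = [simulate(seed=s) for s in range(200)]
    print(f"accuracy {np.mean([ok for ok, _ in runs]):.2f}, "
          f"median tests {int(np.median([t for _, t in runs]))}")

Calibrated means the measured accuracy should come out near the 0.9 you stopped at.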
In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: https://hauntsaninja.github.io/git_bayesect.html
I'm going to have to check out how you got linear time with Shannon entropy, because I used Rényi entropy for that, to make the algebra easier.
It's also possible to do it over the DAG rather than a linear history, although that makes the code a lot more complicated. Unfortunately there doesn't seem to be a linear-time cumulative-sum algorithm over DAGs, so it's superlinear in some cases.
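For what it's worth, the linear-history version can be a standard prefix-sum trick; a generic sketch (not necessarily what either of our implementations actually does):

    import numpy as np

    def posterior_all_splits(n, observations, p_fail_good=0.2, p_fail_bad=0.8):
        # observations: list of (commit_index, failed); B = index of first bad commit
        log_good = np.zeros(n)   # contribution of commit i's tests if i is before B
        log_bad = np.zeros(n)    # contribution if i is at/after B
        for i, failed in observations:
            log_good[i] += np.log(p_fail_good if failed else 1 - p_fail_good)
            log_bad[i] += np.log(p_fail_bad if failed else 1 - p_fail_bad)
        pre = np.concatenate(([0.0], np.cumsum(log_good)[:-1]))  # sum_{i<b} log_good[i]
        suf = np.cumsum(log_bad[::-1])[::-1]                     # sum_{i>=b} log_bad[i]
        logp = pre + suf                  # log P(D | B=b) for every b, in O(n)
        post = np.exp(logp - logp.max())
        return post / post.sum()

The DAG trouble is exactly that `pre` and `suf` stop being simple cumulative sums once a commit has multiple parents.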
But to merge, we need all tests to pass. (If tests flakily pass, then we get new flaky tests, yay!)
I know git bisect doesn't support this, but could git-bayesect have an option to only consider merge commits? Being able to track a flaky change back to an individual PR would be really useful.
I haven't worked these all the way through, but I'm slightly skeptical of, or at least confused by, a few details:
1. Another way to frame P(D|B=b) would be to have the old- vs new-side draws be beta-binomial distributed, in which case the likelihood for each side's draws should carry a binomial coefficient for the number of possible orderings of the observations. Do they end up cancelling out somewhere? [ed: Oh yes, of course -- D includes that in each case we observe exactly one of the C(n,k) orderings; spelled out below the list.]
2. I think your expected conditional entropy code is treating the imputed new observations as independent of the past observations; even if so, it may not affect things much in this model. If it does, though, it might be worth explicitly unit-testing that the naive and efficient calculations match.
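For anyone else pausing on point 1, here's the cancellation spelled out (my notation, not the article's). If you record per-commit counts (k_i failures out of n_i runs) rather than the ordered sequence, the likelihood picks up binomial coefficients, but they attach to commits, not to the hypothesis:

    P(\text{counts} \mid B = b) = \prod_i \binom{n_i}{k_i}\, p_i(b)^{k_i}\, \bigl(1 - p_i(b)\bigr)^{n_i - k_i}

where p_i(b) is the good- or bad-side failure rate depending on whether i < b. The \prod_i \binom{n_i}{k_i} factor is identical for every b, so it cancels when normalizing P(B = b \mid \text{counts}).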
Anyway, thanks for sharing!
A related situation I was in recently: I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" or a "bad" commit without repeated trials (in practice I just did repeats).
I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
I vibe-code a lot of really simple casual games, which should have very minimal demands, and the LLM agent introduces a lot of bad things that don't present right away. Either it takes multiple bad things before I notice, or it doesn't really affect anything on a dev machine but is horrible on wasm+mobile builds, or I just don't notice right away.
This is all really hard to track down: there's noise in the heuristics, and I don't know if I'm looking for one really dumb thing or a bunch of small things that have accumulated over time.
I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I'll post an update on https://github.com/hauntsaninja/git_bayesect/issues/25
(I really enjoy how many generalisations there are of this problem :-) )
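A sketch of what the Gaussian version might look like, under an assumed model (raw scores ~ Normal(mu_good, sigma) before the regression and Normal(mu_bad, sigma) after; none of this exists in git_bayesect yet):

    import numpy as np
    from scipy.stats import norm

    def update_with_score(post, i, score, mu_good=100.0, mu_bad=110.0, sigma=5.0):
        # post[b] = P(first bad commit is b | data); a raw benchmark score at
        # commit i is drawn from the "bad" distribution exactly when b <= i
        like = np.where(np.arange(len(post)) <= i,
                        norm.pdf(score, mu_bad, sigma),
                        norm.pdf(score, mu_good, sigma))
        post = post * like
        return post / post.sum()

The real work is that mu_good, mu_bad, and sigma are unknown in practice, so you'd put conjugate priors on them too, which is where the extra math comes in.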
It's easy to construct ways that this could happen. Maybe you're running a benchmark that does a garbage collection or three. It's easy for a GC to be a little earlier or a little later, and sneak in or out of the timed portion of the test.
Warm starts vs cold starts can also do this. If you don't tear everything down and flush before beginning a test, you might have some amount of stuff cached.
The central limit theorem says you can still make it approximately normal by running enough times and averaging (or running each iteration long enough), but (1) that takes much longer and (2) smushed-together data is often less actionable. You kind of want to know about fast paths and slow paths, and that you're falling off the fast path more often than intended.
As usual you can probably cover your eyes, stick your fingers in your ears, and proceed as if everything were Gaussian. It'll probably work well enough!
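A quick numpy illustration of the smushing (numbers invented): per-run times that are bimodal because a GC pause sneaks into ~10% of runs, versus batch means that look comfortably normal while hiding the two modes.

    import numpy as np

    rng = np.random.default_rng(1)
    runs = np.where(rng.random(100_000) < 0.9,
                    rng.normal(10.0, 0.5, 100_000),   # fast path, ~90% of runs
                    rng.normal(25.0, 2.0, 100_000))   # GC pause lands in the timed region
    batch_means = runs.reshape(-1, 50).mean(axis=1)   # 2000 batches of 50 runs each
    print(f"per-run p50={np.percentile(runs, 50):.1f}, p99={np.percentile(runs, 99):.1f}")
    print(f"batch-mean std={batch_means.std():.2f}: roughly Gaussian, modes gone")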
Does it support running a test multiple times to get a probability for a single commit instead of just pass/fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
IIUC, the way you'd do that right now is to just repeatedly record individual observations on a single commit, which effectively gives it a probability plus the number of trials for the Beta update. I don't yet have a CLI entrypoint to record a batch observation of (probability, num_trials), but it would be easy to add one.
But ofc part of the magic is that git_bayesect's commit selection tells you how to be maximally sample-efficient, so you'd only want to do a batch record if your test has high constant overhead.
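Concretely, the batch version is just the conjugate Beta update; a sketch with assumed names (this isn't the repo's actual API):

    # Recording k failures out of num_trials runs on one commit is the same
    # conjugate update as recording the runs one at a time. (a, b) are the
    # Beta(a, b) parameters for that commit's failure rate.
    def record_batch(a, b, failures, num_trials):
        return a + failures, b + (num_trials - failures)

    a, b = 1.0, 1.0                                   # uniform prior on the failure rate
    a, b = record_batch(a, b, failures=7, num_trials=10)
    print(f"posterior mean failure rate: {a / (a + b):.3f}")  # 8/12 ~ 0.667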
I'm not sure how to choose an optimal value of N. My first hunch is to make it so that running all the tests takes at least as long as the setup (checkout, compile, link, etc.), but it may make sense to go a lot higher than that. I'd have to do some thinking about the maths.
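A toy cost model for the amortization half of that hunch (the adaptivity cost, i.e. that a big batch is chosen with stale information, isn't modeled here):

    def amortized_cost(setup, per_test, n_batch):
        # one checkout/compile/link plus n_batch test runs, per observation
        return (setup + per_test * n_batch) / n_batch

    for n in (1, 2, 5, 12, 50):
        print(n, amortized_cost(setup=120.0, per_test=10.0, n_batch=n))

The per-observation cost falls fast up to roughly N = setup/per_test and flattens after; past that, the extra samples go to a commit the selection rule might no longer pick.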
edit: thanks for the responses! I was not even familiar with `git bisect` before this, so I've got some new things to learn.
One way to think about the tool presented is that it minimizes the number of times you need to run your test suite to locate the bad commit.
For example, suppose you have some piece of hardware which you can interrogate, but not after it crashes. It crashes at a deterministic point. You can step it forward by any number of steps, but you can only examine its state if it did not crash. If it crashed, you have to go back to the start. (I call this situation "Finnegan Search", after the nursery rhyme that prominently features the line "poor old Finnegan had to begin again".)
The deterministic algorithm has you do an examination after every step. The nondeterministic algorithm has you choose some number of steps, accepting the risk that you may have to go back to the start. The optimal number of steps (and thus the choice of algorithm) depends on the ratio of the cost of an examination to the cost of a step, and can be found analytically by maximizing the expected information gain per unit time.
(Either way, the process is pretty annoying, and considerable effort in hardware and software design has gone into providing ways to render it unnecessary, but it still crops up sometimes in embedded systems.)
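To make the trade-off concrete, here's a toy expected-cost computation under one possible model (mine, invented for illustration): the crash point is uniform over n steps, phase one steps k at a time with an examination after each surviving chunk, and on a crash you restart, replay to the last safe point, and single-step the offending chunk.

    import math

    def expected_cost(n, k, step_cost, exam_cost):
        total = 0.0
        for c in range(1, n + 1):          # c = the step that crashes, uniform
            j = math.ceil(c / k)           # chunk in which the crash happens
            steps, exams = c, j - 1        # phase 1: c steps run, exam per safe chunk
            if k > 1:                      # phase 2: restart and single-step the chunk
                inner = c - (j - 1) * k    # crash offset within the chunk
                steps += c                 # replay (j-1)*k steps, then inner single steps
                exams += inner - 1         # exam after each surviving single step
            total += steps * step_cost + exams * exam_cost
        return total / n

    best = min(range(1, 65),
               key=lambda k: expected_cost(1024, k, step_cost=1.0, exam_cost=50.0))
    print(f"optimal chunk size: {best}")

Cheap examinations push the optimum toward k = 1 (plain single-stepping); expensive ones push it toward big, risky chunks.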
Often these bugs depend on timing, caused by unpredictable thread scheduling, CPU load, disk and networking timing, etc. Git commits can affect app timing and change the likelihood of the bug occurring, but in many cases these changes aren't related to the underlying bug. That's distinct from a regular git bisect to find a deterministic bug.
One cool bayesect application is identifying the commit that hits the bug most frequently, so it's easier to debug. But more broadly, I'm wondering about the underlying philosophy of bisection for nondeterministic bugs, and when I'd use it.
https://research.swtch.com/bisect
TLDR: you can bisect on more than just "time".
Great idea anyway!
Now hang on a bit: you can't just plug in averages.
At least, that's what I initially thought. But in this particular instance it works out correctly, because you're calculating an expected value of the entropy over the two possible outcomes, and there the posterior mean is indeed the correct probability to use.
You do have to take the prior into account when calculating the posterior distributions for B, but that formula is in the article.
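Concretely, here's that expectation as code (my notation, mirroring the setup as I understand it): the outcome probability P(fail) is the posterior predictive, which for Beta-distributed failure rates is exactly their posterior means.

    import numpy as np

    def H(p):
        p = p / p.sum()
        return -(p[p > 0] * np.log2(p[p > 0])).sum()

    def expected_entropy(post, i, mean_p_good, mean_p_bad):
        # post[b] = P(first bad commit is b); a test at commit i runs on a bad
        # commit exactly when b <= i
        p_fail_b = np.where(np.arange(len(post)) <= i, mean_p_bad, mean_p_good)
        p_fail = (post * p_fail_b).sum()   # posterior predictive P(next test fails)
        return (p_fail * H(post * p_fail_b)                  # entropy if it fails
                + (1 - p_fail) * H(post * (1 - p_fail_b)))   # entropy if it passes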