We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10 (bad commits fail 90% of the time, good commits 10%), bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimizing commit selection is key here, since naive median splitting converges much slower.
One thing we found: you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free: it needs zero test runs.
Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.
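If it helps make the I_prior term concrete, here's a hypothetical sketch (the `dependents` input and the Zipf-shaped connectivity are assumptions for illustration, not measurements from sem): weight each commit by its connectivity, and I_prior is the entropy gap between the uniform prior and the structural one.

    import numpy as np

    # Hypothetical sketch: dependents[i] = number of transitive dependents of
    # the code touched by commit i (e.g. from an entity dependency graph).
    def prior_bits(dependents):
        w = 1.0 + np.asarray(dependents, dtype=float)  # smooth: no commit gets zero mass
        prior = w / w.sum()
        entropy = -(prior * np.log2(prior)).sum()
        return np.log2(len(prior)) - entropy           # I_prior, in bits

    rng = np.random.default_rng(0)
    deps = rng.zipf(2.0, size=1024) - 1  # heavy-tailed connectivity, purely illustrative
    print(f"I_prior ~ {prior_bits(deps):.2f} of the log2(1024) = 10 bits needed")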
We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.
I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.
This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you a) that the confidence level is calibrated, and b) how quickly you get to that confidence level (average, p50, and p95).
For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests.
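For anyone who wants the flavor without opening the repo, here's a toy version of that kind of check, with the 90/10 rates from upthread. This is not calibration.py, and it uses a posterior-median split as a stand-in for the real commit selection:

    import numpy as np

    def simulate(n=1024, p_good=0.1, p_bad=0.9, target=0.9, budget=300, seed=0):
        rng = np.random.default_rng(seed)
        true_b = rng.integers(n)                 # first bad commit
        post = np.full(n, 1.0 / n)               # uniform prior over candidates
        for t in range(budget):
            i = int(np.searchsorted(np.cumsum(post), 0.5))  # posterior-median split
            failed = rng.random() < (p_bad if i >= true_b else p_good)
            like = np.where(np.arange(n) <= i,              # commit i is bad iff b <= i
                            p_bad if failed else 1 - p_bad,
                            p_good if failed else 1 - p_good)
            post *= like
            post /= post.sum()
            if post.max() >= target:             # stop at the target confidence
                break
        return post.argmax() == true_b, t + 1

    runs = [simulate(seed=s) for s in range(200)]
    print(f"accuracy {np.mean([ok for ok, _ in runs]):.2f}, "
          f"median tests {int(np.median([t for _, t in runs]))}")

Calibrated means the measured accuracy should come out near the 0.9 you stopped at.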
In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: https://hauntsaninja.github.io/git_bayesect.html
I'm going to have to check out how you got linear time with Shannon entropy, because I used Rényi entropy for that, to make the algebra easier.
It's also possible to do it over the DAG rather than a linear history, although that makes the code a lot more complicated. Unfortunately there doesn't seem to be a linear-time cumulative-sum algorithm over DAGs, so it's superlinear in some cases.
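For what it's worth, the linear-history version can be a standard prefix-sum trick; a generic sketch (not necessarily what either of our implementations actually does):

    import numpy as np

    def posterior_all_splits(n, observations, p_fail_good=0.2, p_fail_bad=0.8):
        # observations: list of (commit_index, failed); B = index of first bad commit
        log_good = np.zeros(n)   # contribution of commit i's tests if i is before B
        log_bad = np.zeros(n)    # contribution if i is at/after B
        for i, failed in observations:
            log_good[i] += np.log(p_fail_good if failed else 1 - p_fail_good)
            log_bad[i] += np.log(p_fail_bad if failed else 1 - p_fail_bad)
        pre = np.concatenate(([0.0], np.cumsum(log_good)[:-1]))  # sum_{i<b} log_good[i]
        suf = np.cumsum(log_bad[::-1])[::-1]                     # sum_{i>=b} log_bad[i]
        logp = pre + suf                  # log P(D | B=b) for every b, in O(n)
        post = np.exp(logp - logp.max())
        return post / post.sum()

The DAG trouble is exactly that `pre` and `suf` stop being simple cumulative sums once a commit has multiple parents.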
But to merge, we need all tests to pass. (If tests flakily pass, then we get new flaky tests, yay!)
I know git bisect doesn't support this, but could git-bayesect have an option to only consider merge commits? Being able to track a flaky change back to an individual PR would be really useful.
I haven't worked these all the way through, but I'm slightly skeptical of, or at least confused by, a few details:
1. Another way to frame P(D|B=b) would be to have the old- vs new-side draws be beta-binomial distributed, in which case the likelihood for each side's draws should carry a binomial coefficient for the number of possible orderings of the observations. Do they end up cancelling out somewhere? [ed: Oh yes, of course -- D includes that in each case we observe exactly one of the C(n,k) orderings; spelled out below the list.]
2. I think your expected conditional entropy code is treating the imputed new observations as independent of the past observations; even if so, it may not affect things much in this model. If it does, though, it might be worth explicitly unit-testing that the naive and efficient calculations match.
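For anyone else pausing on point 1, here's the cancellation spelled out (my notation, not the article's). If you record per-commit counts (k_i failures out of n_i runs) rather than the ordered sequence, the likelihood picks up binomial coefficients, but they attach to commits, not to the hypothesis:

    P(\text{counts} \mid B = b) = \prod_i \binom{n_i}{k_i}\, p_i(b)^{k_i}\, \bigl(1 - p_i(b)\bigr)^{n_i - k_i}

where p_i(b) is the good- or bad-side failure rate depending on whether i < b. The \prod_i \binom{n_i}{k_i} factor is identical for every b, so it cancels when normalizing P(B = b \mid \text{counts}).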
Anyway, thanks for sharing!
A related situation I was in recently: I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" or a "bad" commit without repeated trials (in practice I just did repeats).
I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
I vibe-code a lot of really simple casual games, which should have very minimal demands, and the LLM agent introduces a lot of bad things that don't present right away. Either it takes multiple bad things before I notice, or it doesn't really affect anything on a dev machine but is horrible on wasm+mobile builds, or I just don't notice right away.
This is all really hard to track down: there's noise in the heuristics, and I don't know if I'm looking for one really dumb thing or a bunch of small things that have accumulated over time.
I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I'll post an update on https://github.com/hauntsaninja/git_bayesect/issues/25
(I really enjoy how many generalisations there are of this problem :-) )
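A sketch of what the Gaussian version might look like, under an assumed model (raw scores ~ Normal(mu_good, sigma) before the regression and Normal(mu_bad, sigma) after; none of this exists in git_bayesect yet):

    import numpy as np
    from scipy.stats import norm

    def update_with_score(post, i, score, mu_good=100.0, mu_bad=110.0, sigma=5.0):
        # post[b] = P(first bad commit is b | data); a raw benchmark score at
        # commit i is drawn from the "bad" distribution exactly when b <= i
        like = np.where(np.arange(len(post)) <= i,
                        norm.pdf(score, mu_bad, sigma),
                        norm.pdf(score, mu_good, sigma))
        post = post * like
        return post / post.sum()

The real work is that mu_good, mu_bad, and sigma are unknown in practice, so you'd put conjugate priors on them too, which is where the extra math comes in.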
It's easy to construct ways that this could happen. Maybe you're running a benchmark that does a garbage collection or three. It's easy for a GC to be a little earlier or a little later, and sneak in or out of the timed portion of the test.
Warm starts vs cold starts can also do this. If you don't tear everything down and flush before beginning a test, you might have some amount of stuff cached.
The central limit theorem says you can still make it approximately normal by running enough times and averaging (or running each iteration long enough), but (1) that takes much longer and (2) smushed-together data is often less actionable. You kind of want to know about fast paths and slow paths, and that you're falling off the fast path more often than intended.
As usual you can probably cover your eyes, stick your fingers in your ears, and proceed as if everything were Gaussian. It'll probably work well enough!
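A quick numpy illustration of the smushing (numbers invented): per-run times that are bimodal because a GC pause sneaks into ~10% of runs, versus batch means that look comfortably normal while hiding the two modes.

    import numpy as np

    rng = np.random.default_rng(1)
    runs = np.where(rng.random(100_000) < 0.9,
                    rng.normal(10.0, 0.5, 100_000),   # fast path, ~90% of runs
                    rng.normal(25.0, 2.0, 100_000))   # GC pause lands in the timed region
    batch_means = runs.reshape(-1, 50).mean(axis=1)   # 2000 batches of 50 runs each
    print(f"per-run p50={np.percentile(runs, 50):.1f}, p99={np.percentile(runs, 99):.1f}")
    print(f"batch-mean std={batch_means.std():.2f}: roughly Gaussian, modes gone")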
Does it support running a test multiple times to get a probability for a single commit instead of just pass/fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
IIUC, the way you'd do that right now is to just repeatedly record individual observations on a single commit, which effectively gives it a probability plus the number of trials for the Beta update. I don't yet have a CLI entrypoint to record a batch observation of (probability, num_trials), but it would be easy to add one.
But ofc part of the magic is that git_bayesect's commit selection tells you how to be maximally sample-efficient, so you'd only want to do a batch record if your test has high constant overhead.
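Concretely, the batch version is just the conjugate Beta update; a sketch with assumed names (this isn't the repo's actual API):

    # Recording k failures out of num_trials runs on one commit is the same
    # conjugate update as recording the runs one at a time. (a, b) are the
    # Beta(a, b) parameters for that commit's failure rate.
    def record_batch(a, b, failures, num_trials):
        return a + failures, b + (num_trials - failures)

    a, b = 1.0, 1.0                                   # uniform prior on the failure rate
    a, b = record_batch(a, b, failures=7, num_trials=10)
    print(f"posterior mean failure rate: {a / (a + b):.3f}")  # 8/12 ~ 0.667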
I'm not sure how to choose an optimal value of N. My first hunch is to make it so that running all the tests takes at least as long as the setup (checkout, compile, link, etc.), but it may make sense to go a lot higher than that. I'd have to do some thinking about the maths.
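A toy cost model for the amortization half of that hunch (the adaptivity cost, i.e. that a big batch is chosen with stale information, isn't modeled here):

    def amortized_cost(setup, per_test, n_batch):
        # one checkout/compile/link plus n_batch test runs, per observation
        return (setup + per_test * n_batch) / n_batch

    for n in (1, 2, 5, 12, 50):
        print(n, amortized_cost(setup=120.0, per_test=10.0, n_batch=n))

The per-observation cost falls fast up to roughly N = setup/per_test and flattens after; past that, the extra samples go to a commit the selection rule might no longer pick.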
edit: thanks for the responses! I was not even familiar with `git bisect` before this, so I've got some new things to learn.
One way to think about the tool presented is that it minimizes the number of times you need to run your test suite to locate the bad commit.
For example, suppose you have some piece of hardware which you can interrogate, but not after it crashes. It crashes at a deterministic point. You can step it forward by any number of steps, but you can only examine its state if it did not crash. If it crashed, you have to go back to the start. (I call this situation "Finnegan Search", after the nursery rhyme that prominently features the line "poor old Finnegan had to begin again".)
The deterministic algorithm has you do an examination after every step. The nondeterministic algorithm has you choose some number of steps, accepting the risk that you may have to go back to the start. The optimal number of steps (and thus the choice of algorithm) depends on the ratio of the cost of an examination to the cost of a step, and can be found analytically by maximizing the expected information gain per unit time.
(Either way, the process is pretty annoying, and considerable effort in hardware and software design has gone into providing ways to render it unnecessary, but it still crops up sometimes in embedded systems.)
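To make the trade-off concrete, here's a toy expected-cost computation under one possible model (mine, invented for illustration): the crash point is uniform over n steps, phase one steps k at a time with an examination after each surviving chunk, and on a crash you restart, replay to the last safe point, and single-step the offending chunk.

    import math

    def expected_cost(n, k, step_cost, exam_cost):
        total = 0.0
        for c in range(1, n + 1):          # c = the step that crashes, uniform
            j = math.ceil(c / k)           # chunk in which the crash happens
            steps, exams = c, j - 1        # phase 1: c steps run, exam per safe chunk
            if k > 1:                      # phase 2: restart and single-step the chunk
                inner = c - (j - 1) * k    # crash offset within the chunk
                steps += c                 # replay (j-1)*k steps, then inner single steps
                exams += inner - 1         # exam after each surviving single step
            total += steps * step_cost + exams * exam_cost
        return total / n

    best = min(range(1, 65),
               key=lambda k: expected_cost(1024, k, step_cost=1.0, exam_cost=50.0))
    print(f"optimal chunk size: {best}")

Cheap examinations push the optimum toward k = 1 (plain single-stepping); expensive ones push it toward big, risky chunks.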
Often these bugs depend on timing, caused by unpredictable thread scheduling, CPU load, disk and networking timing, etc. Git commits can affect app timing and change the likelihood of the bug occurring, but in many cases these changes aren't related to the underlying bug. That's distinct from a regular git bisect to find a deterministic bug.
One cool bayesect application is identifying the commit that hits the bug most frequently, so it's easier to debug. But more broadly, I'm wondering about the underlying philosophy of bisection for nondeterministic bugs, and when I'd use it.
https://research.swtch.com/bisect
TLDR: you can bisect on more than just "time".
Great idea anyway!
Now hang on a bit: you can't just plug in averages.
At least, that's what I initially thought. But in this particular instance it works out correctly, because you're calculating an expected value of the entropy over the two possible outcomes, and there the posterior mean is indeed the correct probability to use.
You do have to take the prior into account when calculating the posterior distributions for B, but that formula is in the article.
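Concretely, here's that expectation as code (my notation, mirroring the setup as I understand it): the outcome probability P(fail) is the posterior predictive, which for Beta-distributed failure rates is exactly their posterior means.

    import numpy as np

    def H(p):
        p = p / p.sum()
        return -(p[p > 0] * np.log2(p[p > 0])).sum()

    def expected_entropy(post, i, mean_p_good, mean_p_bad):
        # post[b] = P(first bad commit is b); a test at commit i runs on a bad
        # commit exactly when b <= i
        p_fail_b = np.where(np.arange(len(post)) <= i, mean_p_bad, mean_p_good)
        p_fail = (post * p_fail_b).sum()   # posterior predictive P(next test fails)
        return (p_fail * H(post * p_fail_b)                  # entropy if it fails
                + (1 - p_fail) * H(post * (1 - p_fail_b)))   # entropy if it passes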