This is similar to bootstrapping a random variable in statistics. Your N estimates (each derived from a resampled subset of the data) give you an estimate of the distribution of the random variable. If the variance of that distribution is small relative to the magnitude of the point estimate, then you have high confidence that your point estimate is close to the true value.
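A minimal sketch of what I mean, using plain resampling with replacement (the sample data and helper names here are just illustrative):

```python
import random
import statistics

def bootstrap_estimates(sample, n_resamples=1000, stat=statistics.mean):
    """Resample with replacement and recompute the statistic each time."""
    return [
        stat(random.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]

random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(50)]
estimates = bootstrap_estimates(sample)

point = statistics.mean(sample)
spread = statistics.stdev(estimates)
# Small spread relative to the point estimate -> high confidence in it.
print(f"point estimate: {point:.2f}, bootstrap stdev: {spread:.2f}")
```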
Likewise with your metric: if all the answers agree despite the perturbations, then the answer is more likely to be ... true?
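A toy version of the agreement score I'm picturing (the function name and example answers are mine, not from the post):

```python
from collections import Counter

def consistency(answers):
    """Fraction of answers matching the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical answers to N perturbed phrasings of the same question.
print(consistency(["Paris", "Paris", "Paris", "Paris"]))  # 1.0 -> stable
print(consistency(["Paris", "Lyon", "Paris", "Nice"]))    # 0.5 -> likely unreliable
```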
I'd really like to see a plot of your metric versus the SimpleQA hallucination benchmark that OpenAI uses.