undefined | Better HN

0 pointsyorwba14d ago0 comments

How many samples you need depends on the difference you want to be able to measure (0% to 1% is different from 50% to 51% is different from 0% to 10% is different from 50% to 60%), the significance level at which you will declare a difference (conventionally, p < 0.05) and how likely you want this to happen when there is indeed such a difference (statistical power, conventionally 80%). Of course you can also just sample an arbitrary number of times and compute confidence intervals after the fact, but doing a statistical power computation helps clarify what it is you want to know, how certain you want to be, and whether you can realistically achieve such knowledge with the budget you have.

0 comments

dataviz100014d ago

To solve the lambda calculus problem Sonnet burns 8,163 - 17,334 tokens on 5 runs.

If I want to engineer a prompt, starting with the tokens which are clearly better in the one with 8,163 will yield a better agent.

If I build an agent that does something arbitrary like reverse engineer any website or multiplies 2 large numbers without a tool that allows it to use code, the mechanics of the reasoning work the same as an agent solving lambda calculus. Running 39,440 trials is prohibitory expensive. Nonetheless, without perfect proof, I want to say running an agent several times and then take any generalized output from the fastest runs yields much faster generalized agent that solves that specific task given different parameters.

That is something I really want to know. If I have an agent that reverse engineers websites, can I take the thinking output from the best running and use that to seed a better agent? I don't know how to set up the experiment. And asking ChatGPT has been futile especially and running it is very expensive. How do I set up that experiment?

yorwbaOP14d ago

You could try a sequential testing setup, which can let you stop the experiment earlier if the difference is larger than expected. But if the difference is small, there's no way around the fact that reliably detecting small differences requires large sample sizes, and the relationship is inverse quadratic (halving the smallest detectable difference quadruples the sample size you need).

j / k navigate · click thread line to collapse

0 comments

dataviz100014d ago

To solve the lambda calculus problem Sonnet burns 8,163 - 17,334 tokens on 5 runs.

If I want to engineer a prompt, starting with the tokens which are clearly better in the one with 8,163 will yield a better agent.

yorwbaOP14d ago

j / k navigate · click thread line to collapse