
0 points · jahala · 17d ago · 0 comments

This is why, when I built a tool in the same space, I chose to benchmark on cost per correct answer.

Reducing tokens and turns is worthless if the LLM doesn't solve the task you give it.
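
For illustration, here is a minimal sketch of what a cost-per-correct-answer metric could look like. It assumes each benchmark run records a dollar cost and a pass/fail result; the `Run` type, its field names, and the sample numbers are hypothetical and are not taken from any particular benchmark harness.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cost_usd: float  # total spend for this attempt (hypothetical field)
    correct: bool    # whether the answer passed the benchmark's check


def cost_per_correct_answer(runs: list[Run]) -> float:
    """Total spend across all runs divided by the number of correct answers."""
    total_cost = sum(r.cost_usd for r in runs)
    n_correct = sum(1 for r in runs if r.correct)
    if n_correct == 0:
        # Nothing was solved, so no amount of token savings helps.
        return float("inf")
    return total_cost / n_correct


# Made-up numbers: the second setup halves the cost of each run
# but solves far fewer tasks.
baseline = [Run(0.10, True), Run(0.10, True), Run(0.10, False), Run(0.10, True)]
cheaper_but_worse = [Run(0.05, True), Run(0.05, False), Run(0.05, False), Run(0.05, False)]

print(cost_per_correct_answer(baseline))
print(cost_per_correct_answer(cheaper_but_worse))
```

With these made-up numbers the setup that is cheaper per run ends up more expensive per correct answer, because failed attempts still cost money.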


5 comments · 1 top-level

esafak · 17d ago · 4 in thread

Did you benchmark the competition, and can we see the results?

jahala (OP) · 16d ago

No, I don't have the funds to benchmark the competition, but I would be happy to put the numbers up if any token whales feel like having a go.

https://github.com/jahala/tilth/tree/main/benchmark

alex7o · 16d ago

Oh, that is a nice approach. I wish more benchmarks reported cost per successful answer.

onlyrealcuzzo · 17d ago

The problem with even attempting to develop a tool for the frontier model space is that the cost of running a statistically significant benchmark is almost certainly going to be over $100 for a single model.

Unless something is like 25%+ more cost effective on Gemini for a task, I would not assume those savings are going to transfer to GPT.

If you need to run a test this expensive and slow for every release, hobbyists aren't going to do it.

And if you wanted to show broad improvements to coding, as they all claim, the costs would be in the thousands per release, even for a single model.

And the results almost certainly would not be eye-popping.

If the models could be SUBSTANTIALLY better, Google and Anthropic and OpenAI wouldn't be finding that out from a hobbyist making wildly unscientific claims.
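
As a rough illustration of how the bill for a statistically meaningful comparison adds up, the sketch below estimates how many runs are needed to tell two success rates apart, then multiplies by a per-run price. It uses the standard normal-approximation sample size for comparing two proportions and only looks at success rates, not cost per correct answer. The success rates and the per-run cost are assumptions chosen for the example, not measurements from anyone's benchmark.

```python
import math
from statistics import NormalDist


def runs_per_arm(p_baseline: float, p_tool: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate runs per configuration for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_tool * (1 - p_tool)
    effect = p_baseline - p_tool
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def benchmark_budget_usd(n_per_arm: int, cost_per_run_usd: float, arms: int = 2) -> float:
    """Total spend for running every arm n_per_arm times."""
    return n_per_arm * arms * cost_per_run_usd


# Hypothetical inputs: 70% vs 80% success rate, $0.20 per agent run.
n = runs_per_arm(0.70, 0.80)
print(n, benchmark_budget_usd(n, cost_per_run_usd=0.20))
```

Smaller differences in success rate push the required number of runs up quickly, since the run count scales with the inverse square of the gap.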

jahala (OP) · 16d ago

Yup, this hits it on the nose. But despite the cost, the benchmark is the vital ingredient that can't be skipped. Otherwise, you don't know whether what you're building is actually helping the agent rather than hindering it.

On the previous large benchmark run, I showed a 40-50% reduction in cost per correct answer.

I'm not sure why the vendors aren't using token filtering/compression more in their tooling, but perhaps they don't mind users feeding them more data.
