
0 points · embedding-shape · 4mo ago · 0 comments

When I first started using LLMs, I read and analyzed benchmarks, looked at the example prompts people used, and so on. But many times a new model tops a benchmark, you expect it to be better, and then in real work it completely drops the ball. Since then I've stopped even reading benchmarks. I don't care one iota about them; they always seem more misleading than helpful.

Today I have my own private benchmarks, with tests I run myself and private test cases I refuse to share publicly. I've built these up over the last 1 to 1.5 years: whenever I find something my current model struggles with, it becomes a new test case in the benchmark.

Nowadays it's as easy as `just bench $provider $model`: it runs my benchmarks against that model and I get a score that reflects what I actually use the models for, and it more or less matches my experience of actually using them. I recommend that people who use LLMs for serious work try the same approach and stop relying on public benchmarks, which (seemingly) are all gamed by now.
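
For anyone wondering what the core of such a harness could look like, here is a minimal sketch in Python. It is not the actual harness behind `just bench`; the names `Case`, `run_bench`, and `fake_model` are made up for illustration, the two test cases are placeholders, and the model call is a stub you would replace with a real client for your provider.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One private test case (hypothetical structure, not from the post)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the reply passes


def run_bench(ask: Callable[[str], str], cases: list[Case]) -> float:
    """Send every prompt through `ask` and return the fraction of cases passed."""
    passed = 0
    for case in cases:
        try:
            ok = bool(case.check(ask(case.prompt)))
        except Exception:
            # A crashing model call or checker counts as a failure.
            ok = False
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
    return passed / len(cases) if cases else 0.0


# Placeholder cases; real ones would come from failures seen in daily use.
CASES = [
    Case("arithmetic",
         "What is 17 * 3? Reply with the number only.",
         lambda reply: reply.strip() == "51"),
    Case("json-only",
         'Return the JSON object {"ok": true} and nothing else.',
         lambda reply: reply.strip().startswith("{")),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real provider call (assumption: swap in your own client)."""
    if "17 * 3" in prompt:
        return "51"
    return 'Sure! Here you go: {"ok": true}'


if __name__ == "__main__":
    score = run_bench(fake_model, CASES)
    print(f"score: {score:.0%}")
```

The point of keeping `ask` as a plain function is that the same cases can be run against any provider and model by swapping one callable, and every new failure you hit in real work becomes one more `Case` in the list.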


2 comments · 1 top-level

cdelsolar · 4mo ago · 1 in thread

embedding-shape (OP) · 4mo ago

The harness? It's trivial to build yourself: ask your LLM for help. It's ~1000 LOC that you could hack together in 10-15 minutes.

As for the test cases themselves, sharing them would obviously defeat the purpose, so no :)

