The market's being split into:
1. Longitudinal LLM observability tooling
Most eval startups have gone down the route of becoming something like an observability platform for LLM inference. They want to be in your stack, running the inference, so they can collect data on how it performs.
They collect things like how often a model returns JSON that's out of spec or returns unexpected values, as well as general timing and cost info.
2. Safety Limiting / Pentesting
Say you're doing something in the medical field, or something otherwise sensitive, and you want to figure out which model gives the best outputs for your task without going off the rails.
3. Simple cost + performance + quality swapping
This is what my tool does: it basically lets you test whether you _really_ need to be running that frontier model in a loop across a million records, or whether you'd be better off with an older model or something else.
Example eval: https://giyd8stidy.evvl.io
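To make the first and third categories concrete, here is a minimal sketch (not how any particular tool does it) of running the same records through several models, tallying out-of-spec JSON, wrong answers, time, and cost, then picking the cheapest model that clears a quality bar. Everything here is an assumption for illustration: `call_model` is any function you supply that takes a prompt and returns a string, the flat `cost_per_call` stands in for real token pricing, and the expected output is a JSON object with a `label` key.

```python
import json
import time
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    calls: int = 0
    invalid_json: int = 0   # unparseable or out-of-spec responses
    wrong_answers: int = 0  # valid JSON, but not the expected label
    seconds: float = 0.0
    cost_usd: float = 0.0

    @property
    def pass_rate(self) -> float:
        ok = self.calls - self.invalid_json - self.wrong_answers
        return ok / self.calls if self.calls else 0.0


def evaluate(name, call_model, cost_per_call, records):
    """Run one model over (prompt, expected_label) pairs and tally results."""
    stats = ModelStats(name)
    for prompt, expected in records:
        start = time.perf_counter()
        raw = call_model(prompt)  # hypothetical: your own inference wrapper
        stats.seconds += time.perf_counter() - start
        stats.calls += 1
        stats.cost_usd += cost_per_call  # assumed flat price per call
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            stats.invalid_json += 1
            continue
        if not isinstance(parsed, dict) or "label" not in parsed:
            stats.invalid_json += 1  # parsed, but not the shape we asked for
        elif parsed["label"] != expected:
            stats.wrong_answers += 1
    return stats


def cheapest_passing(all_stats, min_pass_rate):
    """Return the lowest-cost model meeting the pass rate, or None."""
    passing = [s for s in all_stats if s.pass_rate >= min_pass_rate]
    return min(passing, key=lambda s: s.cost_usd, default=None)
```

The point of `cheapest_passing` is that the answer depends on the threshold you pick: a model that fails 1 in 4 records may be fine for a rough triage job and useless for anything stricter.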
Also, model providers are not interested in having their models compared head-to-head under identical conditions. And “Model A is better than Model B” is almost meaningless by itself. Better for what task? With what prompt? What inputs? What budget? What failure tolerance?
It would be nice to have a place where users could run their own benchmarks, define evaluation criteria for their actual use cases, and make those runs verifiable by others.
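As a rough sketch of what "verifiable by others" could look like: pin down the task, prompt, inputs, budget, and failure tolerance in one spec, then publish hashes of the spec and the outputs alongside the run. All names below (`EvalSpec`, `make_run_record`, `verify_run_record`) are hypothetical, and the hashes only show that a published record wasn't altered and was run against the stated spec. They don't prove the model actually produced those outputs; for that someone still has to re-run it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalSpec:
    task: str
    prompt_template: str
    inputs: tuple
    budget_usd: float
    max_failure_rate: float


def fingerprint(obj) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra spaces)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_run_record(spec, model, outputs):
    """Bundle a run with hashes so others can check it against the spec."""
    spec_dict = asdict(spec)
    outputs = list(outputs)
    return {
        "model": model,
        "spec": spec_dict,
        "spec_hash": fingerprint(spec_dict),
        "outputs": outputs,
        "outputs_hash": fingerprint(outputs),
    }


def verify_run_record(record) -> bool:
    """Recompute both hashes and compare them with the published ones."""
    return (
        fingerprint(record["spec"]) == record["spec_hash"]
        and fingerprint(record["outputs"]) == record["outputs_hash"]
    )
```

Two runs with the same `spec_hash` were made under the same stated conditions, which is the minimum needed before "A is better than B" means anything.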
It's probably not going to be glorious work, but designing expert eval settings, and collecting and crunching the data for quality assurance and control, is going to be needed.
Therefore your knowledge is better used in training than in letting users get slightly better at the token casino. This is mentioned in the post as well: eval startup people end up either at frontier labs or at fine-tuning startups.
Personally, I agree about the Goodhart problem, but isn't the reason eval startups fail that they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains end up requiring a model, because they internalize the AI and sell it under the name of a 'harness.'
So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The real issues are semantic, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.
Aha.