Why eval startups fail (2025)

(thomasliao.com)

43 points · jxmorris12 1d ago · 38 comments

13 comments · 13 top-level

michaelbuckbee 2h ago

I built a simple (free) eval tool for my own uses (GitHub Gists + Model Outputs) after not being able to find a suitable one on the market.

The market's being split into:

1. Longitudinal LLM observability tooling

Most eval startups have gone the route of becoming something like an observability platform for LLM inference. They want to sit in your stack and run the inference so they can collect data on how it performs.

They collect things like how often a model returns JSON that's out of spec or returns unexpected values, as well as general timing and cost info.

2. Safety Limiting / Pentesting

Say you're doing something in the medical field or that's sensitive in some way and you want to figure out what model has the best outputs for your task that won't fly off the guardrails.

3. Simple cost + performance + quality swapping

This is what my tool does: it basically lets you test whether you _really_ need to be running that frontier model in a loop across a million records, or whether you'd be better off with an older model or something else.

https://evvl.ai/

Example eval: https://giyd8stidy.evvl.io
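
A rough sketch of the kind of comparison described in categories 1 and 3 above: given recorded calls per model, compute the out-of-spec JSON rate, average latency, and total cost. This is only an illustration and not how evvl or any specific platform works. The `Call` record, the `REQUIRED_KEYS` spec, and the model labels are all hypothetical.

```python
import json
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    model: str        # hypothetical model label
    output: str       # raw text the model returned
    latency_s: float  # wall-clock seconds for the call
    cost_usd: float   # billed cost of the call


# Hypothetical output spec: a JSON object that must contain these keys.
REQUIRED_KEYS = {"name", "age"}


def in_spec(output: str) -> bool:
    """Return True if output parses as a JSON object with every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def summarize(calls: list[Call]) -> dict[str, dict[str, float]]:
    """Group calls by model and compute per-model quality, latency, and cost."""
    by_model: dict[str, list[Call]] = {}
    for call in calls:
        by_model.setdefault(call.model, []).append(call)
    return {
        model: {
            "out_of_spec_rate": mean(
                0.0 if in_spec(c.output) else 1.0 for c in group
            ),
            "avg_latency_s": mean(c.latency_s for c in group),
            "total_cost_usd": sum(c.cost_usd for c in group),
        }
        for model, group in by_model.items()
    }


if __name__ == "__main__":
    sample = [
        Call("big-model", '{"name": "Ada", "age": 36}', 2.0, 0.02),
        Call("small-model", "Sure! Here is the JSON you asked for", 0.5, 0.001),
    ]
    for model, stats in summarize(sample).items():
        print(model, stats)
```

Whether the cheaper model is "good enough" then comes down to how much out-of-spec output your task can tolerate for the cost and latency savings.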

PashaGo 45m ago

Unfortunately, model quality is not the only criterion for users, and often not even the most important one. Adoption is also driven by marketing, UX, integrations, pricing, ecosystem, and a lot of other non-benchmark factors.

Also, model providers are not interested in having their models compared head-to-head under identical conditions. And “Model A is better than Model B” is almost meaningless by itself. Better for what task? With what prompt? What inputs? What budget? What failure tolerance?

It would be nice to have a place where users could run their own benchmarks, define evaluation criteria for their actual use cases, and make those runs verifiable by others.

theteapot 4h ago

What's an eval?

jampekka 3h ago

I think there's gonna be (or perhaps already is) a huge demand for evaling individual systems. Many countries are starting to adopt some criteria for LLM usage for public use, and I doubt govs are gonna develop in-house knowhow for this. This will likely take the form of some kind of "independent auditor" model, as the system provider has too strong a conflict of interest.

It's probably not gonna be exactly glorious work, but designing expert eval settings and collecting and crunching the data for quality assurance and control is going to be needed.

torginus 3h ago

Imo it's very simple - AI is a big function inverter. If you have a better cost function than frontier labs, as in, you are better at judging model output quality, then you can use that cost function to RL the next generation of models.

Therefore your knowledge is better used in training than in letting users be slightly better at the token casino. This is mentioned in the post as well: eval startup people either go to work at frontier labs or at finetune startups.

GL26 4h ago

The problem with evals is that the information doesn't change fast enough to make you want the latest model performance benchmarks. Bloomberg succeeded because it sells info that expires in the next hour.

PaulHoule 2h ago

Worked or tried to work for a few places that ended eval work in the 2010s for previous-gen systems. Most didn’t pay for it; thanks to all the ones that didn’t, I didn’t dare try selling it to the one that would have.

h1fra 2h ago

evals are glorified integration tests, would you invest in an integration test startup? absolutely not. I don't get why we are making all of this fuss around evals

jdw64 4h ago

If you look at the history of software engineering, the ones that made the most money were usually not the companies that built the applications themselves, but the ones that built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Personally, I agree with the Goodhart problem, but isn't the reason eval startups fail that they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'

So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The issue involves various semantic criticisms, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.

nilirl 2h ago

Maybe it's not that valuable? No snark, but how much confidence do these evals provide?

coldtea 3h ago

Because they operate on untrusted input

bitlad 4h ago

Everything eventually fails. Nothing is constant, not even evals.

wseqyrku 4h ago

> Not enough eval customers

Aha.
