Bullshit benchmark for LLMs | Better HN
(twitter.com)
1 point
gpvos
28d ago
1 comment
noemit
28d ago
The underlying data looks sparse. If there are only a few questions per "category" of bullshit, the benchmark can easily be gamed to favor one model over another.
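A rough way to see why few questions per category is a problem, assuming each question is an independent pass/fail trial (a simple binomial model, not anything taken from the benchmark itself). The 0.7 pass rate and the question counts below are made-up illustration values, and the function names are hypothetical.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate over n independent questions,
    each passed with probability p (binomial model; an assumption)."""
    return math.sqrt(p * (1.0 - p) / n)


def interval_95(p: float, n: int) -> tuple[float, float]:
    """Approximate 95% interval (normal approximation), clamped to [0, 1]."""
    half_width = 1.96 * accuracy_standard_error(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    true_rate = 0.7  # hypothetical per-category pass rate
    for n_questions in (5, 50, 500):  # hypothetical category sizes
        low, high = interval_95(true_rate, n_questions)
        print(f"n={n_questions:>3}: plausible observed score {low:.2f} to {high:.2f}")
```

With a handful of questions the plausible range covers most of the scale, so adding, dropping, or rewording one or two questions could reorder models within a category. The normal approximation is itself crude at such small n, which only strengthens the point.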