Bullshit benchmark for LLMs

(twitter.com)

1 point · gpvos · 28d ago · 1 comment

1 comment

noemit · 28d ago

The underlying data looks sparse. If there are only a few questions per "category" of bullshit, the benchmark can easily be gamed to favor one model over another.