We audited the LoCoMo benchmark (one of the most cited evals for LLM agent memory) and found 99 score-corrupting errors among its 1,540 questions (6.4%). Separately, we stress-tested the LLM judge with adversarially generated wrong answers: it accepted 62.81% of vague-but-topical wrong answers. Some published system scores barely clear that bar.
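To see why vague-but-topical wrong answers slip through, here is a toy sketch of the failure mode (not the actual LoCoMo judge, and the question/answer strings are invented for illustration): a judge that leans on topical overlap rather than factual agreement will accept an answer that mentions the right entities but gets the fact wrong.

```python
# Toy stand-in for an overlap-leaning judge (hypothetical; the real
# judge is an LLM, but the weakness it exhibits is analogous).

def naive_judge(question: str, gold: str, candidate: str,
                threshold: float = 0.3) -> bool:
    """Accept the candidate if enough of its tokens overlap the
    question/gold-answer topic, regardless of the actual fact."""
    topic = set(question.lower().split()) | set(gold.lower().split())
    cand = set(candidate.lower().split())
    overlap = len(cand & topic) / max(len(cand), 1)
    return overlap >= threshold

question = "When did Caroline adopt her dog?"   # invented example
gold = "Caroline adopted her dog in March 2022."
# Adversarial wrong answer: topical wording, wrong fact.
wrong = "Caroline adopted her dog sometime last summer."

print(naive_judge(question, gold, wrong))  # → True: the wrong answer passes
```

An off-topic answer ("The weather was nice.") is rejected, but any answer that paraphrases the question's entities clears the bar, which is exactly the pattern our adversarial probes exploited.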
Full audit with methodology, documentation of all 99 errors, and reproducible scripts.