famouswaffles · 2y ago · 0 points

It's as good as or better than the experts on 7/18 of those benchmarks. On an additional 4, it's close (within 0.05).

>By definition, it can't outperform the expert ensemble because that's where the gold labels come from.

The ensemble, no, but it can outperform an expert trying to solve it. But yes, the benchmarks are biased toward the experts here.

lisasays · 2y ago · 2 in thread

Thanks, I confess I skimmed and missed the individual 'expert' column (presumably the mean of the individual expert scores).

That said -- it looks like not only does model+ do worse than experts on the other 12/18 (not 11/18 by my counting), but when it does, it does so by a significantly wider margin (2x-3x on average). For example, the maximum model+ outperformance is on label 'Stealing' (0.11) while there are 6 labels for which the expert outperforms (by margins ranging from 0.12 to 0.29).

In other words: distinctly sub-par compared with the average expert. Which is probably why they didn't claim it as a result in the paper :)

famouswaffles (OP) · 2y ago

It's distinctly sub-par on 6 of 18 benchmarks while close or better on 12. That's why I said "mostly on par".

lisasays · 2y ago

Seems you got it reversed - it's the expert which performs sub-par on 6 rounds, while doing better on 12.

You're also missing the part about how, when it outperforms, it does so "by a significantly wider margin (2x-3x)". Which is why, no, it's not "mostly on par".

Just look at the data.
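
For anyone who wants to redo the tally being argued over here, the following is a minimal sketch of the counting, not anything taken from the paper. It assumes per-label scores for the model and for the individual expert, where higher is better, and uses the 0.05 "close" threshold mentioned at the top of the thread. The function names (`tally`, `mean_margin`), the label names, and the scores are all hypothetical placeholders; substitute the actual table values.

```python
from typing import Dict, List, Tuple

Bucket = List[Tuple[str, float]]


def tally(model: Dict[str, float], expert: Dict[str, float],
          tol: float = 0.05) -> Dict[str, Bucket]:
    """Sort labels into buckets by the signed gap (model minus expert)."""
    buckets: Dict[str, Bucket] = {"better_or_equal": [], "close": [], "worse": []}
    for label, m in model.items():
        # Round to sidestep floating-point noise right at the threshold.
        diff = round(m - expert[label], 4)
        if diff >= 0:
            buckets["better_or_equal"].append((label, diff))
        elif abs(diff) <= tol:
            buckets["close"].append((label, diff))
        else:
            buckets["worse"].append((label, diff))
    return buckets


def mean_margin(bucket: Bucket) -> float:
    """Average absolute gap within one bucket (0.0 if the bucket is empty)."""
    return sum(abs(d) for _, d in bucket) / len(bucket) if bucket else 0.0


# Placeholder scores only; these are NOT the paper's numbers.
model_scores = {"label_a": 0.70, "label_b": 0.58, "label_c": 0.40}
expert_scores = {"label_a": 0.60, "label_b": 0.60, "label_c": 0.65}

buckets = tally(model_scores, expert_scores)
for name, items in buckets.items():
    print(name, len(items), round(mean_margin(items), 4))
```

Comparing the bucket counts answers the "how many of 18" question, and comparing `mean_margin` for the `better_or_equal` and `worse` buckets answers the "by how wide a margin" question separately, which is the distinction the two posters disagree about.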
