Thanks, I confess I skimmed and missed the individual 'expert' column (presumably the mean of the individual expert scores).
That said -- it looks like not only does model+ do worse than experts on the other 12/18 (not 11/18 by my counting), but when it does, it does so by a much wider margin (2x-3x on average). For example, the maximum model+ outperformance is on label 'Stealing' (0.11), while there are 6 labels on which the expert outperforms by more than that (margins ranging from 0.12 to 0.29).
In other words: distinctly sub-par compared with the average expert. Which is probably why they didn't claim it as a result in the paper :)