And, if I'm reading their calculation right, that's 85% on the medium-difficulty bucket, not even the entire HumanEval benchmark?
Quoting from the GPT-4 paper:
>All but the 15 hardest HumanEval problems were split into 6 difficulty buckets based on the performance of smaller models. The results on the 3rd easiest bucket are shown in Figure 2
That does seem to support the idea that we're two or three major breakthroughs away from superintelligent AGI, assuming these scaling curves keep holding as they have.