We discuss this a bit in Section D.2 (HOW UNSEEN ARE THE HELD-OUT TASKS?). From our perspective,
a) The tasks we test on are very different from the ones we train on, particularly tasks like BIG-Bench, which we did not even have access to until several days ago (and which none of us had read).
b) GPT-3 directly sees similar versions of tasks like question answering or story completion simply as part of its training mixture, so the baseline for "unseen" is a bit complex.