Right, I worded that a bit lazily. There's no confidence score output from GPT-3, but if there were, and if the user could select to only get high-confidence results, then it would shut up quickly. That's what I meant by common sense. Of course it depends on the corpus. It's really, really just text, as you said. (It's possible that it can somehow eventually encode high-level things like arithmetic, but so far it seems that even if it does have that model embedded somewhere, it doesn't know how or when to use it.)
The language model (GPT-3) doesn't have to understand physics, it just has to help extract out some semantics of the paper.
There needs to be a classifier on top, trained on a labeled set of good and bad papers.
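A minimal sketch of what I mean, assuming the language model gives you some embedding per paper (the features below are random placeholders, not real GPT-3 output), with scikit-learn standing in for the classifier:

```python
# Sketch: a simple classifier on top of (hypothetical) paper embeddings,
# trained on labeled good/bad examples, with a high-confidence filter.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder embeddings: "good" papers cluster around +1, "bad" around -1.
good = rng.normal(loc=1.0, scale=0.5, size=(50, 16))
bad = rng.normal(loc=-1.0, scale=0.5, size=(50, 16))
X = np.vstack([good, bad])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)

# Only surface predictions above a confidence threshold, mirroring the
# "only show high-confidence results" filter mentioned above.
probs = clf.predict_proba(X)[:, 1]
confident_good = (probs > 0.9) & (y == 1)
print(confident_good.sum())
```

The point isn't the specific model, just that the hard part (understanding) is pushed into the features, while the good/bad decision is an ordinary supervised problem.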