https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/
It's really cool how the uncertainty prompts alter the confidence associated with the next words.
I guess I'm not disagreeing with you in the abstract: a strong enough AI could, in theory, identify bad papers, especially if it had some help with 'real' arithmetic. It could at least flag the most basic issues, like plagiarism or cited documents that don't contain the cited fact. Detecting claims that are themselves implausible seems like the hardest task possible, however. That's very close to general AI.