- the prevalence of "How many |r|'s are in the word 'strawberry'?"-esque questions that cause(d) LLMs to stumble
- context window issues
It would be naive to claim that there does not exist, or even that it would be difficult to construct/train, an interrogator that could reliably distinguish between an LLM and a human chat instance.
[0]: https://archive.computerhistory.org/projects/chess/related_m...
As for writing in general, slop score is still higher than a human baseline for all models[1], so all a human tester has to do is make the subject write a bunch and grade it; the interrogator is allowed to submit an arbitrarily long list of questions.
The Turing test could also be considered equivalent to "can humans come up with questions that break the AI?", and the answer to that is still yes, I'd say.