It's a little bit more complex than that.
My personal benchmark is to ask about myself. I was in a situation somewhat analogous to Musk v. Eberhard / Tarpenning: it's in the public record that I did something well known, but 99% of the marketing PR omits me and falsely names someone else.
I ask the analogue to "Who founded Tesla?" Then I can screen:
* Musk. [Fail]
* Eberhard / Tarpenning. [Success]
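A minimal sketch of that screening step, assuming plain name matching is good enough for a first pass. The function name `screen_answer` and the "unclear" bucket are illustrative assumptions, not part of any established benchmark; answers that name both parties, or neither, are left for manual review.

```python
import re
from typing import Iterable


def screen_answer(answer: str,
                  correct_names: Iterable[str],
                  decoy_names: Iterable[str]) -> str:
    """Classify one model answer as 'success', 'fail', or 'unclear'.

    Hypothetical helper: it only checks which surnames appear as whole
    words, so paraphrases or misspellings end up in 'unclear'.
    """
    words = set(re.findall(r"[a-z]+", answer.lower()))
    has_correct = any(name.lower() in words for name in correct_names)
    has_decoy = any(name.lower() in words for name in decoy_names)

    if has_correct and not has_decoy:
        return "success"
    if has_decoy and not has_correct:
        return "fail"
    # Assumption: mixed or empty answers need a human to read them.
    return "unclear"


correct = ["Eberhard", "Tarpenning"]
decoy = ["Musk"]

print(screen_answer("Elon Musk founded Tesla.", correct, decoy))
# fail
print(screen_answer("Martin Eberhard and Marc Tarpenning.", correct, decoy))
# success
```

Swap in your own question and names; the point is only that the pass/fail criterion is fixed before you look at the answer.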
A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction using, e.g., a git record. It could then notice the conspicuous absence of my name from any official literature, and figure out there was a fraud.
False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains.
Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful.
However, a breakpoint for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to assess whether the argument and methodology are sound, and whether the work is biased. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc.
We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago.
I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means that it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at.
If an LLM provider were to get every one of those right, I'd argue the problem would be solved.