You seem to disagree. Here's an interesting study where the researchers used an OpenAI-LLM-based tool to grade student papers and by grading them 10 times in a row, they got vastly different results:
Quote: "The results reveal significant shortcomings: The tool’s numerical grades and qualitative feedback are often random and do not improve even when its suggestions are incorporated."
A real user will be worse … but that’s kinda the point.
The most valuable thing you learn in usability/research is not if your experience works, but the way it’ll be misinterpreted, abused, and bent to do things it wasn’t designed to.