I think this closer to the crux of a major problem. Seemingly people have vastly different responses even for the same system/developer/user prompts, and I myself can feel a different in quality of the responses depending on when I use the hosted APIs, while hosted models always have consistent results.
For example, after 19:00 sometime (GMT+1), the response quality of both OpenAI and Anthropic (their hosted UIs) seems to drop off a cliff. If I try literally the same prompt the around 10:00 next morning, I get a lot better results.
I'm guessing there is so much personalization and other things going on, that two users will almost never have the same experience even with the same tools, models, endpoints and so on.