Not picking on you - this brings up something we could all get better at:
There should be a "First Rule of Critiquing Models": Define a baseline system to compare performance against. When in doubt, or for general critiques of models, compare against real-world random human performance.
Without a real, practical baseline to compare against, it's too easy to fall into subjective or unrealistic judgements.
"Second Rule": Avoid selectively biasing judgements by down selecting performance dimensions. For instance, don't ignore difference in response times, grammatical coherence, clarity of communication, and other qualitative and quantitative differences. Lack of comprehensive performance dimension coverage is like comparing runtimes of runners, without taking into account differences in terrain, length of race, altitude, temperature, etc.
It is very easy to critique. It is harder to critique in a way that sheds light.