One big misconception is that these models are trained to mimic humans and are therefore limited by the quality of the human training data. That is not true, and the fact that it is not true is basically the entire reason you see so much bullishness and what looks like premature adoption of agentic coding tools.
Coding agents use human traces as a starting point. You technically don’t have to do this at all, but that’s an academic point; practically, you can’t skip it (today). The early training stages with human traces (plus verified synthetic traces from your last model) get you to a point where RL is stable and efficient, and RL pushes you the rest of the way. What really powers this is synthetic data via rejection sampling: you generate a bunch of traces, figure out which ones pass verification, and keep those as training examples.
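To make the rejection-sampling step concrete, here is a minimal sketch in Python. The names here (generate_trace, passes, build_synthetic_dataset, samples_per_task) are hypothetical placeholders, not any real library’s API; the verifier stands in for whatever checks a real pipeline runs (unit tests, compilers, output checkers).

```python
# Minimal sketch of rejection sampling for synthetic training data.
# `model.generate_trace` and `verifier.passes` are hypothetical stand-ins
# for a real generation and verification pipeline.

def build_synthetic_dataset(model, verifier, tasks, samples_per_task=8):
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            trace = model.generate_trace(task)   # sample a candidate solution trace
            if verifier.passes(task, trace):     # keep only traces that verify
                kept.append((task, trace))
    return kept  # verified traces become training examples for the next model
```

The design point is simply that the filter, not the generator, sets the quality bar: the model can emit plenty of junk as long as verification reliably separates the traces you keep from the ones you throw away.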
So, because:
- we know how this works on a fundamental level and have for some time
- human training data is a bootstrap, not a fundamental limitation
- you are absolutely right about your observations, yet look at where you are today versus, say, Claude Sonnet 3.x. It’s an entire world away in about a year
- we have imperfect benchmarks, each with its own weaknesses, yet all of them telling the same compelling story. Plus you have adoption numbers and walled-garden data, which is the proof in the pudding
the onus is on the people who say “this is plateauing” or “this has some fundamental limitation that we will not get past fairly quickly”.