Over the past year, I haven't noticed any change in what I trust a model to generate in response to a single prompt. The failure modes are unchanged. Yes, specific failures have improved as they have been documented and passed into model training data, but the way the models fail has not changed. They still fail for me nearly every single day. I'm a pretty heavy user: 3-4 Claude Code processes running at a time, all day, every day.
What has gotten better is the tooling around the model, but there's no room for exponential growth there. At least, not without an exponential cost increase, which would make the whole thing untenable anyway.