The thing is, that's a capability which is likely to get added to AI harnesses very soon. Look at the difference between Bing Chat and ChatGPT. Now consider when we have not a 32k context length model, but a 1M context length model, or even 1G, and that model can search the web for information, act on it, compile and test the code, diagnose the errors by searching the web again, and so on. ChatGPT is the simplest possible use of the model, with nothing but user input -> model output. Tests have already been done with these models acting as agents, and they perform exceedingly well when appropriately instructed. It's not about right now. It's about a year from now, or maybe even just 6 months from now.
I understand the expectation, but I am definitely in the "I'll believe it when I see it" camp. Which is not to say that I am not impressed by what it can do now. I am. There is just no way I would trust it without first thoroughly examining the code it produced.