I mean, that's just confabulating the next token with extra steps... IME it does get those wrong sometimes. I imagine there's an extra internal step to validate the syntax there.
I'm not arguing for or against anything specifically. I just want to note that, in practice, I assume that to the LLM it's just a series of repeated prompts, each containing the entire convo: after it outputs special 'signifier' tokens, the LLM simply gets a new prompt that includes the results of the program that was executed in some environment. For all we know, various prompts were involved in setting up that environment too, but I suspect not.
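To make that concrete, here's a rough sketch of the loop I have in mind. Everything in it is made up for illustration: the `<run>` / `</run>` signifier tokens, the `[user]` / `[assistant]` / `[result]` labels, and `fake_llm`, which is a canned stand-in rather than any real model or vendor API. I have no idea how any actual product wires this up; this is just the shape of it as I imagine it.

```python
import contextlib
import io

# Hypothetical signifier tokens; real systems would use their own special tokens.
TOOL_OPEN, TOOL_CLOSE = "<run>", "</run>"


def fake_llm(prompt):
    """Stand-in for a model: a pure function from the full prompt to a completion."""
    if "[result]" not in prompt:
        # No program output in the prompt yet, so "decide" to run a program.
        return f"{TOOL_OPEN}print(6 * 7){TOOL_CLOSE}"
    # Program output is present: read it back out of the prompt and answer.
    result = prompt.rsplit("[result]", 1)[1].strip()
    return f"The answer is {result}."


def run_program(code):
    """Toy 'environment': execute the code and capture stdout. Not a sandbox."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()


def chat(user_message, max_turns=5):
    """Return the final answer plus every prompt the 'model' was shown."""
    transcript = [f"[user] {user_message}"]
    prompts_seen = []
    for _ in range(max_turns):
        prompt = "\n".join(transcript)  # the entire convo, re-sent every time
        prompts_seen.append(prompt)
        completion = fake_llm(prompt)
        transcript.append(f"[assistant] {completion}")
        if TOOL_OPEN in completion and TOOL_CLOSE in completion:
            # Signifier tokens seen: run the program, append its output,
            # and go around again with a longer prompt.
            code = completion.split(TOOL_OPEN, 1)[1].split(TOOL_CLOSE, 1)[0]
            transcript.append(f"[result] {run_program(code)}")
            continue
        return completion, prompts_seen
    return "(gave up)", prompts_seen


answer, prompts = chat("What is 6 times 7?")
print(answer)
print(len(prompts), "prompts were sent")
```

The point is that the "model" never does anything except map a prompt to a completion. The second prompt is just the first one with the program's output appended, so from its side the result simply shows up in the text.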