I think you and I have different definitions of “one-shotting”. If the model has to be steered, I don’t consider that a one-shot.
And based on your prompt log, you clearly “broke” the model a few times, where it was unable to solve the problem it was given with the spec.
Honestly, your experience in these repos matches my daily experience with these models almost exactly.
I want to see good/interesting work where the model is going off and doing its thing for multiple hours without supervision.