Even Claude opus 4.6 is pretty willing to start tearing apart my tests or special-case test values if it doesn't find a solution quickly (and in c++/rust land a good proportion of its "patience" seems to be taken up just getting things that compile)
I’ve found that GPT-5.2 is shockingly good at producing code that compiles, despite also being shockingly good at not even trying to compiling it and instead asking me whether I want it to compile the code.