What they fail at is code with high cyclomatic complexity. Back in the llama 2 finetune days I wrote a script that would break down what each node in the control flow graph into its own prompt using literate programming and the results were amazing for the time. Using the same prompts I'd get correct code in every language I tried.