In my applications, GPT-4 connected to a VM or SQL engine can and does debug code when given error messages. Whether it does so "reliably" is very subjective. The main problem I have seen is that it can be stubborn about using outdated APIs, and it's not easy to feed it a search result showing the correct API. But with a good web search and up-to-date APIs, it can do it.
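For concreteness, here is a rough sketch of the kind of loop I mean: run the code, capture the error message, hand both back to the model, and retry. This is only an illustration under assumptions, not my actual setup. `ask_model` is a hypothetical callable standing in for whatever sends the source and error text to GPT-4 and returns revised source; `run_python` and `debug_loop` are names made up for this example.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional, Tuple


def run_python(source: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run source in a fresh interpreter; return (succeeded, stderr text)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"
    finally:
        os.unlink(path)  # clean up the temporary script either way
    return proc.returncode == 0, proc.stderr


def debug_loop(
    source: str,
    ask_model: Callable[[str, str], str],  # hypothetical: (source, error) -> new source
    max_rounds: int = 3,
) -> Optional[str]:
    """Retry until the code runs cleanly or the model has had max_rounds tries."""
    for attempt in range(max_rounds + 1):
        ok, error = run_python(source)
        if ok:
            return source
        if attempt < max_rounds:
            # Feed the failing code and its error message back to the model.
            source = ask_model(source, error)
    return None  # still failing after all rounds
```

A call to an outdated API usually fails with an ImportError or AttributeError on stderr, so the model does see it. The hard part is getting the correct API into what `ask_model` sends, which this sketch does not attempt.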
I'm interested to see general coding benchmarks for Code Llama versus GPT-4.