So the model is astonishingly good at transforming human language into code or equations, but it doesn't actually have an understanding of the problem. That's why specialised code-generation systems such as AlphaCode generate a huge number of candidate solutions (up to millions per problem) and run them against test cases, including generated ones, to filter out the duds. ChatGPT doesn't do that.
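To make the generate-and-filter idea concrete, here's a toy sketch. It is not how any real system is implemented; the three `candidate_*` functions are made-up stand-ins for model-sampled programs, and `TEST_CASES` is an invented test set for "add two numbers".

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical candidates, standing in for programs sampled from a model.
def candidate_a(x: int, y: int) -> int:
    return x + y

def candidate_b(x: int, y: int) -> int:
    return x * y  # passes (2, 2) -> 4 by coincidence, fails the rest

def candidate_c(x: int, y: int) -> int:
    return x - y

TestCase = Tuple[Tuple[int, int], int]

def filter_candidates(
    candidates: Sequence[Callable[[int, int], int]],
    test_cases: Sequence[TestCase],
) -> List[Callable[[int, int], int]]:
    """Keep only the candidates that pass every test case."""
    survivors = []
    for fn in candidates:
        try:
            if all(fn(*args) == expected for args, expected in test_cases):
                survivors.append(fn)
        except Exception:
            # A crashing candidate is simply a dud.
            continue
    return survivors

# Invented test cases for the task "add two numbers".
TEST_CASES: List[TestCase] = [((2, 2), 4), ((3, 5), 8), ((0, 7), 7)]

survivors = filter_candidates([candidate_a, candidate_b, candidate_c], TEST_CASES)
print([fn.__name__ for fn in survivors])  # ['candidate_a']
```

The point is that correctness comes from running the code against tests, not from the generator understanding the problem.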
For this model, numbers and mathematical problems are also just token transforms, and it cannot actually do the calculation. The transform from text to equations works well, but the actual calculations fall flat.
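One way to play to that strength, sketched below under some assumptions: let the model do only the text-to-expression step and hand the arithmetic to ordinary code. The `model_output` string is a made-up example of what a model might return, and `evaluate` is a hypothetical helper, not part of any library. It only accepts plain numbers and `+ - * /`.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def evaluate(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)

# Hypothetical expression a model might produce from a word problem.
model_output = "1234 * 5678 + 91"
print(evaluate(model_output))  # 7006743
```

The calculation itself is then exact, whatever the model would have guessed for the digits.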
It's actually quite amusing and horrifying at the same time: the model can explain to you in great detail how arithmetic works, but it will fail miserably at even simple calculations. The horrifying part is that humans have a tendency both to anthropomorphise things (hence the whole sentience debate) and to blindly trust machine-generated results.
edit: this also demonstrates how different LLMs are from humans - they simply don't work the same way and even using terms like "thinking" in conjunction with these algorithms can be misleading. Maybe we need new terminology when talking about what these systems do.