>While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).
Why would they say it is capable of reasoning and then not allow it to reason in the experiment?