According to a 2025 Stanford HAI report, large language models fail basic multi-step arithmetic up to 40% of the time without external tools.
https://medium.com/@dojolabs.main/why-does-ai-get-math-wrong...
You may know this somehow, but I don't. Without a fundamental redesign, the basic problem will remain.
I don't believe it is possible to apply statistics to predict answers without significant errors.