First of all, multiplying two numbers together is not "shifting the goal posts", but an absolutely basic test of any system that is claimed to be able to do mathematical reasoning. I
know that LLMs are not well suited for this, and that's because
they cannot do arithmetic (among other things).
So I tried a word puzzle that would also require simple multiplication:
------------------------------
PROMPT: i am going to cycle 1600 miles, with 234 miles on gravel roads. on paved roads i will ride at 1929288282 millimeters per second but on gravel I will ride at 0.00000000202 parsecs per second. How long will the journey take?
-------------------------------
Now, I have to commend GPT on its ability to understand how you solve a problem like this, though that's not really very surprising given the huge number of such problems that exist in written material. It broke the problem down precisely, in a way that I suppose you could call "reasoning", but I would call "copying the formula for solving puzzles like this".
And how did it do with the actual math?
----------------
0.00000000202 parsecs per second is equivalent to 7499.6103827 miles per hour (mph), which we can calculate by converting parsecs to miles (1 parsec = 3.26 light-years = 19,173,511,840,000 miles) and dividing by the number of seconds in an hour:
0.00000000202 pc/s × 19,173,511,840,000 mi/pc ÷ 3600 s/hr = 7499.6103827 mph
----------
Utterly and completely wrong. Same issue with the millimeters per second computation.
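For anyone who wants to check the numbers, here is a minimal sketch of the arithmetic in Python. The function name journey_seconds is just something made up for illustration, and the constants assume the international mile (1609.344 m) and the IAU definition of the parsec (648000/pi astronomical units). Note that GPT's conversion factor for parsecs to miles is roughly right. The failure is in the arithmetic itself: going from per-second to per-hour means multiplying by 3600, not dividing, and even the expression it wrote down evaluates to about 10.76, not 7499.61.

```python
import math

# Unit definitions (assumptions: international mile, IAU parsec).
MILE_M = 1609.344                      # metres per mile (exact)
AU_M = 149_597_870_700                 # metres per astronomical unit (exact)
PARSEC_M = 648_000 / math.pi * AU_M    # metres per parsec, about 3.0857e16


def journey_seconds(total_mi, gravel_mi, paved_mm_per_s, gravel_pc_per_s):
    """Total travel time in seconds for the cycling puzzle."""
    # Convert both speeds to metres per second.
    paved_m_per_s = paved_mm_per_s / 1000
    gravel_m_per_s = gravel_pc_per_s * PARSEC_M
    # Convert both distances to metres.
    paved_m = (total_mi - gravel_mi) * MILE_M
    gravel_m = gravel_mi * MILE_M
    # time = distance / speed, summed over the two surfaces.
    return paved_m / paved_m_per_s + gravel_m / gravel_m_per_s


# Gravel speed in miles per hour: multiply by 3600 s/hr, do not divide.
gravel_mph = 0.00000000202 * PARSEC_M / MILE_M * 3600
total_s = journey_seconds(1600, 234, 1929288282, 0.00000000202)

print(f"gravel speed: {gravel_mph:.4g} mph")
print(f"total journey time: {total_s:.3f} s")
```

Under those assumptions the gravel speed comes out at roughly 1.4 x 10^8 mph and the whole trip takes a little over one second, which is nowhere near anything in GPT's answer.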
It is completely obvious why LLMs cannot do this. They cannot perform even basic arithmetic reasoning, and even more fundamentally, the ONLY capability they have is to create likely responses to prompts. For some things, this is extraordinarily (and scarily) powerful. But it is not reasoning.