This is a table stakes feature for even the open/free models today.
It doesn't use up any "context" to remember that calculators exist.
With the partial exception of Python code, models know how to solve problems with code and even 1.5B models do amazing things if you prompt them with something like "Use the Z3 solver"
Running an LLM and parsing and computing mathematical expressions are entirely disjoint operations. You need highly specialized code for each; it makes just as much sense to put a calculator in your LLM as it does to stuff a Python interpreter in a calculator. Could you? Of course, software is infinitely flexible. Does it make sense to do it? No, it makes more sense to connect two different specialized applications than to try shoehorning one into the other.
There is going to be some level of hallucination error in the translation to the agent or code. If it is a complex problem, those errors will compound.
It could also propose to the user that it compute the answer using code. It doesn't do that either.
> Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments. We use them simply because we have understanding for the capability they provide.
So the real question behind the headline is why LLMs don't learn to ask for a calculator by themselves, if both the definition of a calculator and the fact that LLMs are bad at math are part of the training data.
We often use the colloquial definition of training to mean something to the effect of taking an input, attempting an output, and being told whether that output was right or wrong. LLMs extend that to taking a character or syllable token as input, doing some computation, predicting the next token(s), and seeing if that was right or wrong. I'd expect the training data to have enough content to memorize single-digit multiplication, but I'd expect it to also learn that this model doesn't work for multiplying an 11 digit number by a 14 digit number.
The "use a calculator" concept and "look it up in a table" concepts were taught to the LLM too late and it didn't internalize that as a way to perform better.
I don't think that's even true though. If you think this, I would suggest you've just internalized your training on the subject.
They can. They're sometimes a bit cocky about their maths abilities, but this really isn't hard to test or show.
https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...
They can also create tools that can be useful.
Of course the model will use the calculator you've explicitly informed it of. The article is meant to be a critique of claims that LLMs are "intelligent," when, despite knowing their math limitations, they don't generally answer "You'd be better off punching this into a calculator" when asked a problem.
https://githubnext.com/projects/gpt4-with-calc/
and another example
https://www.pinecone.io/learn/series/langchain/langchain-too...
LLMs often do a good job at mathy coding, for instance I told Copilot that "i want a python function that computes the collatz sequence for a given starting n and returns it as a list"
def collatz_sequence(n):
    sequence = [n]
    while n != 1:
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3 * n + 1
        sequence.append(n)
    return sequence
which gives right answers, which I wouldn't count on Copilot being able to do on its own. It also wrote
sum(x**2 for x in numbers if x % 3 == 0)
and, to do the same for a pandas series with pandas operators after asking it to inline something,
(numbers[numbers % 3 == 0] ** 2).sum()
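For what it's worth, here is a minimal sketch of the kind of sanity check I'd run on snippets like these before trusting them. The expected values were worked out by hand; the name sum_of_squares_div3 is just my own wrapper for illustration, not something Copilot produced.

```python
def collatz_sequence(n):
    """Return the Collatz sequence starting at n and ending at 1."""
    sequence = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        sequence.append(n)
    return sequence


def sum_of_squares_div3(numbers):
    """Hypothetical wrapper around the generated one-liner."""
    return sum(x**2 for x in numbers if x % 3 == 0)


# Hand-checked cases: 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
assert collatz_sequence(1) == [1]
assert collatz_sequence(6) == [6, 3, 10, 5, 16, 8, 4, 2, 1]
# 3**2 + 6**2 + 9**2 = 9 + 36 + 81 = 126
assert sum_of_squares_div3([1, 2, 3, 6, 9]) == 126
```

The pandas version can be checked the same way against the plain Python one on a small series.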
It's not a miracle, you have to go at it with some critical thinking and testing and it makes mistakes but so does Stack Overflow.
Yes, you can add tools. But hallucinations will still be there. Tools allow you to cut down on the steps the LLM has to perform. However, if you have a complex problem with many steps, there will be translation errors at some point coordinating the tools.
Furthermore, if there is some other tool needed to get the result you need, the LLM isn't going to tell you. It will typically make up the result.
Try doing coding with Cursor or Windsurf, those use tools all the time. Windsurf sometimes has trouble for me on Windows because it wants to write paths like
/c:/something/or/other
and it will try to run its tool, get an error, ask me for help, I'll tell it "you're running on Windows and you can't write / before the c: and you should write \ instead of /" and it does better. I just asked copilot to multiply
839162704321847925107309452196847230165937402194385627409536218 * 582930174682375093104627481695
and the first thing it did was write out the expression, I told it I wanted the integer and it gave the same answer Python gives which is 489173261817269091475894827953471001727389372345981246974410480760096492908180614917234529510
Agreed. Again, that is not the issue. It is that the LLM does not know it is a waste of time. That is apparent to you as you have intelligence. It is not apparent to the LLM. It is not intelligent.
Does the author really believe humans are born with an innate knowledge of calculators and their use?
That said, I was using simple +*-/ calculators as a small child and I don't think I needed to be taught anything other than MC/MR. The tool is intuitive if you are familiar with formal written arithmetic (of course hunter-gatherers couldn't make sense of it).
Author is simply being obtuse and presumably has some axe to grind or is just ignorant of how LLMs are trained. For example, LLMs don’t learn to chat from the data, they have to be instruct tuned to make that happen. Every LLM chatbot you’ve ever used had to have this extra training step. Further, this is the exact same training process that can also train for tool use.
Trying to say “this should just happen from the data” is silly, it isn’t how any of this works. It’s not how you learned things, and it’s not how LLMs-as-chatbots work.
> Trying to say “this should just happen from the data” is silly, it isn’t how any of this works. It’s not how you learned things, and it’s not how LLMs-as-chatbots work.
Yes, that was the entire point of the article. We are in agreement.
The post is honestly quite strange. "When LLMs try and do math themselves they often get it wrong" and "LLMs don't use tools" are two entirely different claims! The first claim is true, the second claim is false, and yet the article uses the truth of the first claim as evidence for the second! This does not hold up at all.
The solution so far has just been to throw more RL or carefully crafted synthetic data at it, but it's arguably more Pavlovian than it is generalized learning.
Someone could teach a dog to ring a bell that says "food" on it, and you could reasonably argue that it is using a tool. Will it then know to ring a bell that says "walk" when it wants to go outside?
https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...
I love the "I'm an expert, I ask simple questions and reveal profound truths" vibe.
1. LLMs "think" in terms of tokens, which usually are around 4 characters each. While humans only have to memorize 10x10 multiplication tables to perform multiplication of large numbers, LLMs have to memorize a 10000x10000 table, which is much more difficult.
2. LLMs can't "think in their head", so you have to make them spell out each step of the multiplication, just like (most) humans can't multiply huge numbers without intermediate steps.
A simple way to demonstrate this is to ask an LLM for the birth year of a celebrity and then whether that number is even or odd. The answer will be correct almost every time. But if you ask whether the birth year of a celebrity is even or odd and forbid spelling out the year, the accuracy will be barely above 50 %.
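To make point 1 concrete, here is a rough sketch of schoolbook multiplication that only ever looks up single-digit products in a 10x10 table, spelling out every intermediate carry as in point 2. The names TABLE and schoolbook_multiply are made up for illustration, and this says nothing about how any real tokenizer or model works internally.

```python
# The only "memorized" facts: products of single digits (a 10x10 table).
TABLE = {(a, b): a * b for a in range(10) for b in range(10)}


def schoolbook_multiply(x: str, y: str) -> str:
    """Multiply two non-negative decimal strings digit by digit."""
    xs = [int(d) for d in reversed(x)]  # least significant digit first
    ys = [int(d) for d in reversed(y)]
    result = [0] * (len(xs) + len(ys))
    for i, a in enumerate(xs):
        carry = 0
        for j, b in enumerate(ys):
            # One table lookup plus the running digit and carry.
            total = result[i + j] + TABLE[(a, b)] + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(ys)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


# 100 entries for digits; 4-digit chunks would need 10**4 * 10**4 = 10**8.
assert len(TABLE) == 100
assert schoolbook_multiply("12", "34") == "408"
```

With one digit per step the table stays tiny; the cost is that every intermediate step has to be written out somewhere.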
Is that bad? Idk. If you hoped that real AGI would eventually solve humanity's biggest problems and questions, perhaps so. But if you want something that really really looks like AGI except to some nerds who still say "well actually", then it's gonna be good enough for most. And certainly sufficient for ending up in the dystopia from that movie clip in the end.
I don't believe current LLMs are AGIs but this article's argument is a poor one.
LLMs are great at certain tasks. Databases are better at certain tasks. Calculators too. While we could continually throw more and more compute at the problem, growing layers and injecting more data, wouldn’t it make more sense to just have an LLM call its own back-end calculator agent? When I ask it for obscure information maybe it should just pull from its own internal encyclopedia database.
Let LLMs do what they do well, but let’s not forget the decades that brought us here. Even the smartest human still uses a calculator, so why doesn’t an AI? The fact that it writes its own JavaScript is flashy as hell but also completely unnecessary and error prone.
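As a rough idea of how small such a back-end calculator could be, here is a sketch that evaluates plain arithmetic by walking Python's ast instead of calling eval. The function name calculate and the set of supported operators are my own assumptions, not any vendor's actual tool.

```python
import ast
import operator

# Supported operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate +, -, *, / on numeric literals; reject anything else."""

    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


assert calculate("2 + 3 * 4") == 14
```

Because Python integers are arbitrary precision, the same function handles the 63-digit by 30-digit product from upthread exactly, with no JavaScript involved.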
Yes, that is a key point. It isn't to say they are useless tools, but that they aren't intelligent tools and that has significant meaning for what tasks we think they are appropriate for.
Unfortunately, nearly everyone has misinterpreted the intent as showing LLMs can't use tools. The point is about how LLMs work differently than most think that they do.
The [>_] links to the Python code that was run.
https://chatgpt.com/share/67b79516-9918-8010-897c-ba061a2984...
"Calculate the difference between the two biggest primes less than the factorial of 20"
it wrote the following code:
import sympy
# Calculate 20!
factorial_20 = sympy.factorial(20)
# Find the two largest primes less than 20!
largest_prime = sympy.prevprime(factorial_20)
second_largest_prime = sympy.prevprime(largest_prime)
# Calculate the difference
difference = largest_prime - second_largest_prime
difference
executed it, and produced the correct result, 40.
With reasoning models they could choose to do it slightly differently - just tell the model what tools are available, and how to invoke them, and let it decide when to use them in more flexible fashion, rather than rely on fine tuning to follow tool use instructions.
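A toy version of "tell the model what tools are available and how to invoke them" might look like the sketch below. Everything here is hypothetical: the TOOLS registry, the JSON call format, and the function names are assumptions for illustration, not how any particular provider implements tool use.

```python
import json

# Hypothetical registry: name -> (description shown to the model, callable).
TOOLS = {
    "multiply": ("Multiply two integers exactly.", lambda a, b: a * b),
    "add": ("Add two integers exactly.", lambda a, b: a + b),
}


def describe_tools() -> str:
    """Text that could be placed in the prompt to advertise the tools."""
    lines = [f"- {name}: {desc}" for name, (desc, _) in TOOLS.items()]
    return "Available tools (reply with JSON to call one):\n" + "\n".join(lines)


def handle_model_output(text: str) -> str:
    """Run a tool if the output is a JSON tool call, else pass the text through."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return text
    if not isinstance(call, dict):
        return text
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return text
    _, func = TOOLS[name]
    return str(func(*call.get("args", [])))


# The model decides: either answer in prose or emit a tool call.
assert handle_model_output('{"tool": "multiply", "args": [12, 34]}') == "408"
assert handle_model_output("The answer is 4.") == "The answer is 4."
```

The interesting part is not the dispatch code but whether the model chooses to emit the tool call on its own, which is what the thread is arguing about.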
My favorite framing: The LLM is just an ego-less extender of text documents. It is being iteratively run against a movie script, which is usually incomplete and ending in: "User Says X, and Bot responds with..."
Designers of these systems have--deliberately--tricked consumers into thinking they are talking to the LLM author, rather than supplying mad-libs dialogue for a User character that is in the same fictional room as a Bot character.
The Bot can only speak limitations which are story-appropriate for the character. It only says it's bad at math because lots of people have written lots of words saying the same thing. If you changed its name and description to Mathematician Dracula, it would have dialogue about how it's awesome at math but can't handle sunlight, crucifixes, and garlic.
This framing also explains how "prompt injection" and "hallucinations" are not exceptional, but standard core behavior.