I asked GPT-NeoX-20B a hundred arithmetic questions (opens in new tab)

(twitter.com)

175 pointsnandhinianand4y ago216 comments

216 comments

133 comments · 36 top-level

tehsauce4y ago· 13 in thread

The latest version of gpt-3 spooked me the other day with:

Me: Can here you here still here understand here what here I here am here asking here if here I here put here words here in here between?

Gpt: Yes, I can understand what you are asking if you put words in between.

Me: What was the word that I put in between?

Gpt: The word that you put in between is "here."

schleck84y ago

I tried this with the new open source GPT NeoX 20B model on GooseAI playground and it suceeded too:

Q: Can here you here still here understand here what here I here am here asking here if here I here put here words here in here between?

A: Yes.

Q: What was the word that I put in between?

A: The word was "here."

It's capable of other languages as well, although not quite as fluent.

Q: Kannst du auch Deutsch sprechen? Falls ja, was ist dein Lieblingswort?

A: Ja, ich kann Deutsch sprechen. Mein Lieblingswort ist: "Wir sind die Weltmeister!"

throwanem4y ago

This makes me wonder about a variant of the Turing test that I don't think I've seen explored. How would a conversational AI prove beyond reasonable doubt that it is an AI?

3 more replies

iggldiggl4y ago

> It's capable of other languages as well, although not quite as fluent.

Haven't had a chance to play around with this one yet, but with the smaller GPT-J model, there's a clearly noticeable difference:

In English it'll happily generate reams of text that are – at least internally – quite coherent. Any absurdity and humour mostly only comes in because the text as a whole might only have a loose connection with reality as we know it.

In German on the other hand, it comparatively much more often produces gibberish at the individual sentence level, makes up nonsense words (although they are at least indeed German-sounding), etc. Somewhat interestingly it doesn't do too bad in terms of grammar and especially orthography, though, it just often fails to turn it into a fully coherent sentence.

pwdisswordfish04y ago

Is that supposed to be a reference to “Wir Sind Die Roboter”? Feels awkward in German, usually we would say “Wir sind Weltmeister”. Also, it’s not a word, but if it were a human in casual conversation, it wouldn’t be a weird reply. Spooky stuff…

syntaxing4y ago

Did you try this on GooseAI? I was not able to replicate this

Edit: oops just noticed you mentioned GooseAI, what settings did you use?

1 more reply

guantanamo_bob4y ago

Must :clap: have :clap: learned :clap: that :clap: from :clap: all :clap: the :clap: clapping :clap: that :clap: goes :clap: on :clap:.

moffkalast4y ago

But does GPT-3 know what based is based on? Or why kids love the taste of Cinnamon Toast Crunch?

marcodiego4y ago

Scary. If it improves a bit more, people will start questioning if the machine has soul or rights.

gitfan864y ago

My kids already debate if it is wrong to tell OK Google to shutup.

1 more reply

supermdguy4y ago

It's interesting, in the forums for the beta program there have been already been a few people making posts where they're convinced that the AI is conscious. That's never really been something I've thought about much since I know a little about how it works, but I could totally see how someone who didn't have as much context for how GPT-3 works could see it as some sort of sentience.

https://community.openai.com/t/a-conversation-with-alec-a-co... https://community.openai.com/t/creepy-ai-behavior/10195 https://community.openai.com/t/where-to-watch-what-the-ai-wa...

3 more replies

jquery4y ago

Imho, not a matter of "if" but "when". I'm convinced that it will be a future civil rights battle, with young people largely on the "AI has rights" side and old people largely on the "AI has no rights" side.

1 more reply

andreyk4y ago

Might actually not be a huge feat given the transformer architecture and it's attention component - here was not that related to the rest of the sentence. Still, quite cool...

spiderfarmer4y ago

That's shockingly good.

moyix4y ago· 12 in thread

Hey! As the author of the gist, just wanted to clear up what seem to be a few misconceptions:

- This isn't GPT-3, it's the recently-released open-source and open-weights model from EleutherAI, GPT-NeoX-20B. GPT-3 is much larger (175 billion parameters vs NeoX's 20 billion).

- It's well-known that language models don't tend to be good at math by default (Gwern, among others, pointed this out back in June 2020). It seems likely that this is at least in part because of how these models currently tokenize their input (they don't represent numbers by their individual digits, but by tokens representing commonly-occurring character sequences): https://www.gwern.net/GPT-3#bpes . Someone also pointed me to this paper which looks at number representations (though it uses somewhat older models like BERT): https://arxiv.org/abs/1909.07940

- Despite the tokenization, it performs (IMO) surprisingly well at getting close to the true value, particularly for the start and end digits and the overall magnitude. You can see this by looking at the tokenization (indicated by brackets) of its guess vs the correct answer for 28531*8065 (I asked multiple times to get an idea of how consistent it is – it's not deterministic because I ran this with temperature = 0.1, which will use random sampling to get the most likely tokens):

  [What][ is][ 285][31][ *][ 80][65][?][\n][22][77][05][315]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][95][01][115]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][38][95][015]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][99][25][015]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][99][17][115]
                              Correct: [\n][23][010][25][15]

You can see that it manages to find things that are numerically close, even when no individual token is actually correct. And it compensates for different-length tokens, always picking tokens that end up with the correct total number of digits.

- Please don't use this as a calculator :) The goal in doing this was to figure out what it knows about arithmetic and see if I can understand what algorithms it might have invented for doing arithmetic, not to show that it's good or bad at math (we have calculators for that, they work fine).

shmageggy4y ago

The "algorithm" isn't too mysterious, especially in light of your observation that it does better at the beginning and end digits. It's just doing what transformers do: predicting the probability of a token given the tokens it can attend to. Assume 20B parameters is enough to memorize an addition table. Then the first digit or two is relatively predictable, as are the last, and as is the length, aka the probability of a space token. The middle tokens are less predictable. This is consistent with the result.

Furthermore, it doesn't even really need to memorize the addition table in the explicit way this suggests. Think about the probability of certain digit tokens appearing given the presence of numbers and plus signs in its data. Thus a behavior consistent with having memorized an addition table emerges from mimicking its training data.

moyix4y ago

It's a little bit more complex here because tokens are variable-length. So getting the order of magnitude (i.e. number of digits) correct requires that it be able to pick tokens for the beginning and end that have the right start/end digit, and then figure out how to make the middle the right length.

And sure, of course it emerged from mimicking (or more precisely, learning to predict the most likely next token in) its training data – that's how it was trained, it can't have emerged from anything else :) But that doesn't tell us what the higher-level algorithm represented by the weights of the network is. I'm talking about things like this for understanding an algorithm for curve detection learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/

infogulch4y ago

Would it do better if you asked it to "show its work"? I.e. work it out in long form, one step at a time, like you'd ask a school kid to do. Maybe an example prompt would look like this:

    Work out 2241 + 19873.
    02241 + 19873 ~ ____4
    02241 + 19873 ~ ___14 carry 1
    02241 + 19873 ~ __114 carry 1
    02241 + 19873 ~ _2114 carry 1
    02241 + 19873 = 22114.

I'm not sure what is the best way to represent each step including details like carry digits. And you'd have to design a separate scheme for each operation.

If these models are symbol manipulators maybe the key is to break down the task into steps that are closer to being solvable with symbol manipulation.

Veedrac4y ago

I tried something like that for 3-digit multiplication with GPT-3 in another comment[1], successfully. You have to lay things out different manner than you did here, because GPT-*s have no sense of layout on a page; their byte-pair encoding destroys their ability to learn it efficiently. Further, transformers are optimized to look for things via similarity, because that's what typically occurs in text, so you're better off writing out things it can anchor off of.

There are ways to fix these issues, but BPEs micro-optimize for the primary text benchmarks that papers want good scores on so those are standard for now. I'm sure they'll get replaced eventually, once the costs outrun the wins and more scalable (alternatives to?) transformers become popular.

[1] https://news.ycombinator.com/item?id=30309302

mlb_hn4y ago

I get the tokenization argument and it may influence it a bit, but I suspect the n-digit math issue has to do more with search the way it samples (in the bpe link gwern references some experiements I'd done with improving n-digit math by chunking using commas, http://gptprompts.wikidot.com/logic:math). I think since it samples left to right on the first pass, it's not able to predict well if things carry from right to left.

I think can mitigate the search issue a bit if you have the prompt double-check itself after the fact (e.g. https://towardsdatascience.com/1-1-3-wait-no-1-1-2-how-to-ha...). Works different depending on the size of the model tho.

moyix4y ago

Yup, quite possible that this has something to do with it. There is other work showing that giving LMs a "scratchpad" for intermediate computations allows them to do much better not just at arithmetic but also things like predicting the output of some code: https://arxiv.org/abs/2112.00114

1 more reply

ravi-delia4y ago

I feel like attention would largely mitigate that, no? Has anyone looked at what the weights are while doing addition?

YeGoblynQueenne4y ago

How can a language model invent algorithms for arithmetic? How would an algorithm be represented in a language model? Isn't that the first thing to ask, before starting to look for algorithms?

For example, if I take a stroll on the beach, am I likely to see any algorithms coalescing in the grains of sand?

moyix4y ago

Invent is probably the wrong word since it implies agency, sure. Maybe "discover" or "luck upon", since whatever it's doing was formed by updating a pile of floating point weights with gradient descent?

I think it certainly makes sense to ask what the higher level "algorithm" at work here is, though. Electrons flow through wires and transistors in (say) an adder [1]; looking at the wires and transistors you won't see an algorithm for addition, but there is certainly one present, codified in the arrangement of those wires and transistors. But maybe we can reverse engineer whatever the LM is doing by a combination of probing it with experiments like these and (maybe) inspecting the learned weights. The Curve Circuits paper did this for reverse engineering a curve detector learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/

I also don't mean to imply that it's a good algorithm, or one that generalizes to arbitrary numbers, etc. Maybe it's just (effectively) a lookup table and some special cases!

[1] Please don't yell at me for this metaphor, I bailed out of physics after scraping out a B- in E&M ;)

1 more reply

kordlessagain4y ago

What I did was train GPT-3 that I was asking a math question and then have it run some JavaScript to do the math with the text it thought was math formulas to get the answer.

moyix4y ago

Yeah it's interesting how moving up a level of abstraction works well here! "Write me a function that multiplies two numbers" works much better than trying to get it to multiply the numbers themselves. There's a recent-ish paper exploring this:

https://arxiv.org/abs/2108.07732

a-dub4y ago

i'd first ask it all single digit arithmetic, then two digit with and without carry and then go from there. longer strings are going to be confusing unless you're looking for pieces of them in the training data, methinks.

i suspect you could probably train a GAN to do binary or base 10 arithmetic, but have never tried or searched for papers.

spupe4y ago· 12 in thread

Thank you for this. Technically it's not GPT-3, but GPT-NeoX-20B, although they are based on a similar architecture.

The poor performance is most likely due to not having a large database of math problems to draw from. Github, for example, is part of the dataset that is used to train both GPT-3 and GPT-Neo variants, which is partly why they can generate meaningful code (sometimes). I wonder how a model finetuned for math would perform.

dang4y ago

Ok, we've reverted the title now. Thanks!

(Submitted title was 'GPT-3's answers to arithmetic questions')

williamtrask4y ago

Poor performance is more likely due to how transformer neural networks view numbers. It memorises them like words instead of modeling their numerical structure. Thus even if it’s seen the number 3456 and 3458, it knows nothing of 3457. Totally different embedding.

It’s like a kid memorising a multiplication table instead of learning the more general principle of multiplication (related: this illusion is why big models are so popular. Memorise more stuff.)

Paper (NeurIPS/DeepMind): https://arxiv.org/abs/1808.00508

Isinlor4y ago

Take a look at this paper:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

If you look at embedding visualization it is very clear that the model learns order of numbers.

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

There is also:

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Again, looking at visualizations the model very clearly grasps the structure of the function it models.

2 more replies

nicholast4y ago

The cool thing about math applications is just how easy it would be to generate synthetic data. That these large language models haven't attempted to supplement their gigabytes+ scale data sets with such is an oversight.

1 more reply

plutonorm4y ago

It's recently been shown that even though the numbers are represented with different tokens, the network learns to form an internal representation that understands the progression from one token to the next.

1 more reply

eutectic4y ago

That depends on the tokenization scheme.

moffkalast4y ago

> this illusion is why big models are so popular. Memorise more stuff

It's all just a compressed lookup table that can handle in-betweens.

spupe4y ago

I went and checked, it turns out for this version Eleuther-AI has in fact included math problems [1]. So my earlier comment is partly incorrect.

[1] http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf

asah4y ago

And isn't it trivial to generate lots of correct sample data ? :-)

throwaway4good4y ago

No. The poor performance comes from the overall approach of using neural nets to solve basic math problems.

FL4104y ago

The cool part comes when the model can make the connection that

multiply 12345 by 87654

is the same as

def multiply_two_numbers(x, y):

return x * y

Which of course produces the desired result. The interesting part is that github copilot wrote the above with only the prompt "def multiply_two" as the prompt.

andreyk4y ago

Oh, that's a pretty big difference, would be nice if post title was altered...

d--b4y ago· 11 in thread

What? You think this is poor performance?

This totally blows my mind. I would never have guessed that GPT could get ANY of these right.

I mean, is there a data point in the dataset used to train where you can read 2241 + 19873 = 22114? Quite unlikely...

And those multiplications. It's consistently getting the number of digits right and the first two numbers correct. How the hell does this happen?

Sure, it's sometimes way off. But generally it is in the right ballpark.

I certainly think people should look into what's happening inside the model.

ggrrhh_ta4y ago

Regardless of the base, be it text or encoded numbers, text, as encoded is just a different base of representing a number('325' is '3' * (2^32) + '2' * (2^16) + '5' * 2^8 = 51 * 2^32 + 50 * 2^16 + 53 * 2^8). Neural networks can approximate polynomials, and additions/substractions & multiplications can be approximated - in this case, the base is just not '10' but 'ASCII'. I think that if you tried to train it only with arithmetic expressions (in text) it should get even better, it should just approximate the underlying operation - it does not need to understand the text.

rawoke0836004y ago

Maybe this is an example of where you need an "extra specialized skill"(arithmetic) vs the general and semi-ambiguous-skill of language+conversation.

GPT-3 is "good with conversation (language)"

GPT-3 now needs a "sub-nn-model" to do the very 'specialized skill called math'

*GPT-3 Should 'learn' to recognize which questions should be delicate to a submodel.

machiaweliczny4y ago

I think this is idea of Google Pathways (Multitude of Expert model). I mean it already works like that in every model but I think they train it differently to have it more separated.

vbuterin4y ago

What fascinates me most is that the errors are very "human-like". If you gave me multi-digit multiplication and addition problems like that, I would frequently have similar results of getting most digits right but making a mistake on one or a few of them.

Barrin924y ago

>Sure, it's sometimes way off. But generally it is in the right ballpark.

which is worse than being completely off. it just showcases how the model works, by treating mathematics like language. There are lots of examples in the dataset so similar sounding inputs produce similar sounding outputs.

This is akin to sitting in a foreign language lecture where you don't understand a single word being spoken and you try to answer questions by making similar sounding noises. While you may give an answer that sounds better than random in reality you haven't learned anything.

If these models understood mathematical laws what they would produce is arithmetic errors, like giving an answer with a wrong sign, not jumbling numbers.

charcircuit4y ago

>I mean, is there a data point in the dataset used to train where you can read 2241 + 19873 = 22114? Quite unlikely...

But there might be something like xxx1 + xxxx3 = xxxx4 in the dataset so it can learn the pattern.

qumpis4y ago

That's the astonishing bit

1 more reply

amelius4y ago

They should have asked questions like:

What is twothousandfortyone plus nineteenthousandeighthundredseventythree?

gradys4y ago

You might like the paper "Do NLP Models Know Numbers? Probing Numeracy in Embeddings": https://arxiv.org/abs/1909.07940

Neural network models seem to encode an approximate notion of quantity in their representations. This paper is pre-GPT-3, but I would think the larger training set and larger model capacity would help the model learn quantity more easily.

AndrewOMartin4y ago

It's unlikely that "2241 + 19873 = 22114" specifically is in the dataset, but very likely that there are many expressions equivalent to that expression in the dataset, and we've just picked one of those.

Imagine someone watching every lottery draw and after each draw going "Wow! the chances of those exact numbers coming up in that order are atronomical!"

dlkf4y ago

> there are many expressions equivalent to that expression in the dataset

What do you mean by this?

1 more reply

sillysaurusx4y ago· 9 in thread

If you want to play with the model, you can (with difficulty) for free at https://goose.ai/playground. You have to log in, but thankfully you can via google.

The playground crashes every minute, and the defaults ruin your outputs (temperature 1, really? 0.7 to 0.8 is a necessity, with top-k 40), and they turned off autocorrect on mobile, presumably because they hate you and your family for owning an iPad, but you can indeed play with it.

The outputs feel pretty magical, too. With the settings above, it started printing... an IRC conversation? https://gist.github.com/shawwn/9a201990196b61cd21847487185dd... This is impressive, because I'm not sure we explicitly included any IRC logs in The Pile.

re: the current title "GPT-3's answers to arithmetic questions": We've come full circle. I used to give Eleuther a hard time for confusing people. But now that people confuse themselves, they should declare victory. It's as close to success as an open source effort could hope for. And with only years of work -- not too shabby.

You can join them: https://www.eleuther.ai/faq/

GPT-NeoX-20B paper: http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf

jcims4y ago

For what it's worth you can get access to the OpenAI playground. You might have to put yourself on a waiting list.

https://beta.openai.com/playground

I tried a couple different things to pick it apart. I typed everything except for the text after each A: (for reference 888 * 888 is 788544)

    I am a math savant, I can do arithmetic in my head. 
    I will provide an answer, then show how I arrived at 
    the answer.

    Q: What is 8+8?
    A: 16
    
    Q: How did you arrive at that answer?
    A: 8+8=16. 16 is two groups of 8.
    
    Q: What is 888+888?
    A: 1776
    
    Q: How did you arrive at that answer?
    A: 888+888=1776. 1776 is two groups of 888.
    
    Q: What is 888*2?
    A: 1776
    
    Q: How did you arrive at that answer?
    A: 888*2=1776. 1776 is two groups of 888.
    
    Q: What is 888*10?
    A: 8800
    
    Q: How did you arrive at that answer?
    A: 888*10=8800. 8800 is two groups of 888*10.
    
    Q: What is 888*888?
    A: 775648
    
    Q: How did you arrive at that answer?
    A: 888*888=775648. 775648 is two groups of 888*888.
    
    Q: What is 888 groups of 888?
    A: 775648

ceejayoz4y ago

> How did you arrive at that answer?

This sort of question - "you don't remember what I just asked you?" - is one of the most glaring failings in Apple's Siri and I hope this leads to it being solved soon.

2 more replies

lucb1e4y ago

> You have to log in, but thankfully you can via google.

... thankfully you can also not login via Google.

Anyway it doesn't work even slightly. When I simply try 4+4= it prints some source code afterwards. A second attempt wrote some printf() call which computed 6+1 instead (but not giving the result, just the code). When I try what someone else shared below with a Q: and A: syntax I get this:

Q: What is 8 + 8? \n A: \n The answer looks confusing. \n How do you work out 8 + 8. What is the rule?

Q: What is 8+8? \n A:

I got the answer as 12, so it is 1+2.

(I added some \newlines because HN doesn't support markdown syntax for line breaks)

On the right are some parameters, including the model name. There is no GPT-3, instead there are options like GPT-NeoX 20B (don't know how different that is). Maybe that's the problem, or it needs a different ...temperature?

sillysaurusx4y ago

Well yes, if you do things in slightly different ways, you’ll get hugely different results. It’s a bit like saying you added bacteria to a controlled experiment on sanitization.

Try with temp 0.1, with exactly the text in the OP, including capitalization.

lucb1e4y ago

Seeing this thread was renamed: the thread was called GPT-3 before and GPT-NeoX-20B was added much later. It was not clear to me that OP was using this same model.

rapiz4y ago

Thanks for that. I've played around a little bit.

> What is 123456789 - 123456789?

> 123456788

> What is 123456789 * 0?

> 123456789

Not even near.

It didn't surprise me that the model failed to handle cases above, which are unlikely to present in the data set.

sillysaurusx4y ago

Try temp 0.1 top-k 40. For math, it matters to have an unthinkably low temperature. It’s what generated the results in the OP.

What is 12345 - 12345?

What is the distance between -0.1 and -0.01?

0.09

One neat example just now:

What is 12345 divided by 12?

4115/4

1 more reply

systemcluster4y ago

Thank you for the feedback! We're working on improving the playground and I hope those issues will be a thing of the past soon.

sillysaurusx4y ago

Love the inference speed. Best demo anywhere in that regard. Nice work.

nikolayasdf1234y ago· 9 in thread

This just shows that this model did not learn anything.

Humans do not see billions of examples to add numbers. We see just few and can apply learned notation and procedures to infinity with 100% precision.

GPT-3 learned mathematical intuition. Humans can hardly learn multiplication table over months and repetitions of same examples, and that table hardly matters at all. GPT-3 is just plainly wrong objective they trying to optimise.

dlkf4y ago

I'll preface this by saying that I am 100% in the camp that thinks these language models are neither intelligent nor a promising avenue towards understanding intelligence.

But your conclusion here is entirely wrong: the model clearly is learning something. From eyeballing this, the model is right about 10% of the time. If it were spitting out random digits the accuracy would effectively be zero. So exactly what is it learning? Is it memorising exactly equations that it saw in training? Is it learning ngram patterns that occur frequently in arithmetic equations?

I'm not an expert on these things and I'd love to hear from someone who is.

moffkalast4y ago

I think fundamentally these models compress the learning data into network weights and connections, so in effect if the learning data was 6 + 10 = 16 and 9 + 10 = 19, then you give it 7 + 10 it'll interpolate between what it's seen or something of the sort, giving you something approximately right. It's also not lossless compression so what it may have actually inside is 9 + 10 = 18 so yeah.

0xTJ4y ago

I think you're completely wrong. This shows that the model learned a lot about at-a-glance math. Sure if you sit down with pen and paper you can get the answer, but few people could do these reliably in their head. But what you can do is figure the order of magnitude, and get a rough answer for the first few digits and last digits, each with their chance of being wrong. If anything, this shows that it learned math deeper than any normal computer calculator.

jonathankoren4y ago

No. A million times no. It’s a language model. It doesn’t understand math at all. It doesn’t even understand language. All it did was spit out something that looks like math. It’s fancy automatic writing.

I’ll concede that if you tokenized the equations correctly, you might be able to get a language model to learn arithmetic, since it’s just symbol manipulation; but to make the leap that a general text model has learned anything like arithmetic is more than two bridges too far.

While deep learning language models are useful for certain cases (eg translation, and autocomplete), and are better at making superficially grammatical text than previous models; they are most emphatic my not learning anything about general concepts. They can’t even create coherent text for more than a paragraph, and even then it’s obvious they have no idea what any of the words actually mean.

These large language models are the MOST overhyped piece of AI I’ve seen in my professional career. The fact that they’re neural nets redux is just the chef’s kiss.

1 more reply

wccrawford4y ago

The vast majority of humans don't just see a few examples and figure it out. They're taught an algorithm. Eventually they may also come up with another algorithm, but they're taught one first.

They also don't have "100% precision". Many, many humans are incredibly bad at math, and even the ones that are good at it often make mistakes.

simion3144y ago

>They also don't have "100% precision". Many, many humans are incredibly bad at math,

Many humans are bad at surgery this does not mean that an AI that is slightly better then the average human is an accomplishment.

On the other hand someone could write the algorithms for math and teach an AI when and how to use it. The rules of math are clear you don't need a bad search algorithm to approximate them for a extremely limited subset of inputs.

varjag4y ago

GPT3 surely has several algorithms for addition in its training corpus. Just unable to make good use of them.

arghwhat4y ago

I think you'd find that most people doing large number math in their head is also off by a few percent like this model.

Sure, with pen and paper we can follow specific algorithms manually to very slowly get a precise result. If we wanted a computer to merely follow instructions, then I suspect that there are better ways...

jonathankoren4y ago

You’re really lowering the bar for success here. It’s now unreasonable for a computer to correctly add two numbers together? Give me a break. It wasn’t even reasonable for a Pentium chip to incorrectly divide two numbers back in 1994.

1 more reply

mordymoop4y ago· 8 in thread

When you toss “2241 + 19873 =” into an applet that shows you the default tokenization scheme GPT-3 uses, you get this:

(224)(1)( +)( 198)(73)( =)

I’ve heard it remarked before that, while tokenization is obviously an unavoidable part of a model with an architecture like GPT, this is a very silly way of tokenizing number strings for the purposes of learning or doing arithmetic. Indeed, I think a lot of GPT-3’s puzzling edge-case performance can be ascribed to weird and unhelpful tokenizations. Just imagine if you were forced to learn arithmetic with a brain that automatically categorized “224” as a sort of distinct object, or, for that matter, breaking down 19873 as ( 198)(73) rather than (19873) or (1)(9)(8)(7)(3) or anything practically useful.

The thing is that we can, in a sense, learn better “tokenizations”, in the sense that a 4 year old learning to read sees letters, while a 40 year old reading a novel “sees” whole words or even groups of words. The GPT architecture can’t change its tokenization scheme.

vbuterin4y ago

When I do mental arithmetic my brain frequently tokenizes into digit pairs or triples if I can recognize pairs and triples that have specific properties.

"224" is actually a really nice object to recognize because it's 7 * 32, and if you can recognize other multiples of 32 it frequently gives you shortcuts. It's less useful for addition because you would need to get lucky and get a multiple of 32 (or 7) on both sides, but for multiplication and division it helps a lot.

mordymoop4y ago

Sure - I think we all learn tricks like that. But you learned that pattern of tokenization, it wasn't arbitrarily foisted on you.

What GPTs have to deal with is more like, you are fed an arithmetic problem via colored slips of paper, and you just have to remember that this particular shade of chartreuse means "224", which you happen to have memorized equals 7 * 32, etc., but then the next slip of paper is off-white which means "1", and now you have to mentally shift everything ...

1 more reply

dr_zoidberg4y ago

I think that part of why the tokenization is a proble for math here is that it doesn't seem to be carrying overflow into the left token. Anyway, I haven't worked with GPT in detail to do a deeper analysis than that hunch, so take my comment with a couple of salt grains.

mannykannot4y ago

maybe this is a clue to which ones it succeeds on, and how it goes wrong when it does not.

jcims4y ago

Whoa, that explains why only .5% of the examples have an incorrect last digit.

starfallg4y ago

It seems that we need another layer to tokenize according to context. I can see that breaking up a long number into 3 or 4 digits is the correct behaviour if we are dealing with phone numbers, but it'd be completely wrong if it's nearly anything else.

baalimago4y ago

Many words have several semantic definitions depending on definition. This is why the word "is" is a very good token to have in a vocabulary (as an example), since it can mean so much depending on what tokens came before and after it.

Numbers have very limited semantic value. "123816" only means that number, and it's used very rarely in comparison to basically any other word (and the higher the number, the less chance of use, statistically peaking).

So the question becomes; to what extent do you expand the vocabulary using only numbers? "1", "2", "3", ... "1000000" would probably be a huge waste of words in an AI vocabulary (1MB input nodes), yet still not very impressive arithmetically even with 100% calculationrate. In comparison, a hand calculator from 30 years ago could do this with ease. It's not a question of being able to cleverly tokenize.

Calculations like this is an inherent flaw of vocabulary based AI until the semantic meaning of number sequences are somehow taught to it. Basically it needs to understand that "12" and "1" + "2" has the same contextular meaning, something which very rarely is explained in anything but 7 year old's schoolbooks. The problem is the dataset.

MauranKilom4y ago

Gwern noted that adding things like thousands separators and $ signs to the input makes GPT significantly better at math.

nandhinianandOP4y ago· 3 in thread

And here's the twitter thread with some plots about this data ( https://twitter.com/moyix/status/1491803929801150471 )

dang4y ago

I've changed the URL to that from https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a.... Thanks!

pvg4y ago

You should have posted that, since it's the original source. Without that context, and with the small mistake you made in the title, most of the commenters here ended up talking about something that this actually isn't.

nandhinianandOP4y ago

Sorry about that I was a bit not at my best. And haven't checked this thread till now.

samuelfekete4y ago· 3 in thread

A really intelligent AI would just use a calculator.

visarga4y ago

A combo of LM + REPL would be the winning ticket. The language model generates a piece of code, executes it, observe results. Is the answer found? No, maybe it cannot be solved by one single iteration, continue until solution found or max number of iterations reached. Single step responses are limited by their nature, and computation is hard to do "in your head"/"in your LM".

The REPL could also come with libraries for all the important algorithms. Thus the LM could use numerical solvers and other nice toys.

Another excellent module we could augment LM's with is a search engine - already done, proven to improve the efficiency of the LM by 25x (see link). Another one would be a physics simulator.

https://deepmind.com/research/publications/2021/improving-la...

We're just starting with language models. They will improve tremendously.

dumdumdumdum4y ago

But a really really intelligent AI would try to run them through a neural net and be amazed when it got most of them approximately correct.

can16358p4y ago

I'd really want to see in a super complicated AI, given many questions, devising a way to compile a basic calculator program that obeys the simple math rules (e.g only addition and multiplication) for itself inline and calculate the thrown questions using that instead of training math data.

rexreed4y ago· 3 in thread

Amazing how a very costly to train system using billions of neural nodes on millions of dollars of compute performs more poorly than an 8-bit 1970s pocket calculator.

Not sure why people are expecting some sort of "intelligence" to emerge from a text generator model trained on Internet corpus data. GPT-3 doesn't calculate, it pattern matches.

I do get why people might be surprised, on the other hand, that it actually doesn't perform worse than indicated here. Maybe it's surprising upside. But since we know that the GPT is a transformer model, what it is doing is applying a probabilistic best-fit. From this perspective I can see how it is best-fitting data in ways that can provide these sorts of results, especially given all that training data.

Samin1004y ago

According to Kahneman, when a chess pro makes an “intuitive” and unexplainable move it’s just pattern recognition happening subconsciously.

If language models like GPT-3 are “just” pattern recognizers, wouldn’t that makes them capable of intuition?

1 more reply

andreyk4y ago

What's your definition of 'intelligence'? Many of the things GPT-3 does clearly exhibit intelligence (just not human level intelligence).

1 more reply

habitue4y ago

> what it is doing is applying a probabilistic best-fit

I think you're underselling probabalistic best-fits. Especially with all of the regularization going on in training.

Robin_Message4y ago· 2 in thread

It seems to me that carries are where this trips up. Which is weirdly human. I wonder if there are enough examples to learn each digit pair addition or subtraction, but not enough to learn every contextual action.

wildmanx4y ago

Not really "human". Doing no-carry addition is much easier for a machine to do as well, as that's basically what XOR does, i.e., SIMD. Carry introduces dependencies between the digits, potentially as long as the whole string goes. So that's pretty hard to understand, also for a machine.

mannykannot4y ago

At first I thought you were saying that doing arithmetic by carrying is not really a human trait, but on reflection, I think you are saying that carrying methods are inherently mistake-prone, regardless of who or what is using them.

I feel it would be a very big deal if GPT-3 (or this variant) was carrying, even if imperfectly, but other comments here seem to be suggesting that, on account of the way all input is tokenized, consistently doing arithmetic by carrying would simply be outside of the set of transformations it could perform (though some results that look like it might arise by chance.)

dash24y ago· 2 in thread

Arithmetic seems like an example where it would help to learn from the real world, not just from text. I learnt to add up by watching my teacher manipulate plastic Lego-style blocks. Put 3 blocks with 2 blocks, and you have 1, 2, 3, 4, 5.

londons_explore4y ago

But somewhere in that massive corpus of text will be a description just like you've just given...

tsimionescu4y ago

Sure, but GPT-3 doesn't attach semantics to text, it just learns how to produce text patterns that are similar to text patterns it has seen before.

1 more reply

wildmanx4y ago· 2 in thread

I'd like to see doing this with random people on the street and then compare performance. You may be surprised.

NateEag4y ago

Not really.

Average person on the street is going to correctly say "geez, I dunno. Can I use my phone?"

If you don't forbid them to, then they'll whip it out and get 98% correct (I figure they'll typo a few).

This model didn't have enough understanding to do that (since it literally has no understanding at all).

wildmanx4y ago

I don't know where you live, but 98% correct is not what would happen around here.

Edit: Oh, 98% _with_ a calculator. What if you force them to do it by hand?

1 more reply

marcodiego4y ago· 2 in thread

I fear the day AI will give superhuman consistent correct answers and nobody will be able to determine why it is right or how the correct answer was found.

Maybe someday we'll get an answer from a machine which superhumanly mostly correct and we'll be unable to tell if it is right or wrong. If it is a question whose answer will influence important decisions, considering the machine answer will be close to a form of religion.

jjoonathan4y ago

Like religion, I suspect you will have many different machine answers to choose from.

olyjohn4y ago

I suspect we will have 2 of them. We will start out with lots of them, but then a couple of them will start making the most money, and resort to underhanded tactics, bribery, and lobbying, and put the others out of business.

1 more reply

nlh4y ago· 1 in thread

I discovered something like this in real-world usage. I’ve have GitHub Copilot running in VSCode and I’ve been experimenting with how it works when doing plain text accounting (using ledger / hledger).

The ledger files are somewhat “code”-like so it’s been super interesting to see how it works. The short answer: it works really quite well! ..except for the math part :)

I have a long ledger of transactions, and I can now give Copilot a comment like: “Jan 1, 2022 +100 from consulting income” and it (GPT-3) will generate a nearly perfect ledger entry, debiting from income and crediting the right bank account.

But the arithmetic is always wrong (ledger has an option for you to keep a running balance as a check). There’s the occasional moment where it gets the balance adjustment correct, but almost every time the results are similar to this post.

supermdguy4y ago

I also have Copilot running, and I was surprised when it had pretty good autosuggestions when writing proofs in Latex! A lot of times it has subtle logical errors in the proof, but the syntax is always correct. And there have been a few times when it gives a sentence or two that's exactly right

jcims4y ago· 1 in thread

These language models feel, to me, like the unfiltered self. If someone asked me what 838+1283 was my head would instantly offer up some number 2301 or something. But i would discard that number because I learned in elementary school that I don't come up with good values, I need to execute a process in order to get the right value.

I imported the csv version and I'm no statistician but 90% percentile relative error is 8.6%, which is something like this:

What is 22730 - 24978? -2448 (real answer -2248)

That's totally within range of something that would plop into my head...with one exception. Of 1000 entries, only five have an incorrect last digit. I think that's meaningful...it almost tells me that there's a multi-stage operation happening in there somewhere.

visarga4y ago

> These language models feel, to me, like the unfiltered self.

A generator-critic framework with multiple rounds of iteration would improve on the limitations of the LM.

Veedrac4y ago· 1 in thread

I fiddled with this prompt

    Calculate 721*436

    Step A: Multiplication table.
    721*0=0 721*1=721 721*2=1,442 721*3=2,163
    721*4=2,884 721*5=3,605 721*6=4,326
    721*7=5,047 721*8=5,768 721*9=6,489
    Step B: Multiply digits
    436=4 hundreds + 3 tens + 6 ones
    721*4 hundreds=2,884*100=288,400
    721*3 tens=2,163*10=21,630
    721*6 ones=4,326*1=4,326
    Step C: Add the first two
    288,400+21,630=310,030
    Then add the last one
    310,030+4,326=314356

To get GPT-3's answer to this one correct (this is its answer, but it took some trial and error with the prompt to get it to stop making small mistakes)

    Calculate 312*481

    Step A: Multiplication table.
    312*0=0 312*1=312 312*2=624 312*3=936
    312*4=1,248 312*5=1,560 312*6=1,872
    312*7=2,184 312*8=2,496 312*9=2,808
    Step B: Multiply digits
    481=4 hundreds + 8 tens + 1 ones
    312*4 hundreds=1,248*100=124,800
    312*8 tens=2,496*10=24,960
    312*1 ones=312*1=312
    Step C: Add the first two
    124,800+24,960=149,760
    Then add the last one
    149,760+312=150,072

I then tested it with two random 3-digit numbers

    Calculate 669*306

    Step A: Multiplication table.
    669*0=0 669*1=669 669*2=1,338 669*3=2,007
    669*4=2,676 669*5=3,345 669*6=4,014
    669*7=4,683 669*8=5,352 669*9=6,021
    Step B: Multiply digits
    306=3 hundreds + 0 tens + 6 ones
    669*3 hundreds=2,007*100=200,700
    669*0 tens=0*10=0
    669*6 ones=4,014*1=4,014
    Step C: Add the first two
    200,700+4,014=204,714
    Then add the last one
    204,714+306=205,020

204,714 is correct, but then it gets confused in the very last line because of the zero.

Anyway this, I think, proves that GPT-3 can do zero-shot math if it doesn't have to do it in its head all at once. Is it good for a calculator? No. Is it good for a language model trained on web text? Yes, absolutely.

E: Correcting the start of Step C in that last example to “200,700+0”, and replacing “XYZ=X hundreds + Y tens + Z ones” with “XYZ=X Y Z=X hundreds + Y tens + Z ones” allowed it to do 145*585, 961*761 and 592*555 correctly in a row, all randomly chosen, and at least the last two tried without changes to the prompt. I consider this an adequate test, and it demonstrates GPT-3's algorithm following abilities. As GPT-3 is still a tiny model, this seems important to note.

E2: To be clear this is still nowhere near 100% successful. GPT-3 still makes a lot of errors. I ran 100 tries of a slightly different prompt through the API, and got a success rate of 42%.

infogulch4y ago

> can do zero-shot math if it doesn't have to do it in its head all at once

Very interesting! This is what I would expect. It can run a symbolic algorithm fine, just give it some scratch space to work out the intermediate results. I feel like there's a very large space to optimize the layout "algorithm" -- like how you adjusted step c -- to produce reliable results.

baalimago4y ago· 1 in thread

I'll be impressed when the AI consults with an ordinary calculator for the correct answer

edouard-harris4y ago

This already exists: Google's recently-published LaMDA dialogue model [1] is trained to consult a calculator for arithmetic questions and consistently succeeds at it.

[1] https://arxiv.org/abs/2201.08239v2

2bitlobster4y ago· 1 in thread

Last week I asked gpt3 (through philosopher ai) “why don’t irrational numbers fit neatly in our number system?”

And I got back this bad answer:

<<<

Would you tell me what an irrational number is? Is it a number that cannot be written as a fraction, like pi or e? Or is it something else?

Okay, I see. Thanks for clarifying that.

Then what is an irrational number? It is a number that cannot be written as a fraction, correct?

Okay, I understand. Then what is a number that can be written as a fraction? It must be a rational number, right?

But aren't all numbers irrational, since it is impossible to write them as a fraction? And any rational number must be irrational, right?

>>>

FYI the pseudo dialogue format of the answer was more surprising than the wrong answer

jrockway4y ago

Yeah sometimes when you're doing a Q&A with it it starts generating questions, and then you become the thing answering. It's fun.

dvh4y ago· 1 in thread

Motorola: How. Much. Is. One. Plus. One?

Pentium: 3

Motorola: That. Is. Not. Correct.

Pentium: no, but it's fast!

moffkalast4y ago

Quik mafs

dang4y ago

Questions and results here: https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a....

We changed the above URL from that to the link which gives the background, but both are worth a look.

paraschopra4y ago

I tried putting numbers as words and it did additions perfectly. Pretty magical!

What is fifty plus ninety? 140

What is fifty plus ninety one? 141

What is fifty minus ninety one? -41

What is minus fifty minus ninety one? -141

Although it failed in multiplication or adding longer numbers (as words).

FL33TW00D4y ago

Can we get a title edit? GPT3 is over 8x the parameter count.

mlb_hn4y ago

Couple things there where you can see if it improves with the prompt/formatting. E.g. with Davinci (and J a bit but didn't test too much) you can get bette results by:

  - Using few-shot examples of similar length to the targets (e.g. 10 digit math, use 10 digit few shots)

  - Chunking numbers with commas

  - Having it double check itself

and here it's not doing any of those things.

ur-whale4y ago

Yup, glorified, extremely large, lookup table, but has zero clues of how the world works.

Or to put it another way: if I haven't seen the result before I'm very likely to get it wrong.

And inferring the rules of arithmetic from examples : nope.

We're all very impressed with the large language models because human language was always very hard to get a handle one.

Now that we've made progress in manipulating natural language, we are very quick to believe that the darn thing is somehow "intelligent" because we've always attached some sort o mystical quality to language.

This demonstrates that the capacity of these systems to reason and craft high order rules from enough examples is exactly nil.

xkapastel4y ago

Can someone change the title? This is GPT-NeoX-20B, not GPT-3.

rapiz4y ago

How large is the data set they're training on? I suspect there are many math equations including these or similar numbers. As the result shows, the model is generally right about the first a few digits, but frequently wrong about the last digit. This may due to the fact that the data set can hardly cover the exact numbers in the questions, but it's likely to cover the first a few digits.

can16358p4y ago

Hmmm. Some quick observation: on especially large 9-ish digit numbers, when it make very few digits correct, the correct ones are mostly including the very first and very last digits. Something remarkably similar to how us humans remember numbers and words: when we make mistake we generally remember first and last digits/letters but mess up the middle.

andreyk4y ago

Fun fact, the original GPT-3 paper has a whole section on arithmetic as part of its evaluation - https://arxiv.org/abs/2005.14165

Btw as per another comment this is GPT-NeoX-20B,not GPT-3 ; somewhat important distinction

bobuk4y ago

And now we need to do a contest and compare how humans will answer on this questions, and whose answer is closer to the truth. Take my bet - humans will lose or will be on the same level of guessing.

tehsauce4y ago

Title says gpt-3, but gist says gpt-neo-20b which was recently open sourced but only 1/10 the size of gpt3.

musingsole4y ago

The page lists the average percent error, but I was interested in the median percent error: 1.20%

marcodiego4y ago

I'm still impressed that for the first 1000 tests it gets close most of the time.

svilen_dobrev4y ago

how about hex? 0xAA + 0x23 ? or, if it has "seen" low-level driver code, 0xAA & 0x23 ?

a-dub4y ago

is there a training data browser with a search engine somewhere for gpt-3?

elcapitan4y ago

Maybe we're already past singularity and the AI is simply pretending to be bad at this in order to avoid making humans feel insecure. /s

1 more reply

j / k navigate · click thread line to collapse

216 comments

133 comments · 36 top-level

tehsauce4y ago· 13 in thread

The latest version of gpt-3 spooked me the other day with:

Me: Can here you here still here understand here what here I here am here asking here if here I here put here words here in here between?

Gpt: Yes, I can understand what you are asking if you put words in between.

Me: What was the word that I put in between?

Gpt: The word that you put in between is "here."

schleck84y ago

I tried this with the new open source GPT NeoX 20B model on GooseAI playground and it suceeded too:

Q: Can here you here still here understand here what here I here am here asking here if here I here put here words here in here between?

A: Yes.

Q: What was the word that I put in between?

A: The word was "here."

It's capable of other languages as well, although not quite as fluent.

Q: Kannst du auch Deutsch sprechen? Falls ja, was ist dein Lieblingswort?

A: Ja, ich kann Deutsch sprechen. Mein Lieblingswort ist: "Wir sind die Weltmeister!"

throwanem4y ago

This makes me wonder about a variant of the Turing test that I don't think I've seen explored. How would a conversational AI prove beyond reasonable doubt that it is an AI?

3 more replies

iggldiggl4y ago

> It's capable of other languages as well, although not quite as fluent.

Haven't had a chance to play around with this one yet, but with the smaller GPT-J model, there's a clearly noticeable difference:

pwdisswordfish04y ago

syntaxing4y ago

Did you try this on GooseAI? I was not able to replicate this

Edit: oops just noticed you mentioned GooseAI, what settings did you use?

1 more reply

guantanamo_bob4y ago

Must :clap: have :clap: learned :clap: that :clap: from :clap: all :clap: the :clap: clapping :clap: that :clap: goes :clap: on :clap:.

moffkalast4y ago

But does GPT-3 know what based is based on? Or why kids love the taste of Cinnamon Toast Crunch?

marcodiego4y ago

Scary. If it improves a bit more, people will start questioning if the machine has soul or rights.

gitfan864y ago

My kids already debate if it is wrong to tell OK Google to shutup.

1 more reply

supermdguy4y ago

https://community.openai.com/t/a-conversation-with-alec-a-co... https://community.openai.com/t/creepy-ai-behavior/10195 https://community.openai.com/t/where-to-watch-what-the-ai-wa...

3 more replies

jquery4y ago

1 more reply

andreyk4y ago

Might actually not be a huge feat given the transformer architecture and it's attention component - here was not that related to the rest of the sentence. Still, quite cool...

spiderfarmer4y ago

That's shockingly good.

moyix4y ago· 12 in thread

Hey! As the author of the gist, just wanted to clear up what seem to be a few misconceptions:

- This isn't GPT-3, it's the recently-released open-source and open-weights model from EleutherAI, GPT-NeoX-20B. GPT-3 is much larger (175 billion parameters vs NeoX's 20 billion).

  [What][ is][ 285][31][ *][ 80][65][?][\n][22][77][05][315]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][95][01][115]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][38][95][015]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][99][25][015]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][99][17][115]
                              Correct: [\n][23][010][25][15]

shmageggy4y ago

moyix4y ago

infogulch4y ago

Would it do better if you asked it to "show its work"? I.e. work it out in long form, one step at a time, like you'd ask a school kid to do. Maybe an example prompt would look like this:

    Work out 2241 + 19873.
    02241 + 19873 ~ ____4
    02241 + 19873 ~ ___14 carry 1
    02241 + 19873 ~ __114 carry 1
    02241 + 19873 ~ _2114 carry 1
    02241 + 19873 = 22114.

I'm not sure what is the best way to represent each step including details like carry digits. And you'd have to design a separate scheme for each operation.

If these models are symbol manipulators maybe the key is to break down the task into steps that are closer to being solvable with symbol manipulation.

Veedrac4y ago

[1] https://news.ycombinator.com/item?id=30309302

mlb_hn4y ago

moyix4y ago

1 more reply

ravi-delia4y ago

I feel like attention would largely mitigate that, no? Has anyone looked at what the weights are while doing addition?

YeGoblynQueenne4y ago

How can a language model invent algorithms for arithmetic? How would an algorithm be represented in a language model? Isn't that the first thing to ask, before starting to look for algorithms?

For example, if I take a stroll on the beach, am I likely to see any algorithms coalescing in the grains of sand?

moyix4y ago

I also don't mean to imply that it's a good algorithm, or one that generalizes to arbitrary numbers, etc. Maybe it's just (effectively) a lookup table and some special cases!

[1] Please don't yell at me for this metaphor, I bailed out of physics after scraping out a B- in E&M ;)

1 more reply

kordlessagain4y ago

What I did was train GPT-3 that I was asking a math question and then have it run some JavaScript to do the math with the text it thought was math formulas to get the answer.

moyix4y ago

https://arxiv.org/abs/2108.07732

a-dub4y ago

i suspect you could probably train a GAN to do binary or base 10 arithmetic, but have never tried or searched for papers.

spupe4y ago· 12 in thread

Thank you for this. Technically it's not GPT-3, but GPT-NeoX-20B, although they are based on a similar architecture.

dang4y ago

Ok, we've reverted the title now. Thanks!

(Submitted title was 'GPT-3's answers to arithmetic questions')

williamtrask4y ago

It’s like a kid memorising a multiplication table instead of learning the more general principle of multiplication (related: this illusion is why big models are so popular. Memorise more stuff.)

Paper (NeurIPS/DeepMind): https://arxiv.org/abs/1808.00508

Isinlor4y ago

Take a look at this paper:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

If you look at embedding visualization it is very clear that the model learns order of numbers.

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

There is also:

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Again, looking at visualizations the model very clearly grasps the structure of the function it models.

2 more replies

nicholast4y ago

1 more reply

plutonorm4y ago

1 more reply

eutectic4y ago

That depends on the tokenization scheme.

moffkalast4y ago

> this illusion is why big models are so popular. Memorise more stuff

It's all just a compressed lookup table that can handle in-betweens.

spupe4y ago

I went and checked, it turns out for this version Eleuther-AI has in fact included math problems [1]. So my earlier comment is partly incorrect.

[1] http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf

asah4y ago

And isn't it trivial to generate lots of correct sample data ? :-)

throwaway4good4y ago

No. The poor performance comes from the overall approach of using neural nets to solve basic math problems.

FL4104y ago

The cool part comes when the model can make the connection that

multiply 12345 by 87654

is the same as

def multiply_two_numbers(x, y):

return x * y

Which of course produces the desired result. The interesting part is that github copilot wrote the above with only the prompt "def multiply_two" as the prompt.

andreyk4y ago

Oh, that's a pretty big difference, would be nice if post title was altered...

d--b4y ago· 11 in thread

What? You think this is poor performance?

This totally blows my mind. I would never have guessed that GPT could get ANY of these right.

I mean, is there a data point in the dataset used to train where you can read 2241 + 19873 = 22114? Quite unlikely...

And those multiplications. It's consistently getting the number of digits right and the first two numbers correct. How the hell does this happen?

Sure, it's sometimes way off. But generally it is in the right ballpark.

I certainly think people should look into what's happening inside the model.

ggrrhh_ta4y ago

rawoke0836004y ago

Maybe this is an example of where you need an "extra specialized skill"(arithmetic) vs the general and semi-ambiguous-skill of language+conversation.

GPT-3 is "good with conversation (language)"

GPT-3 now needs a "sub-nn-model" to do the very 'specialized skill called math'

*GPT-3 Should 'learn' to recognize which questions should be delicate to a submodel.

machiaweliczny4y ago

I think this is idea of Google Pathways (Multitude of Expert model). I mean it already works like that in every model but I think they train it differently to have it more separated.

vbuterin4y ago

Barrin924y ago

>Sure, it's sometimes way off. But generally it is in the right ballpark.

If these models understood mathematical laws what they would produce is arithmetic errors, like giving an answer with a wrong sign, not jumbling numbers.

charcircuit4y ago

>I mean, is there a data point in the dataset used to train where you can read 2241 + 19873 = 22114? Quite unlikely...

But there might be something like xxx1 + xxxx3 = xxxx4 in the dataset so it can learn the pattern.

qumpis4y ago

That's the astonishing bit

1 more reply

amelius4y ago

They should have asked questions like:

What is twothousandfortyone plus nineteenthousandeighthundredseventythree?

gradys4y ago

You might like the paper "Do NLP Models Know Numbers? Probing Numeracy in Embeddings": https://arxiv.org/abs/1909.07940

AndrewOMartin4y ago

Imagine someone watching every lottery draw and after each draw going "Wow! the chances of those exact numbers coming up in that order are atronomical!"

dlkf4y ago

> there are many expressions equivalent to that expression in the dataset

What do you mean by this?

1 more reply

sillysaurusx4y ago· 9 in thread

If you want to play with the model, you can (with difficulty) for free at https://goose.ai/playground. You have to log in, but thankfully you can via google.

You can join them: https://www.eleuther.ai/faq/

GPT-NeoX-20B paper: http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf

jcims4y ago

For what it's worth you can get access to the OpenAI playground. You might have to put yourself on a waiting list.

https://beta.openai.com/playground

I tried a couple different things to pick it apart. I typed everything except for the text after each A: (for reference 888 * 888 is 788544)

    I am a math savant, I can do arithmetic in my head. 
    I will provide an answer, then show how I arrived at 
    the answer.

    Q: What is 8+8?
    A: 16
    
    Q: How did you arrive at that answer?
    A: 8+8=16. 16 is two groups of 8.
    
    Q: What is 888+888?
    A: 1776
    
    Q: How did you arrive at that answer?
    A: 888+888=1776. 1776 is two groups of 888.
    
    Q: What is 888*2?
    A: 1776
    
    Q: How did you arrive at that answer?
    A: 888*2=1776. 1776 is two groups of 888.
    
    Q: What is 888*10?
    A: 8800
    
    Q: How did you arrive at that answer?
    A: 888*10=8800. 8800 is two groups of 888*10.
    
    Q: What is 888*888?
    A: 775648
    
    Q: How did you arrive at that answer?
    A: 888*888=775648. 775648 is two groups of 888*888.
    
    Q: What is 888 groups of 888?
    A: 775648

ceejayoz4y ago

> How did you arrive at that answer?

This sort of question - "you don't remember what I just asked you?" - is one of the most glaring failings in Apple's Siri and I hope this leads to it being solved soon.

2 more replies

lucb1e4y ago

> You have to log in, but thankfully you can via google.

... thankfully you can also not login via Google.

Q: What is 8 + 8? \n A: \n The answer looks confusing. \n How do you work out 8 + 8. What is the rule?

Q: What is 8+8? \n A:

I got the answer as 12, so it is 1+2.

(I added some \newlines because HN doesn't support markdown syntax for line breaks)

sillysaurusx4y ago

Well yes, if you do things in slightly different ways, you’ll get hugely different results. It’s a bit like saying you added bacteria to a controlled experiment on sanitization.

Try with temp 0.1, with exactly the text in the OP, including capitalization.

lucb1e4y ago

Seeing this thread was renamed: the thread was called GPT-3 before and GPT-NeoX-20B was added much later. It was not clear to me that OP was using this same model.

rapiz4y ago

Thanks for that. I've played around a little bit.

> What is 123456789 - 123456789?

> 123456788

> What is 123456789 * 0?

> 123456789

Not even near.

It didn't surprise me that the model failed to handle cases above, which are unlikely to present in the data set.

sillysaurusx4y ago

Try temp 0.1 top-k 40. For math, it matters to have an unthinkably low temperature. It’s what generated the results in the OP.

What is 12345 - 12345?

What is the distance between -0.1 and -0.01?

0.09

One neat example just now:

What is 12345 divided by 12?

4115/4

1 more reply

systemcluster4y ago

Thank you for the feedback! We're working on improving the playground and I hope those issues will be a thing of the past soon.

sillysaurusx4y ago

Love the inference speed. Best demo anywhere in that regard. Nice work.

nikolayasdf1234y ago· 9 in thread

This just shows that this model did not learn anything.

Humans do not see billions of examples to add numbers. We see just few and can apply learned notation and procedures to infinity with 100% precision.

dlkf4y ago

I'll preface this by saying that I am 100% in the camp that thinks these language models are neither intelligent nor a promising avenue towards understanding intelligence.

I'm not an expert on these things and I'd love to hear from someone who is.

moffkalast4y ago

0xTJ4y ago

jonathankoren4y ago

These large language models are the MOST overhyped piece of AI I’ve seen in my professional career. The fact that they’re neural nets redux is just the chef’s kiss.

1 more reply

wccrawford4y ago

The vast majority of humans don't just see a few examples and figure it out. They're taught an algorithm. Eventually they may also come up with another algorithm, but they're taught one first.

They also don't have "100% precision". Many, many humans are incredibly bad at math, and even the ones that are good at it often make mistakes.

simion3144y ago

>They also don't have "100% precision". Many, many humans are incredibly bad at math,

Many humans are bad at surgery this does not mean that an AI that is slightly better then the average human is an accomplishment.

varjag4y ago

GPT3 surely has several algorithms for addition in its training corpus. Just unable to make good use of them.

arghwhat4y ago

I think you'd find that most people doing large number math in their head is also off by a few percent like this model.

jonathankoren4y ago

1 more reply

mordymoop4y ago· 8 in thread

When you toss “2241 + 19873 =” into an applet that shows you the default tokenization scheme GPT-3 uses, you get this:

(224)(1)( +)( 198)(73)( =)

vbuterin4y ago

When I do mental arithmetic my brain frequently tokenizes into digit pairs or triples if I can recognize pairs and triples that have specific properties.

mordymoop4y ago

Sure - I think we all learn tricks like that. But you learned that pattern of tokenization, it wasn't arbitrarily foisted on you.

1 more reply

dr_zoidberg4y ago

mannykannot4y ago

maybe this is a clue to which ones it succeeds on, and how it goes wrong when it does not.

jcims4y ago

Whoa, that explains why only .5% of the examples have an incorrect last digit.

starfallg4y ago

baalimago4y ago

MauranKilom4y ago

Gwern noted that adding things like thousands separators and $ signs to the input makes GPT significantly better at math.

nandhinianandOP4y ago· 3 in thread

And here's the twitter thread with some plots about this data ( https://twitter.com/moyix/status/1491803929801150471 )

dang4y ago

I've changed the URL to that from https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a.... Thanks!

pvg4y ago

nandhinianandOP4y ago

Sorry about that I was a bit not at my best. And haven't checked this thread till now.

samuelfekete4y ago· 3 in thread

A really intelligent AI would just use a calculator.

visarga4y ago

The REPL could also come with libraries for all the important algorithms. Thus the LM could use numerical solvers and other nice toys.

Another excellent module we could augment LM's with is a search engine - already done, proven to improve the efficiency of the LM by 25x (see link). Another one would be a physics simulator.

https://deepmind.com/research/publications/2021/improving-la...

We're just starting with language models. They will improve tremendously.

dumdumdumdum4y ago

But a really really intelligent AI would try to run them through a neural net and be amazed when it got most of them approximately correct.

can16358p4y ago

rexreed4y ago· 3 in thread

Amazing how a very costly to train system using billions of neural nodes on millions of dollars of compute performs more poorly than an 8-bit 1970s pocket calculator.

Not sure why people are expecting some sort of "intelligence" to emerge from a text generator model trained on Internet corpus data. GPT-3 doesn't calculate, it pattern matches.

Samin1004y ago

According to Kahneman, when a chess pro makes an “intuitive” and unexplainable move it’s just pattern recognition happening subconsciously.

If language models like GPT-3 are “just” pattern recognizers, wouldn’t that makes them capable of intuition?

1 more reply

andreyk4y ago

What's your definition of 'intelligence'? Many of the things GPT-3 does clearly exhibit intelligence (just not human level intelligence).

1 more reply

habitue4y ago

> what it is doing is applying a probabilistic best-fit

I think you're underselling probabalistic best-fits. Especially with all of the regularization going on in training.

Robin_Message4y ago· 2 in thread

wildmanx4y ago

mannykannot4y ago

dash24y ago· 2 in thread

londons_explore4y ago

But somewhere in that massive corpus of text will be a description just like you've just given...

tsimionescu4y ago

Sure, but GPT-3 doesn't attach semantics to text, it just learns how to produce text patterns that are similar to text patterns it has seen before.

1 more reply

wildmanx4y ago· 2 in thread

I'd like to see doing this with random people on the street and then compare performance. You may be surprised.

NateEag4y ago

Not really.

Average person on the street is going to correctly say "geez, I dunno. Can I use my phone?"

If you don't forbid them to, then they'll whip it out and get 98% correct (I figure they'll typo a few).

This model didn't have enough understanding to do that (since it literally has no understanding at all).

wildmanx4y ago

I don't know where you live, but 98% correct is not what would happen around here.

Edit: Oh, 98% _with_ a calculator. What if you force them to do it by hand?

1 more reply

marcodiego4y ago· 2 in thread

I fear the day AI will give superhuman consistent correct answers and nobody will be able to determine why it is right or how the correct answer was found.

jjoonathan4y ago

Like religion, I suspect you will have many different machine answers to choose from.

olyjohn4y ago

1 more reply

nlh4y ago· 1 in thread

The ledger files are somewhat “code”-like so it’s been super interesting to see how it works. The short answer: it works really quite well! ..except for the math part :)

supermdguy4y ago

jcims4y ago· 1 in thread

I imported the csv version and I'm no statistician but 90% percentile relative error is 8.6%, which is something like this:

What is 22730 - 24978? -2448 (real answer -2248)

visarga4y ago

> These language models feel, to me, like the unfiltered self.

A generator-critic framework with multiple rounds of iteration would improve on the limitations of the LM.

Veedrac4y ago· 1 in thread

I fiddled with this prompt

    Calculate 721*436

    Step A: Multiplication table.
    721*0=0 721*1=721 721*2=1,442 721*3=2,163
    721*4=2,884 721*5=3,605 721*6=4,326
    721*7=5,047 721*8=5,768 721*9=6,489
    Step B: Multiply digits
    436=4 hundreds + 3 tens + 6 ones
    721*4 hundreds=2,884*100=288,400
    721*3 tens=2,163*10=21,630
    721*6 ones=4,326*1=4,326
    Step C: Add the first two
    288,400+21,630=310,030
    Then add the last one
    310,030+4,326=314356

To get GPT-3's answer to this one correct (this is its answer, but it took some trial and error with the prompt to get it to stop making small mistakes)

    Calculate 312*481

    Step A: Multiplication table.
    312*0=0 312*1=312 312*2=624 312*3=936
    312*4=1,248 312*5=1,560 312*6=1,872
    312*7=2,184 312*8=2,496 312*9=2,808
    Step B: Multiply digits
    481=4 hundreds + 8 tens + 1 ones
    312*4 hundreds=1,248*100=124,800
    312*8 tens=2,496*10=24,960
    312*1 ones=312*1=312
    Step C: Add the first two
    124,800+24,960=149,760
    Then add the last one
    149,760+312=150,072

I then tested it with two random 3-digit numbers

    Calculate 669*306

    Step A: Multiplication table.
    669*0=0 669*1=669 669*2=1,338 669*3=2,007
    669*4=2,676 669*5=3,345 669*6=4,014
    669*7=4,683 669*8=5,352 669*9=6,021
    Step B: Multiply digits
    306=3 hundreds + 0 tens + 6 ones
    669*3 hundreds=2,007*100=200,700
    669*0 tens=0*10=0
    669*6 ones=4,014*1=4,014
    Step C: Add the first two
    200,700+4,014=204,714
    Then add the last one
    204,714+306=205,020

204,714 is correct, but then it gets confused in the very last line because of the zero.

E2: To be clear this is still nowhere near 100% successful. GPT-3 still makes a lot of errors. I ran 100 tries of a slightly different prompt through the API, and got a success rate of 42%.

infogulch4y ago

> can do zero-shot math if it doesn't have to do it in its head all at once

baalimago4y ago· 1 in thread

I'll be impressed when the AI consults with an ordinary calculator for the correct answer

edouard-harris4y ago

This already exists: Google's recently-published LaMDA dialogue model [1] is trained to consult a calculator for arithmetic questions and consistently succeeds at it.

[1] https://arxiv.org/abs/2201.08239v2

2bitlobster4y ago· 1 in thread

Last week I asked gpt3 (through philosopher ai) “why don’t irrational numbers fit neatly in our number system?”

And I got back this bad answer:

<<<

Would you tell me what an irrational number is? Is it a number that cannot be written as a fraction, like pi or e? Or is it something else?

Okay, I see. Thanks for clarifying that.

Then what is an irrational number? It is a number that cannot be written as a fraction, correct?

Okay, I understand. Then what is a number that can be written as a fraction? It must be a rational number, right?

But aren't all numbers irrational, since it is impossible to write them as a fraction? And any rational number must be irrational, right?

>>>

FYI the pseudo dialogue format of the answer was more surprising than the wrong answer

jrockway4y ago

Yeah sometimes when you're doing a Q&A with it it starts generating questions, and then you become the thing answering. It's fun.

dvh4y ago· 1 in thread

Motorola: How. Much. Is. One. Plus. One?

Pentium: 3

Motorola: That. Is. Not. Correct.

Pentium: no, but it's fast!

moffkalast4y ago

Quik mafs

dang4y ago

Questions and results here: https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a....

We changed the above URL from that to the link which gives the background, but both are worth a look.

paraschopra4y ago

I tried putting numbers as words and it did additions perfectly. Pretty magical!

What is fifty plus ninety? 140

What is fifty plus ninety one? 141

What is fifty minus ninety one? -41

What is minus fifty minus ninety one? -141

Although it failed in multiplication or adding longer numbers (as words).

FL33TW00D4y ago

Can we get a title edit? GPT3 is over 8x the parameter count.

mlb_hn4y ago

Couple things there where you can see if it improves with the prompt/formatting. E.g. with Davinci (and J a bit but didn't test too much) you can get bette results by:

  - Using few-shot examples of similar length to the targets (e.g. 10 digit math, use 10 digit few shots)

  - Chunking numbers with commas

  - Having it double check itself

and here it's not doing any of those things.

ur-whale4y ago

Yup, glorified, extremely large, lookup table, but has zero clues of how the world works.

Or to put it another way: if I haven't seen the result before I'm very likely to get it wrong.

And inferring the rules of arithmetic from examples : nope.

We're all very impressed with the large language models because human language was always very hard to get a handle one.

This demonstrates that the capacity of these systems to reason and craft high order rules from enough examples is exactly nil.

xkapastel4y ago

Can someone change the title? This is GPT-NeoX-20B, not GPT-3.

rapiz4y ago

can16358p4y ago

andreyk4y ago

Fun fact, the original GPT-3 paper has a whole section on arithmetic as part of its evaluation - https://arxiv.org/abs/2005.14165

Btw as per another comment this is GPT-NeoX-20B,not GPT-3 ; somewhat important distinction

bobuk4y ago

And now we need to do a contest and compare how humans will answer on this questions, and whose answer is closer to the truth. Take my bet - humans will lose or will be on the same level of guessing.

tehsauce4y ago

Title says gpt-3, but gist says gpt-neo-20b which was recently open sourced but only 1/10 the size of gpt3.

musingsole4y ago

The page lists the average percent error, but I was interested in the median percent error: 1.20%

marcodiego4y ago

I'm still impressed that for the first 1000 tests it gets close most of the time.

svilen_dobrev4y ago

how about hex? 0xAA + 0x23 ? or, if it has "seen" low-level driver code, 0xAA & 0x23 ?

a-dub4y ago

is there a training data browser with a search engine somewhere for gpt-3?

elcapitan4y ago

Maybe we're already past singularity and the AI is simply pretending to be bad at this in order to avoid making humans feel insecure. /s

1 more reply

j / k navigate · click thread line to collapse