I agree with all key points:
* There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don't fully understand why current LLMs are bad at these tasks.
* Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.
* It's unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI. As a model of computation, attention is very rigid, whereas human brains are always undergoing synaptic plasticity. There may be a more flexible architecture capable of AGI, but we don't know it yet.
* For now, using current AI models requires carefully constructing long prompts with right and wrong answers for computational problems, priming the model to reply appropriately, and applying lots of external guardrails (e.g., LLMs acting as agents that review and vote on the answers of other LLMs).
* Attention seems to suffer from "goal drift," making reliability hard without all that external scaffolding.
Go read the whole thing.
I thought we did know for things like playing Wordle: it's because they deal with words as sequences of tokens that correspond to whole words, not sequences of letters, so a game that involves manipulating sequences of letters constrained to valid words doesn't match the way they process information?
> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the “reasoning steps” and handing them to the LLM, instead of the LLM figuring them out by itself. We have “reasoning machines” that are intelligent but seem to be hitting fundamental limits we don’t understand.
But providing examples with different, contextually appropriate sets of reasoning steps can enable the model to choose its own, more-or-less appropriate set of reasoning steps for particular questions that don't match the examples.
> It’s unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI.
Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific.
I'm sure one could train an LLM to be awesome at Wordle, but from an AGI perspective the fact that you'd have to do so proves it's not a path to AGI. The Wordle dominating LLM would presumably be perplexed by the next clever word game until trained on thinking about information that way, while a human doesn't need to absorb billions of examples to figure it out.
I was originally pretty bullish on LLMs, but now I'm equally convinced that while they probably have some interesting applications, they're a dead-end from a legitimate AGI perspective.
This is such a weird thing to say. Essentially _all_ scientific ideas are, at least to begin with, poorly defined. In fact, I'd argue that almost all scientific ideas remain poorly defined with the possible exception of _some_ of the basic concepts in physics. Scientific progress cannot be and is not predicated upon perfect definitions. For some reason when the topic of consciousness or AGI comes up around here, everyone commits a sort of "all or nothing" logical fallacy: absence of perfect knowledge is cast as total ignorance.
Wordle and cellular automata are very 2D, and LLMs are fundamentally 1D. You might think "but what about Chess!" - except Chess is encoded extremely often as a 1D stream of tokens to notate games, and bound to be highly represented in LLMs' training sets. Wordle and cellular automata are not often, if ever, encoded as 1D streams of tokens - it's not something an LLM would be experienced with even if they had a reasonable "understanding" of the concepts. Imagine being an OK chess player, being asked to play a game blindfolded dictating your moves purely via notation, and being told you suck.
> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.
You have probably heard of this really popular game called Bridge before, right? You might even be able to remember tons of advice your Grandma gave you based on her experience playing it - except she never let you watch it directly. Is Grandma "figuring out the game" for you when she finally sits down and teaches you the rules?
Vision transformers are gaining traction, and they are focused 100% on 2D data.
Also as an aside, I'll add that transformers can be seen as a kind of "RNN" that grows its hidden state with each new token in the input context. I wonder if we will end up needing some new kind of "RNN" that can grow or shrink its hidden state and also access some kind of permanent memory as needed at each step.
We sure live in interesting times!
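As a toy illustration of that "growing hidden state" view (all weights and sizes here are made up, single head, no output projection), each decoding step appends to a key/value cache and attends over everything cached so far:

    import numpy as np

    d = 8  # toy model dimension
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    keys, values = [], []  # this is the "hidden state" -- it grows with every token

    def decode_step(x):
        """One autoregressive step of single-head attention over a KV cache."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        keys.append(k)
        values.append(v)
        K, V = np.stack(keys), np.stack(values)
        scores = K @ q / np.sqrt(d)   # attend over *all* cached tokens so far
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                  # context vector for this step

    for t in range(5):
        _ = decode_step(rng.normal(size=d))   # stand-in for a token embedding
        print(f"step {t}: cache holds {len(keys)} key/value pairs")

An RNN in the classic sense would carry a fixed-size state between steps; here the state is the whole cache, which is the "growing hidden state" picture above.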
This is common, and commonly called retrieval augmented generation, or RAG.
edit: I did not pay attention to the link. It is about Wu et al's "Memorizing Transformers", which contain an internal memory.
In complete seriousness, can anyone explain why LLMs are good at some tasks?
They can come up with excellent (or excellent-looking-but-wrong) answers to any question that their training corpus covers. In a gross oversimplification, the "reasoning" they do is really just parroting a weighted average (with randomness injected) of the matching training data.
What they're doing doesn't really match any definition of "understanding." An LLM (and any current AI) doesn't "understand" anything; it's effectively no more than a really big, really complicated spreadsheet. And no matter how complicated a spreadsheet gets, it's never going to understand anything.
Not until we find the secret to actual learning. And increasingly it looks like actual learning probably relies on some of the quantum phenomena that are known to be present in the brain.
We may not even have the science yet to understand how the brain learns. But I have become convinced that we're not going to find a way for digital-logic-based computers to bridge that gap.
An LLM isn't a model of human thinking.
An LLM is an attempt to build a simulation of human communication. An LLM is to language what a forecast is to weather. No amount of weather data is actually going to turn that simulation into snow, no amount of LLM data is going to create AGI.
That having been said, better models (smaller, more flexible ones) are going to result in a LOT of practical uses that have the potential to make our day to day lives easier (think digital personal assistant that has current knowledge).
That doesn’t mean we won’t end up approximating one eventually, but it’s going to take a lot of real human thinking first. For example, ChatGPT writes code to solve some questions rather than reasoning about it from text. The LLM is not doing the heavy lifting in that case.
Give it (some) 3D questions or anything where there isn’t massive textual datasets and you often need to break out to specialised code.
Another thought I find useful is that it considers its job done when it’s produced enough reasonable tokens, not when it’s actually solved a problem. You and I would continue to ponder the edge cases. It’s just happy if there are 1000 tokens that look approximately like its dataset. Agents make that a bit smarter but they’re still limited by the goal of being happy when each has produced the required token quota, missing eg implications that we’d see instantly. Obviously we’re smart enough to keep filling those gaps.
Similar to how CNNs are so successful at image recognition, because they also roughly follow the way we do it too.
Other seq-2-seq language approaches work too, but not as well as Transformers, which I'd guess is due to transformers better matching our own inductive biases, maybe due to the specific form of attention.
Like how we explain humans doing tasks -- they evolved to do that.
I believe this is a non-answer, but if we are satisfied with that non-answer for humans, why not for LLMs?
They're specialised for tasks through the training corpus, the attention mechanisms, the loss functions, and such.
I'll leave it to others to expand on actual answers, but IMO focusing on transfer learning helps to understand how an LLM does inferences.
Just using the term 'General' doesn't seem to communicate anything useful about the nature of intelligence.
One thing an LLM _also_ doesn't bring to the table is an opinion. We can push it in that direction by giving it a role ("you are an expert developer" etc), but it's a bit weak.
If you give an LLM an easy task with minimal instructions it will do the task in the most conventional, common sense fashion. And why shouldn't it? It has no opinion, your prompt doesn't give it an opinion, so it just does the most normal-seeming thing. If you want it to solve the task in any other way then you have to tell it to do so.
I think a hard task is similar. If you don't tell the LLM _how_ to solve the hard task then it will try to approach it in the most conventional, common sense way. Instead of just boring results for a hard task the result is often failure. But hard problems approached with conventional common sense will often result in failures! Giving the LLM a thought process to follow is a quick education on how to solve the problem.
Maybe we just need to train the LLM on more problem solving? And maybe LLMs worked better when they were initially trained on code for exactly that reason, it's a much larger corpus of task-solving examples than is available elsewhere. That is, maybe we don't talk often enough and clearly enough about how to solve natural language problems in order for the models to really learn those techniques.
Also, as the author talks about in the article with respect to agents, the inability to rewind responses may keep the LLM from addressing problems in the ways humans do, but that can also be addressed with agents or multi-prompt approaches. These approaches don't seem that impressive in practice right now, but maybe we just need to figure it out (and maybe with better training the models themselves will be better at handling these recursive calls).
And yes, of course, that opinion is going to be the "average" of what their training data is, but why is that a surprise? Humans don't come with innate opinions, either - the ones that we end up having are shaped by our upbringing, both the broad cultural aspects of it and specific personal experiences. To the extent an LLM has either, it's the training process, so of course that shapes the opinions it will exhibit when not prompted to do anything else.
Now the fact that you can "override" this default persona of any LLM so trivially by prompting it is IMO stronger evidence that it's not really an identity. But that, I think, is also a function of their training - after all, that training basically consists of completing a bunch of text representing many very different opinions. In a very real sense, we're training models to assume that opinions are fungible. But if you take a model and train it specifically on, e.g., the writings of some philosophical school, it will internalize those.
Rather than asking why LLMs can’t do these tasks, maybe one should ask why we’d expect them to be able to in the first place? Do we fully understand why, for example, a cat can’t predict cellular automata? What would such an explanation look like?
I know there are some who will want to immediately jump in with scathing disagreement, but so far I’ve yet to see any solid evidence of LLMs being capable of reasoning. They can certainly do surprising and impressive things, but the kind of tasks you’re talking about require understanding, which, whilst obviously a very thorny thing to try and define, doesn’t seem to have much to do with how LLMs operate.
I don’t think we should be at all surprised that super-advanced autocorrect can’t exhibit intelligence, and we should spend our time building better systems rather than wondering why what we have now doesn’t work. It’ll be obvious in a few years (or perhaps decades) from now that we just had totally the wrong paradigm. It’s frankly bonkers to think you’re ever going to get a pure LLM to be able to do these kind of things with any degree of reliability just by feeding it yet more data or by ‘prompting it better’.
That's quite a statement.
I think what it tells us is that our cognition is capable of more than just language modeling. With LLMs we are discovering (amazing) capabilities and the limits of language models. While language models can do incredible things with language that humans can't, they still can't do something simple like sudoku. But there are neural networks, CNNs and RNNs that can solve sudoku better than humans can. I think that the thing to learn here is that some problems are in the domain of language models, and some problems are a better fit for other forms of cognition. The human brain is amazing in that it combines several forms of cognition in an integrated way.
One thing that I think LLMs have the capability to do is to integrate several types of systems and to choose the right one to solve a problem. Teach an LLM how to interface with a CNN that solves sudoku problems, and then ask it a sudoku problem.
It seems to me that if we want to create an AGI, we need to learn how to integrate several different types of models, and teach them how to distribute the tasks we give them to the correct models.
A relatively simple graph theory algorithm can solve it (and at multiple orders of magnitude fewer calculations). Even a naive brute force search is considered tractable, considering the problem size. Although, search could be considered one of the AI tools in your proposed toolbox.
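For reference, the naive backtracking search really is tiny. A minimal sketch (grid as a 9x9 list of lists, 0 for an empty cell):

    def valid(grid, r, c, v):
        """Check whether value v can go at row r, column c."""
        if v in grid[r]:
            return False
        if v in (grid[i][c] for i in range(9)):
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(grid[br + i][bc + j] != v for i in range(3) for j in range(3))

    def solve(grid):
        """Fill empty cells by depth-first search with backtracking."""
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        if valid(grid, r, c, v):
                            grid[r][c] = v
                            if solve(grid):
                                return True
                            grid[r][c] = 0
                    return False   # no value fits this cell: backtrack
        return True                # no empty cells left: solved

solve() mutates the grid in place and returns True once a consistent assignment is found; even this naive version typically finishes ordinary puzzles almost instantly.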
Prompt: Predict which type of algorithm would be effective to solve sudoku.
Response: A backtracking algorithm is typically best for solving Sudoku puzzles due to its efficiency in exploring all possible number placements systematically until it finds the correct solution.
...seemed to work well enough for me.
Prompt 2: Which type of neural network is most efficient at solving sudoku?
Response 2: Convolutional Neural Networks (CNNs) are particularly effective for solving Sudoku puzzles. They can capture the spatial hierarchies in the grid by processing parts of the grid as images, making them efficient for this type of puzzle-solving task.
...Seems to me that LLMs have no problem with this task.
For a review of this topic, I'd suggest: https://nessie.ilab.sztaki.hu/~kornai/2023/Hopf/Resources/st...
The authors of this review have themselves written several articles on the topic, and there is also empirical evidence connected to these limitations.
What Algorithms can Transformers Learn? A Study in Length Generalization https://arxiv.org/abs/2310.16028
With both empirical and theoretical support I find it's pretty clear this is an obvious limitation.
Anyway, this list looks more like a "hold my beer" moment for AI researchers than any fundamental reason for AIs to stop evolving any further. Sure there are weaknesses, and paths to address those. Anyone claiming that this is the end of the road in terms of progress is going to be in for a disappointing reality check, probably a lot sooner than is comfortable.
And of course by narrowing it to just LLMs, the authors have a bit of an escape hatch because they conveniently exclude any further architectures, alternate strategies, improvements, that might otherwise overcome the identified current weaknesses. But that's an artificial constraint that has no real world value; because of course AI researchers are already looking beyond the current state of the art. Why wouldn't they.
Don't think anyone worth their salt would look at this and think: oh well, that's that then.
A large enough LLM with memory is Turing complete.
So theoretically I don’t think there is anything they can never do.
Common misunderstanding of the universal approximation theorem.
Consider this: can an MLP approximate a sine wave?
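A quick way to see the intended answer (assuming the point is about extrapolation beyond the training interval, rather than approximation on a bounded one): a small MLP fits sin(x) where it has seen data and falls apart outside that range, even though sine is the same function everywhere. A rough PyTorch sketch; sizes and the training schedule are arbitrary:

    import math
    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(1, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Train only on [-2*pi, 2*pi].
    x = torch.linspace(-2 * math.pi, 2 * math.pi, 1024).unsqueeze(1)
    y = torch.sin(x)
    for _ in range(2000):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

    with torch.no_grad():
        x_in = torch.tensor([[math.pi / 2]])                   # inside the training interval
        x_out = torch.tensor([[10 * math.pi + math.pi / 2]])   # far outside it
        print("inside :", model(x_in).item(), "target:", 1.0)
        print("outside:", model(x_out).item(), "target:", 1.0)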
> A large enough LLM with memory is Turing complete.
With (a lot of) chain of thought it could be.
Read the paper, and its references.
This just isn't true. Human training is multimodal to a degree far beyond even the most capable multimodal model, so human babies arguably see more data by a young age than all models collectively have seen.
Not to mention that human babies don't even start as a blank slate as LLMs do, billions of years of evolution have formed the base model described by our DNA.
From there, remember the text is usually a reflection of things in the real world. Understanding those things in non-textual ways both gives meaning to and deeper understanding of the text. Much of the text itself was even stored in other modes, like markup or PDF’s, whose structure tells us things about it.
That we learn multimodal from birth is therefore an important point to make.
It might also be a prerequisite for AGI. It could be one of the fundamental laws of information theory or something. Text might not be enough like how digital devices need analog to interface with the real world.
In the meantime, it took me about 2 minutes and 0 guesswork to write a straightforward and readable solution in 15 lines of Python. This I know for sure will work 100% of the time and not cost $1 per inference.
Reminds me about some early attempts to have executable requirements specifications or model-based engineering. Turns out, expressing the problem is half the problem, resulting in requirements often longer and more convoluted than the code that implements them, code being a very efficient language to express solutions and all their edge cases, free from ambiguity.
Don't get me wrong here, LLMs are super useful for a certain class of questions. The boundaries of what they cannot do need to be understood better, to keep the AI-for-everything hype at bay.
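For what it's worth, and assuming the problem in question is a single step of the article's Game of Life example (the comment doesn't actually name it), the deterministic version does fit in roughly that much Python:

    def life_step(grid):
        """One Conway's Game of Life step; grid is a list of lists of 0/1, edges treated as dead."""
        rows, cols = len(grid), len(grid[0])
        new = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                n = sum(grid[r + dr][c + dc]
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                        if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols)
                new[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
        return new

    glider = [[0, 1, 0], [0, 0, 1], [1, 1, 1]]
    print(life_step(glider))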
As amazing as they are, they still have many limitations.
I’ve been working with ChatGPT and Gemini to apply simple rules like the one above and I got so frustrated.
I can’t fault LLMs for not knowing what to do here because I, a human, have no idea what on earth this means.
or at least that's my understanding of the prompt
I’m not a native English speaker, how would you write it?
FWIW the LLMs get it right many times, but fail other times.
But if you are writing about a large number, particularly one with a lot of different digits, prefer writing the digits: "A mile is 5,280 feet." Compare that to: "A mile is five thousand, two hundred, and eighty feet."
Given the many responses, it would be fun to see if LLMs beat humans on understanding the sentence, haha.
There are two blue, one red, and 15 green m&ms in this bag.
You could get this behavior implemented perfectly with constrained text gen techniques like grammars or any of the various libraries implementing constrained text gen (i.e. guidance)
I could think of how to use them to prevent the LLM from generating digits for numbers smaller than ten, by using a regex plus a constraint that forbids digits, but the main problem is the other part of the rule, i.e. numbers above 10 should never be spelled out and should be written as digits instead. For that I presume you need to identify the spelled-out numbers first, for which you presumably would need the LLM, so you're back to LLM fallibility.
Any pointers would be greatly appreciated.
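One pointer, as a sketch rather than a real solution: enforce the rule deterministically after generation instead of during it. The word list and regexes below are illustrative and only cover simple spellings; compound ones like "five thousand two hundred and eighty" are exactly the hard part mentioned above.

    import re

    # Post-processing sketch for "numbers below ten are spelled out,
    # ten and above are written as digits".
    SMALL = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
    LARGE = {"ten": 10, "eleven": 11, "twelve": 12, "fifteen": 15, "twenty": 20, "hundred": 100}

    def enforce_number_style(text: str) -> str:
        # Single digits below ten -> words.
        text = re.sub(r"\b[0-9]\b", lambda m: SMALL[int(m.group())], text)
        # Simple spelled-out numbers of ten or more -> digits.
        pattern = r"\b(" + "|".join(LARGE) + r")\b"
        return re.sub(pattern, lambda m: str(LARGE[m.group().lower()]), text, flags=re.IGNORECASE)

    print(enforce_number_style("There are 2 blue, 1 red, and fifteen green m&ms."))
    # -> "There are two blue, one red, and 15 green m&ms."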
https://www.youtube.com/watch?v=99NPzteAz94
This is my first post on HN in 10 years.
Is source available?
I would love to play with this
Are they layer 2 solutions like Lightning is to bitcoin?
I have honestly spent multiple hours trying to word the prompt so it will stop including introductory phrases and just include the fact straight. I have gone so far as forbid the words “interesting”, “notable” and a few others in the prompt, and it just ignores me and uses them anyway.
I’ve tried all the different available models and 4 will once in a great while get it right. 3, 3 turbo, and even 4 turbo basically never work as I want.
Overall, it seems pretty bad at being told not to do certain things. For instance there are a number of topics I don’t want mentioned in its response, but telling it not to seems to only increase the likelihood that it will mention it, like you put the idea in its head.
---
transcript:
$ ollama run llama3:8b
>>> tell me an interesting fact about etymology
Here's one:
Did you know that the word "quarantine" has a fascinating etymological history?
The word "quarantine" comes from^C
>>> tell me an interesting fact about etymology. just the fact, don't mention it's interesting.
The word "robot" originated from the Czechoslovakian word "robota," which means "forced labor" or "drudgery." This term was used in Karel Čapek's 1920 play "R.U.R." (Rossum's Universal Robots), where it referred to artificial workers.
Is the response interesting because you don't know it? —How does it know this?
Is the response interesting because you do know it or might have offered it yourself? —Confirmation bias.
Is it interesting because a lot of training references contextualize it as "interesting?" —Begged question.
Is it contextually interesting? —What is the context? A robot refers to robots? How unexpected...
Is it interesting within the narrow confines of LLM adaptations to a scope of inputs?
Can there be any more damning claim against the general suitability of the technology as an oracle than different users using the same prompts and getting inexplicably contrary results?
If trivial prompt alignments result in appropriate vs inappropriate responses, this destroys confidence for every response.
What am I missing?
If you spent that much time I'm sure you tried this and other things, so maybe even that isn't enough. (Though I assume if you ask for a JSON/function call response with the API that you'd do fine...?)
Hence, we do need powerful and less censored LLMs if we want to better integrate LLMs into applications.
You can negative prompt any LLM with stuff like "always write the word interesting in your response".
You can also use techniques for modifying logprobs of tokens, which is available in the GPT-4 API (but is hard to use). You can literally ban "interesting" from its vocabulary.
You could even use representation steering techniques to do this using control vectors. See this library as an example: https://github.com/Hellisotherpeople/llm_steer-oobabooga
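For the logprob route specifically, here's a minimal sketch using the OpenAI chat API's logit_bias parameter. It assumes the tiktoken vocabulary matches the serving model's tokenizer, and the exact word fragments you need to ban may differ:

    import tiktoken
    from openai import OpenAI

    enc = tiktoken.encoding_for_model("gpt-4")
    banned = {}
    for fragment in ("interesting", " interesting", "Interesting", " Interesting"):
        for tok in enc.encode(fragment):
            banned[tok] = -100  # -100 effectively removes the token from consideration

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a fact about etymology."}],
        logit_bias=banned,
    )
    print(resp.choices[0].message.content)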
Me:
For each of the following relational statements, flip the relation around and generate a new statement or question answer pair.
Input: Bluey's father is Bandit
Output: Bandit's daughter is Bluey
Input: Neil Armstrong was the first person to step on the Moon.
Output: Who was the first person to step on the Moon? Neil Armstrong.
Now for the real statements:
- Valentina Tereshkova was the first woman to travel to space
- Who is Mary Lee Pfeiffer's son? Tom Cruise
ChatGPT:
Here are the flipped relational statements:
Valentina Tereshkova was the first woman to travel to space
Output: Who was the first woman to travel to space? Valentina Tereshkova.
Who is Mary Lee Pfeiffer's son? Tom Cruise
Output: Tom Cruise's mother is Mary Lee Pfeiffer.
[0]: https://chat.openai.com/share/33eb3ee4-2094-4748-b01e-0967af...
If you think about it, they're making an extraordinary claim given how fluently LLMs engage in language and reasoning.
https://andrewmayne.com/2023/03/29/how-to-play-wordle-with-g...
The LLM isn't solving the actual problem, it's solving a subset problem.
Now before anyone gets too caught up with objecting to this notion, I would seriously suggest that you spend time observing children from newborn to 2 years old.
I have been observing my latest granddaughter since her birth about 16 months ago and thinking about every public LLM system currently available.
There is an insight here to be obtained and that insight is in the nature of real intelligence.
On the whole, no-one actually knows what intelligence is or what sentience is or what it means to be cognitively conscious. There is still much research going on and nothing actually definitive has come forth yet. We really are at the beginning in terms of studying these areas.
We can certainly produce some incredible systems, but none of them are intelligent per se. Solutions to certain kinds of problems can be achieved using these systems, and there are researchers who are seriously looking at incorporating these systems into CAS and theorem provers. These systems, though, only provide an augmentation service for a person, as does every mechanical system we use.
But there is an essential component necessary for the use of all LLMs which many seem not to be cognisant of, and that is that these systems, to be useful, require humans to be involved.
The questions we have to ask ourselves are: what can we use these systems for, and do these uses provide benefits in some way, or can these systems be abused by various parties in obtaining control over others?
There are benefits and there are abuses. Can we do better or will we do worse by using them?
The scenario you described is possible if you started at the South Pole. If you travel 24,901 miles east from there, you would indeed end up back at the same spot because all lines of longitude converge at the poles. However, there are no tigers in Antarctica.
Tigers are native to Asia, not Antarctica. The closest tiger species to Antarctica would be the Siberian tiger, found in parts of Russia, China, and North Korea, but they are still thousands of miles away from Antarctica.
So, while the travel scenario is theoretically possible, the presence of the tiger is not realistic in this context. It seems like an imaginative or hypothetical situation rather than a real-world one.
(instead of the answer mentioned in the article)
Would you? How can you travel east at all, starting from the South Pole? There is no east. Every direction you face is north.
OTOH, every line of latitude whose length is an integral divisor of 24,901 miles will satisfy the problem.
It probably hallucinated the answer from that.
Just making a transformer bigger and bigger, and feeding it more and more data, will not change it from being a language model into something else, anymore than scaling up an expert system such as Cyc will transform it into something other than an expert system. "Scale it up and it'll become sentient" is one of the recurring myths of AI.. a bit odd that people are falling for it again.
As an aside, it seems reasonable to consider an LLM as a type of expert system - one that has a broad area of expertise (like Cyc), including (unlike Cyc) how to infer rules from language and generate language from rules.
If you want to create a brain-like AGI, then you need an entire cognitive architecture, not just one piece of it, which is what we have currently with LLMs. Compared to a brain, an LLM is maybe just like the cortex (without all the other brain parts like cerebellum, hippocampus, hypothalamus and interconnectivity such as the cortico-thalamic loop). It's as if we've cut the cortex out of a dead person's brain, put it in a mason jar to keep it alive, and hooked its inputs and outputs up to a computer. Feed words in, get words out. Cool, but it's not a whole brain, it's a cortex in a mason jar.
This is a pretty empty claim when we don't know what the limits of language modelling are. Of course it will never not be a language model. But the question is what are the limits of capability of this class of computing device?
How come? Note my emphasis on the 2nd 'L'. I'm not saying that there are things that AI models will never be able to do, I'm saying that there are things that Large Language Models will be unable to do.
Training LLMs is often argued to be analogous to human learning, most often as a defence against claims of copyright infringement by arguing that human creativity is also based on training from copyrighted materials. However, that is a red herring.
The responses from ever more powerful LLMs are indeed impressive, and beyond what an overwhelming majority of us believed possible just 5 years ago. They are nearing and sometimes surpassing the performance of educated humans in certain areas, so how come I can argue they are limited? Consider it from the other side: how come an educated human can create something as good as an LLM can when said human's brain has been "trained" on an infinitesimal fraction of the material which was used to train even the 1st release of ChatGPT?
That is because LLMs do not learn nor reason like humans: they do not have opinions, do not have intentions, do not have doubts, do not have curiosity, do not have values, do not have a model of mind — they have tokens and probabilities.
For an AI model to be able to do certain things that humans can do, it needs to have many of those human characteristics that allow us to perform impressive mental feats having absorbed barely any training material (compared to LLMs), while being virtually unable to even remember most of it, let alone verbatim. Such an AI model is surely possible, but it needs a completely different paradigm from straightforward LLMs. That said, a Language Model will almost certainly be a necessary module of such an AI, but it will not be sufficient.
Also, doubt is just uncertainty, and can be represented as a probability. Actually, all values and everything else can be represented as a numerical probability, which I personally prefer to do as well.
I don't think it should. It's more interesting to know what LLMs will _never_ be able to do (if anything).
The systems are hugely successful and popular, and this naturally leads to a massive interest in LLMs as the next step. They are incredible tools, but they are based on probability, and while they happen to be useful for almost everything, decision making probably shouldn't be one of those things. Similarly, ML is incredibly helpful in things like cancer detection, but we've already had issues where it got things wrong, and because MBAs don't really know how it works, it was used as a replacement instead of an enhancement for the human factor. I'm fairly certain we're going to use LLMs for a lot of things where we shouldn't, and probably never should. I'm not sure we can avoid it, but I wouldn't personally trust them to do any sort of function which will have a big influence on people's lives. I use both Copilot and OpenAI's tools extensively, but I can still prompt them with the same thing and get outputs of extremely different quality, and while this will improve, and while it's often possible to get an output that's actually useful, it's still a major issue that might never get solved well enough for what we're going to ask of the models, way before they are ready.
I hope we're going to be clever enough to only use them as enhancement tools in the vital public sector, but I'm sure we're going to use them in areas like education. Which is going to be interesting... We already see this with new software developers in my area of the world, where they build things with the use of LLMs, things that work, but aren't built "right" and will eventually cause issues. For the most part this doesn't matter, but you really don't want the person designing your medical software to use an LLM.
We don't need to create custom AGI for every domain, we just need a model/tool catalog and an agent that is able to reason well enough to decompose problems into parts that can be farmed out to specialized tools then reassembled to form an answer.
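A minimal sketch of that "catalog plus router" shape; the tools and names here are toy stand-ins, and in a real agent an LLM would pick the task type and the tools would be far heavier:

    # Toy tool catalog: each entry is a specialized, deterministic solver.
    TOOLS = {
        "arithmetic": lambda expr: eval(expr, {"__builtins__": {}}),  # toy calculator
        "word_length": lambda word: len(word),
        "reverse": lambda text: text[::-1],
    }

    def route(task_type: str, payload):
        """Dispatch a decomposed sub-task to the matching tool."""
        tool = TOOLS.get(task_type)
        if tool is None:
            raise ValueError(f"no tool registered for {task_type!r}")
        return tool(payload)

    print(route("arithmetic", "12345 * 6789"))
    print(route("word_length", "pneumonoultramicroscopic"))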
Curiously, the need to do this well - handling the quadratic complexity of a shifting set of human relationships, grudges, and alliances - is thought to be one of the things that led us to higher levels of intelligence.
Answering questions about what you're describing sounds really interesting. What would a training set be like that describes a bunch of complex human relationships and then asks questions about them with objective answers?
One difference, as in all ML training, is interactivity. Looking at ape studies, knowing the relationships is partly diagnostic, but it's also about planning and competition. And that competitive/adaptive aspect is what looks like a real evolutionary driver. If you can understand, navigate, and manipulate relationships successfully, you get more mating opportunities. Doing /that/ well involves both reasoning and long term planning, both of which are apparent in chimps.
A good book on this topic is 'Are We Smart Enough to Know How Smart Animals Are?' by Frans de Waal.
https://wwnorton.com/books/Are-We-Smart-Enough-to-Know-How-S...
It can’t do anything which implies a large or infinite token space (eg video understanding).
It’s also limited to a reasonable response length since token selection is probabilistic at each recursion. The longer you make it the more likely it is to veer off course.
Produce convincingly bad digital paintings.
I teach digital painting. Some of the students have incorporated AI into their working process, which I support. Others have tried to cheat by simply copying AI-generated output. Such cases are super-easy to spot: they carry the visual signature of AI art (which is mostly scrapings from ArtStation). This visual signature seems impossible to override. If only there were a way that AI could produce digital images bad enough to pass as genuine student output.
my favorite example is when a great dancer who tries to mockingly mimic a bad dancer still looks good and does it hilariously
I agree. This reminds me of the so-called school of kung fu called drunken master. There can be a can't-give-a-fuck about someone who is at the peak of their abilities.
Third paragraph. The entire article is based on the premise LLMs are supposed to reason, which is wrong. They don't, they're tools to generate text.
That's like saying "humans aren't supposed to reason, they're supposed to make sounds with their mouths".
I don't know... you're still describing a talking parrot here, if you'd ask me.
It's definitely a tokenizer issue; if GPT-4 was trained on singular characters I'm pretty sure it would be able to play Wordle much better. GPT-4 as trained today has quite lossy knowledge about the characters inside a specific token; a possible fix would be to embed that knowledge in the embeddings.
You can argue about how this doesn't really say anything surprising since the reversal of "A is B" is literally "B is A", but it's weird to expect elegant solutions to all problems on all fronts all at once, and we do have an incredibly simple data generation process here.
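Something like this, presumably: for every "A is B" fact in the corpus, also emit the reversed and question forms. The names below are just the thread's own examples, and the templates are illustrative:

    import random

    people = [("Tom Cruise", "Mary Lee Pfeiffer"), ("Bluey", "Bandit")]  # (child, parent)

    def make_pairs(child, parent):
        """Generate both directions of the relation, as statements and Q/A pairs."""
        return [
            f"{parent} is {child}'s parent.",
            f"{child}'s parent is {parent}.",            # the reversal, stated explicitly
            f"Who is {parent}'s child? {child}.",
            f"Who is {child}'s parent? {parent}.",
        ]

    dataset = [line for c, p in people for line in make_pairs(c, p)]
    random.shuffle(dataset)
    print("\n".join(dataset))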
The underlying point this article might be that LLMs don't understand the non-textual aspects of a grid. Which is a fair point, they model language, not space. I wouldn't expect text corpuses to explain space either, since possibly literally everyone who can read and write already knows a lot about spatial layouts.
LLMs rely on the statistical dependencies between words or parts of words. That means any question you ask whose answer is hard to determine from those statistical dependencies is extremely hard for an LLM. E.g. ChatGPT fails at determining the length of words made up of random characters. It will fail at applying even the simplest of rules, because encoding the rules in the statistical dependencies is extremely hard.
This issue, or at least similar ones, absolutely is due to tokenization issues.
Karpathy is right that nearly every modern problem with LLMs is due to tokenization, but if you don't believe him, maybe see this work by gwern: https://gwern.net/gpt-3#bpes or this work by yours truly: https://aclanthology.org/2022.cai-1.2/
Try this: represent all those ASCII representations of games with the letter Q replacing the newline, to properly convert the encoding into a representation approximating what LLMs "see" (not a table, but a stream interspersed with Qs at a regular interval). Pretty hard right?
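Concretely, the transformation is trivial to reproduce, and the result shows how much of the 2D structure becomes implicit (toy grid, 'Q' standing in for the newline):

    # What a flattened grid "looks like" once newlines become just another symbol.
    grid = (
        ".#.\n"
        "..#\n"
        "###\n"
    )
    flattened = grid.replace("\n", "Q")
    print(flattened)   # .#.Q..#Q###Q  -- the rows are now only implicit in the spacing of the Qs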
> LLMs cannot reset their own context
If you have a model hooked up to something agentic, I don't see why it couldn't perform context manipulation on itself or even selective realtime finetuning. Think you'll need info for the long haul, kick off some finetuning. Think you'd rather have one page of documentation in context than other, swap them out in one iteration. When you call LLMs over APIs you usually provide the entire context with each invocation...
> Devin
It's not that it's massively smarter or agentic, just that it has the opportunity to correct its mistakes rather than committing to the first thing to come out of it (and is being handheld by a vastly more knowledgeable SWE in its demos). You see cherrypicked examples (I also work on GenAI-for-coding) - just like a tragically incompetent employee could waste literal years on a project diligently plugging away at some task, so too can agentic models go off on a wild goose chase that accomplishes nothing besides making Nvidia more money. Just because something is highly persistent doesn't mean it will "converge" on a correct outcome.
That's the summary of "The book of Why" in one sentence as I understand it.
In particular, the line about "what models can't do tells us what they don't know" is infuriating.
No, that's not the case at all. At least in a number of instances, what they can't do is because of what they do know.
As an example, one of the can'ts I got from HN a year ago for GPT-4 was a variation of a classic logic puzzle. And indeed, the model can't solve it - nor can most major models since.
But it's not because the model can't solve the logic - it's because the token similarity to the standard form biases the output towards the standard solution. A hack as simple as changing the nouns to emojis can allow the model to get the correct answer and work through the logic successfully every attempt because it breaks that similarity bias.
People are way too confident around a topic where what's 'known' is more mercurial than maybe any field since 1930s particle physics.
I'd strongly recommend deleting 'never' or 'can't' from one's vocabulary on the subject unless one enjoys ending up with egg on one's face.
To compare to a brain, the LLM is like the prefrontal cortex or language and decision network in the outermost layer, but we would still need the amygdala in that metaphor - emotional drives, urges, episodic first-person memories, and experiential components that accompany the language and complete it with personhood.
For raw sensations and tactiles we might need that innermost brain stem - which is probably more chemistry than computation - for the "lights to be on". For example, some jobs will require not just language intelligence, and not just personhood, but for the light behind the images and feelings in the sensations, so that it feels (and would be) alive.
I don't know if I agree with this. The attention module is specifically designed to be selective and dynamic, otherwise it would not be much different than a word embedding (look up "soft" weights vs "hard" weights [1]). I think deep learning should not be confused with deep RL. LLMs are autoregressive models which means that they are trained to predict the next token and that is all they do. The next token is not necessarily the most reasonable (this is why datasets are super important for better performance). Deep RL models on the other hand, seem to be excellent at agency and decision making (although in restricted environment), because they are trained to do so.
[1] https://en.wikipedia.org/wiki/Attention_(machine_learning)
In the future the RL part of LLM training will increase a lot. Why am I saying this? There are two sources for learning - the past and the present. Training on human text is using past data, that is off-policy. But training on interactive data is on-policy. There is nothing we know that doesn't come from the environment. What is not written in any books must be learned from outside.
That is why I think supervised pre-training from human text is just half the story and RL based agent learning, interactivity in other words, is the next step. The two feed on which intelligence stands are language (past experience) and environment (present experience). We can't get ahead without both of them.
AlphaZero showed what an agent can learn from an environment alone, and LLMs show what they can learn from humans. But the world is big, there are plenty of environments that can provide learning signal, in other words feedback to LLMs.
Would LLMs cross this threshold if we were able to train them only on works that are "objectively good"? (If someone has better language than this, please enlighten me.)
That is to say: coherent, empathetic, transparent, free from bias, substantiated, free from “fluff”.
For example: For science one cannot simply train from all works published in scientific journals because of the papers that have been written irrespective of facts, or had the data changed, or have been written with specific agendas. In most cases even the experts have a hard time weeding out all the papers that are not “objectively good”. How could an LLM hope to make the determination during training?
In order for the LLM to meaningfully model what is coherent, empathetic, free from bias, it must also model the close to, but NOT-that.
If you’ll indulge me I’m going to think out loud a little.
What makes sense to me about this point:
- Having zero knowledge of “non-good” could lead to fragility when people phrase questions in “non-good” ways
- If an LLM is truly a “I do what I learned” machine, then “good” input + “good” question would output “good” output
- There may be a significant need for an LLM to learn the "chair is not-a-stool" aka "fact is not-a-fiction". An LLM that only gets affirming meanings might be wildly confused. If true, I think that would be an interesting area to research, not just for AI but for cognition. … now I wonder how many of the existing params are "not"s.
- There’s also the question of scale. Does an LLM need to “know” about mass extinction in order to understand empathy? Or can it just know about the emotions people experience during hard times? Children seem to do fine at empathy (maybe even better than adults in some ways) despite never being exposed to planet-sized tragedies. Adults need to deal with bigger issues where it can be important to have those tragedies front of mind, but does an LLM need to?
People have mentioned some other problems apart from those in the article. Someone should compile these and put them up, if that hasn't been done already.
Oh let's be cerebral about this stuff and ignore silly British nonsense. LLMs are a classic example of garbage in, garbage out, with a shonky curve fit veneer of science.
A next token guesser with a rather varied input quality is going to go off on one rather often. Given that we all have a different idea of truth adds to the fun.
I take care that my monocle doesn't snag in my lathe. Do be careful with your nob when tickling your LLM inappropriately.
To anyone using LLMs for meaningful code, I wish you luck maintaining the code long term and hope you really do enjoy doing code reviews.
Edit: ah, I spoke too soon. My first question was too "easy" but I asked a few more, and sure enough... it can understand what I'm asking and it can write an answer that's well formed, but it's fundamentally not understanding the rules of the game or giving me valid guesses. Cute!
They do seem to know the first letter of each word pretty well (they love to alliterate and can also produce a list of things starting with A then B then C etc) but they are all very terrible at spelling beyond that. I presume they could be trained to spell even while retaining tokens but I guess I don't know for certain.
I keep thinking there could be a way to iteratively train an LLM with declarative prompts, but as the article points out, it’s the chicken and egg problem. The LLM can’t provide a response unless it already knows the answer.
However, I believe this barrier will eventually be overcome. Just not anytime soon.
I might just get nerd sniped trying to teach it GoL now…
Yes but ultimately that includes all of math, logic, science, physics, etc. which as far as we can tell are fundamental truths of the universe. And if there's a large enough LLM that can capture enough constraints, functionally what's the difference between its intelligence and ours?
https://chat.openai.com/share/fc3cbc58-259f-4725-b9fd-df42df...
Once they absorb theorem provers, they will be able to do lots of math provably correctly. That does mean they should be unable to state "I have proved theorem A in Theory T with proof merkle root R" unless they actually did just that.
Can someone explain this sentence to me? It looks broken (hard to what?).
His attempts at training on Conway's Game of Life are kind of pathetic. The problem isn't a lack of training data, and neither is its "distribution". The fallacy lies in the fact that the dataset itself doesn't contain reasoning in the first place. For example, GitHub Copilot has fill-in-the-middle capability, while ChatGPT by default does not.
Now here is the shocker about the fill in the middle capability. How does the LLM learn to do it? It does it in an incredibly primitive way. Instead of building a model that can edit its own context, it receives a marker in the context that tells it about the cursor position and then it is finetuned on the expected response.
This means that an LLM could be trained to insert its token at any position in the context or even replace existing tokens, but here is the problem: Once the model has modified its own context, it has exited the training dataset. How do you evaluate the intermediate steps, which can consist of genuinely novel thoughts which are required, but not present in the data? Adding two numbers requires intermediate states which the model may even know how to produce, but it can never be rewarded to utilize them, if they aren't in the training data, because for the LLM, the only goal is to conform to the dataset.
If you wanted to avoid this, you would need to define a metric which allows the model to be rewarded for a success even if that success took a detour. Currently, training is inherently built around the idea of zero shot responses.
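For readers who haven't seen it, here's a sketch of how such a fill-in-the-middle training example is typically constructed; the sentinel token names vary by model and are illustrative here:

    # Hypothetical fill-in-the-middle (FIM) training example, in the style used
    # by several code models (sentinel names are placeholders).
    def make_fim_example(document: str, cursor_start: int, cursor_end: int) -> dict:
        prefix = document[:cursor_start]
        middle = document[cursor_start:cursor_end]   # the span the model must produce
        suffix = document[cursor_end:]
        prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
        return {"prompt": prompt, "completion": middle}

    doc = "def add(a, b):\n    return a + b\n"
    example = make_fim_example(doc, doc.index("return"), len(doc) - 1)
    print(example["prompt"])
    print("target:", example["completion"])

The model only ever learns to emit the missing span given a marker; it never learns to edit or reposition anything in its own context, which is the point being made above.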
This is not a tokenization artefact. And furthermore it's a problem for human brains as well.
Let's say you get a name, idk, Tom Cruise. You immediately know what his face looks like. Now let's say you get a random face. How quickly would you be able to tell me what that person is named? Likely a lot of "uhhs" and "ermms" will follow. It's super hard for us to generalize this reversal automatically in lots of cases. Associations tend to be one directional.
Here's the intro to the paper that brought this to light: https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms...
If you mean “consistent with a prior run with the same input”, LLMs can absolutely do that, though for most purposes settings are chosen deliberately which do not.
If you mean “consistent with some external structural specification”, LLMs can do that, too, e.g., via grammar specifications.
Why should it surprise anyone that it would fail at cellular automata?
A simple example is that they fundamentally can not balance parentheses more than half their context width.
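For contrast, the task itself needs only a running counter, O(n) time and O(1) memory, which is part of why it's a popular probe:

    def balanced(s: str) -> bool:
        """Depth counter for parentheses: trivial outside an LLM."""
        depth = 0
        for ch in s:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False
        return depth == 0

    print(balanced("(()())"))   # True
    print(balanced("(()"))      # False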
We've all seen something that looks amazing, but few seem to know what we're looking at.
I am unsettled by what I see as a division of thought between extolling AI's amazing effects on one hand and mysterious regards for how it works and its limits on the other.
Noting Arthur Clarke's dictum that 'technology can be sufficiently advanced to be indistinguishable from magic,' AI enthusiasm looks like a Feynman cargo cult.
But technology arousing magical thinking with little discussion of principle of operation and limits is common enough.
This was very much the case with arrival of personal computing: there's something the device is intended to do but most people aren't sure what that is. The devices crashed enough and went out of date so fast you felt ok for not understanding them.
It was even worse with the mobile+web as so much change happened so fast that a generation has been dumbstruck: look at USA politics.
I was looking at old episodes of the Computer Chronicles from the early 90s on YT, and by that time the show had close to a thousand episodes, but they could barely explain the significance of Windows 3.0 and the Pentium. As to what to expect from this stuff, they didn't even try; it was mindless rambling and upsell interspersed with stern warnings from the Software Publishers Association that mucking with the code is a Federal offense. The show's guests all had something to sell with a half life of 3–6 months. For the Pentium episode they show a PC lab with nerds in Dockers (khaki pants) studiously examining how many fans it might take to keep a lanman server from overheating and crashing. Many were amazed by it all.
Also available on YT are old AT&T videos, including an introduction to UNIX with Kernighan & Ritchie. The presentation entirely focuses on the power of the shell. They seemed much more reserved and competent in retrospect, but in its time they looked like a pure priesthood.
Maybe the arrival of AI stuff is not so different from the arrival of personal computing?
But AI is basically just one app, and I get the feeling that the scene is far more enigmatic to the point that even the people building the kit don't really know why it does what it does, and no one seems to have a clear idea of what correct functioning means.
LLMs mimic human language which is separate from reasoning. Tech bros are remarkably ignorant of the field of linguistics and don’t appreciate this distinction. They thus mistake the output of LLMs for reason.
I suspect these are all tokenization artifacts, but I'll probably take some time to try out the Conway's Game of Life problem by finetuning a model. A few issues I've noticed from the problems proposed in the article:
1. Wordle. This one TBH is a clear tokenization problem, not a proof of the reasoning capabilities of LLMs or lack thereof. LLMs are trained on multi-character tokens, and consume words as multi-character tokens: they don't "see" characters. Wordle is primarily a game based around splitting words into discrete characters, and LLMs can't see the characters they're supposed to operate on if you give them words — and depending on how you structure your answers, they also might not be able to see your answers! By breaking the words and answers into character-by-character sequences with spaces in between the characters (forcing the tokenizer into breaking each character into a separate token visible to the LLM), I successfully got GPT-4 to guess the word "BLAME" on my first attempt at playing Wordle with it: https://chat.openai.com/share/cc1569c4-44c3-4024-a0c2-eeb498...
2. Conway's Game of Life. Once again, the input sequences are given as a single, long string with no spacing, which will probably result in it being tokenized and thus partially invisible to the LLM. This one seems somewhat annoying to prompt, so I haven't tried yet, but I suspect a combination of better prompting and maybe finetuning would result in the LLM learning to solve the problem.
Similarly, complaints about finetuned models not being able to generalize well on input sequences of lengths longer than they were trained on are most likely token-related. Each token an LLM sees (both during training and inference) is encoded alongside its absolute position in the input sequence; while you as a human being see 1 and 1 1 and 1 1 1 as repeated series of 1s, an LLM would see those characters as being at least somewhat distinct. Given a synthetic dataset of a specific size, it can start to generalize over problems within the space that it sees, but if you give it new data outside of that context space, the new data will not be visible to the LLM as being necessarily related to what it was trained on. There are architectural tricks to get around it (e.g. RoPE scaling), but in general I wouldn't make generalizations about what models can or can't "reason" about based on using context window sizes the model didn't see during training: that's more about token-related blindspots and not about whether the model can be intelligent — at least, intelligent within the context window it's trained on.
One thing the author repeats several times throughout the article is that the mistakes LLMs make are far more instructive than their successes. However, I think in general this is not the case: if they can succeed sometimes, anyone who's spent much time finetuning knows that you can typically train them to succeed more reliably. And the mistakes here don't necessarily seem instructive at all: they're tokenization artifacts, and rewriting the problem to work around specific types of blindness (at least in Wordle's case) seems to allow the LLMs to succeed.
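To see concretely what the character-splitting trick in point 1 changes, you can inspect the tokenization directly; the exact split depends on the vocabulary, but whole words generally fuse several letters into one token while spaced-out letters do not:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4-era tokenizer family
    for text in ("BLAME", "B L A M E"):
        pieces = [enc.decode([t]) for t in enc.encode(text)]
        print(repr(text), "->", pieces)
    # "BLAME" typically comes out as one or two multi-character pieces,
    # while "B L A M E" yields roughly one piece per letter.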
FWIW, the author brings up Victor Taelin's famous A::B problem; I believe I was the first to solve it [1] (albeit via finetuning, so ineligible for the $10k prize; although I did it before the prize was announced, just for the pleasure of playing around with an interesting problem). While I think that it's generally a useful insight to think of training as giving more intuition than intelligence, I do think the A::B problem getting solved eventually even by pure prompting shows that there's actually intelligence in there, too — it's not just intuition, or stochastic parroting of information from its training set. However, tokenization issues can easily get in the way of these kinds of problems if you're not aware of them (even the winning Claude 3 Opus prompt slightly rephrased the problem to get it to work with the tokenizer), so the models actually can appear dumber than they really are.
1. https://twitter.com/reissbaker/status/1776531331562033453
We cannot think of anything beyond 4 dimensions, so therefore there must be nothing beyond that, or the things that exist in those dimensions don't matter that much. Or more precisely, we simply cannot appreciate those things.
If we are simply trying to mimic human intelligence...well, you are going to end up with a human brain.
Suppose we have a concept X that humans simply cannot comprehend, appreciate or solve; well, why bother creating an intelligence to solve that?
From this hypothesis, I personally think that any intelligence that we create will simply be an augmentation of what humans desire. That is, there will always be a human part in the cog, because humans are the only ones who can appreciate what is being created, so any and all output must cater to the humans involved. This will inevitably happen because we want whatever it is the human brain is doing, without doing whatever it is that the human brain is doing.
That is until we unleash a different intelligence system with agency.