>not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you feed it you'll still be lagging behind a transformer (trained on much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.
>on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say architecture research "does not matter" and "makes no difference". i hear people use this a lot to justify not innovating at the architecture level.
>the truth is the community stands on the shoulders of all the architecture research that has been done to push the transformer to where it is today.
>architecture research matters. many people just take it for granted these days.
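The "tokens cannot even see each other" point is easy to make concrete. Here's a tiny numpy sketch (mine, not the quoted commenter's): a position-wise MLP transforms each token independently, so perturbing one token cannot change another token's output, while a single self-attention head mixes information across all positions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # sequence length, model dim
x = rng.normal(size=(T, d))      # token embeddings

# Position-wise MLP: each row (token) is transformed independently.
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
def mlp(x):
    return np.maximum(x @ W1, 0) @ W2

x2 = x.copy()
x2[3] += 1.0                     # perturb token 3
# Token 0's MLP output is unchanged: it never "saw" token 3.
assert np.allclose(mlp(x)[0], mlp(x2)[0])

# Single self-attention head: every output row depends on every token.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
def attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = np.exp(q @ k.T / np.sqrt(d))
    a /= a.sum(axis=1, keepdims=True)
    return a @ v

# Perturbing token 3 changes token 0's attention output.
assert not np.allclose(attend(x)[0], attend(x2)[0])
```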
Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.
What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
Edit: perhaps it'd be best to give a specific example. Let's say you train two pairs of networks: (1) a Mamba SSM and a Transformer, both on the Pile; (2) two Transformers, one trained on the Pile, the other on Reddit comments. All are trained to the same MMLU performance.
I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
You, sir, are my hero.
Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer", seems a tad circular...
I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.
I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.
Some tasks are going to be easier to learn than others, and in general you can certainly have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (a combination of architecture + size) and well trained.
That said, it's notable that all the Pareto-optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change has been scaling up or minor tweaks like MoE and different types of attention.
How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) !
So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data.
No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism.
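For readers unfamiliar with induction heads: the behavior is roughly "find an earlier occurrence of the current token and copy what followed it". Here's my own toy sketch of that rule (not code from the interpretability papers), operating on plain token lists:

```python
def induction_predict(tokens):
    """Toy induction-head rule: scan backwards for the most recent
    earlier occurrence of the current token, and predict the token
    that immediately followed it."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None  # no match; a real model would fall back to other circuits

# Having seen "Mr D ursley" once, the rule completes "Mr D" -> "ursley".
print(induction_predict(["Mr", "D", "ursley", "said", "Mr", "D"]))  # -> ursley
```

A real transformer implements this softly, across two attention layers, but the copy-and-match flavor is the same, and it's a program the model learns from data rather than one baked into the architecture.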
His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".
There is an implicit assumption here that seems obviously false - that this "convergence point" of predictive performance represents the best that can be done with the data, which is to imply that these current models are perfectly modelling the generative process - the human brain.
This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?
The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.
The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.
The models eventually reach saturation in how much they can fit. There's reason to believe that current LLMs are underfit relative to what their sizes could theoretically utilize, but it could also be that the optimization algorithms are simply not capable of easily and efficiently using another 2x data to fill out that capacity. Doubling the model size on the same training data, and letting it be even more underfit, could result in a better model.
So far it doesn't seem to be panning out that way though. Companies such as OpenAI, Anthropic and Reka don't have any special internal sources of data, yet all have trained SOTA models.
Probably the main reason for this is that data type/quality matters more than quantity, which is why most of these companies are now using self-generated synthetic data.
The companies/institutes that will have a data advantage are those that have private datasets consisting of a different type (or maybe higher quality?) of data than publicly available, but this seems more likely to be in specialized domains (medical, etc), rather than what is useful for general intelligence.
I assume that, longer term, we'll have better AI architectures capable of realtime learning, and then the focus may switch to on-the-job training and learning ability, rather than data.
A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and such can be fiddled with once you know you are on the right track, and in the scheme of things are relatively minor for most purposes.
Of course, this advice gets routinely ignored as people spend countless hours fussing over how to set certain flags and grabbing as much data as possible, then carelessly throwing it all together and training it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.
It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
* https://www.unite.ai/everything-you-need-to-know-about-llama...
That being said, if you use a linear model (like lasso) vs. a tree-based model (like XGBoost) you'll definitely see differences, but once you have a flexible enough model and a lot of data, training time and inference complexity tend to become better ways to make a model choice.
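The linear-vs-tree gap is easy to demonstrate. A quick sketch (assuming scikit-learn is available; the target function is one I made up) where the signal is a pure interaction, which a linear model cannot represent at all:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(X[:, 0]) * np.sign(X[:, 1])   # interaction-only target: no linear signal

Xtr, Xte, ytr, yte = X[:1500], X[1500:], y[:1500], y[1500:]
lin = Lasso(alpha=0.01).fit(Xtr, ytr)
gbt = GradientBoostingRegressor().fit(Xtr, ytr)

print("lasso R^2:", lin.score(Xte, yte))  # typically near 0
print("gbt   R^2:", gbt.score(Xte, yte))  # much higher
```

Once both model families are flexible enough for the task (as with large NNs), this kind of gap shrinks, which is the point being made above.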
There are countless competitions, etc. on Kaggle, AICrowd, or other platforms with an enforced standardized data set. Every entrant uses the same data set and there's a huge difference between the best and worst submissions.
Are you referring to the current state of our best existing models or the potential future of ML? I find it incredibly hard to see how an LLM could implement the best “physically allowable” approximation to Solomonoff induction.
Then again, I thought it was extremely unlikely neural networks would have the abilities they currently exhibit, so who knows.
It is indeed a marvel that it works nearly as well as it does.
But then again, evolution is even dumber (in the sense that it only makes random choices that thrive or perish, and can't even take gradients into account), but evolution has still managed to produce intelligent critters.
I guess when you have enough dimensions greedy approaches to optimisation / hill climbing can work well enough, even when you have challenging problems?
Especially if you are allowed to move to some meta levels. Eg evolution doesn't build planes, it built brains that can figure out how to build planes. Similarly with back propagation perhaps.
The most notable voice refuting this opinion on Twitter was Yi Tay (founder of Reka.ai), who definitely does not belong to either of those categories!
Tay (ex-Google Brain) founded Reka.ai two years ago, and their latest multimodal language model is close to SOTA in performance.
Also arguments from authority are boring.
You would get into natural language modelling because you had a deep love of language. Because you think you're close to figuring language out in a systematic way, with just a few years more study.
There's a certain sadness, I think, in the revelation that the robots don't need the expertise of humanity's greatest experts and masters, they just need us to click all the squares that contain a motorcycle.
It makes me sad when people rediscover things (with massive compute in this case), that were already known.
It's very much spend a year in the lab to save an hour in the library.
We don’t have any dataset of dog or cat experience, right? OP probably means that the model learns what a dog or cat is, right?
I find the whole piece somewhat vague btw. No real insights if you ask me. Sure if all you put in is a dataset, that should be all you get out. What’s surprising (worth HN) here?
I think he's referring to the famous paper: "What is it like to be a bat"
https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F
Yes, "What it means to be" does appear to be meant that way and it didn't occur to me to interpret it the other way.
> Sure if all you put in is a dataset, that should be all you get out. What's surprising (worth HN) here?
You put in a particular choice of nn architecture as well as the dataset. The insight (to the extent that it is insightful, and true) is that the architecture doesn't affect the results you get much compared to the dataset.
The second still feels like "duh". It’s what these models are meant to do, right? Form an internal representation of the relations hidden in the data. It’s what complex systems are: they hold models of reality and use those to predict. That is in fact what Claude Shannon meant with his definition of information. Idk, maybe I’m getting it wrong.
In some other comment I read this. Sounds very much like a curation thing. And now I'm wondering; isn't this part already covered by a lot of human beings now interacting with ChatGPT and the like?
My uneducated guess is that a company can scrape the whole world wide web and also have all the low quality content that comes with it, but then strengthen/curate their data and/or model by having it interact with humans? You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct this by 'chatting' with it?
Eh, you can still often (!) figure out whether what the LLM says makes sense.
Just like you can often figure out whether a human is bullshitting, by fact checking with other sources, or going over their reasoning.
Start with the best data you can, and task train ("rlhf") behavior not preference.
I think that would be a really cool experiment.
There are probably some really good candidate concepts that just take a small leap of reasoning to reach.
But off the top of my head maybe multiplication? Or the concept of zero. Maybe the wheel?
Edit: if anyone is interested in doing this kind of stuff, hit me up (email in profile). I want to start doing these kinds of things as a side project.
Who's Harry Potter? Approximate Unlearning in LLMs https://arxiv.org/abs/2310.02238
See also The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported https://arxiv.org/abs/2403.12082v1
Though I'm not sure its output would make much sense, and you might have to use beam search (or something like backtracking).
I wonder how you would train a model to directly speak without e. Perhaps you use the general model like above with beam search, and then train a new model to directly predict the first model's beam-searched predictions.
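The filtering half of this is simple to sketch: mask out every candidate token containing "e" before normalizing, then decode as usual. Toy code below, with a fake scoring function standing in for a real LM's logits (everything here is hypothetical, just to show the masking step):

```python
import math, random

vocab = ["the", "a", "dog", "cat", "ran", "sat", "fast", "slowly", "jumped"]

def fake_logits(context):
    # Stand-in for a real LM's next-token scores (made up, seeded for determinism).
    random.seed(len(context))
    return [random.gauss(0, 1) for _ in vocab]

def sample_without_e(context):
    logits = fake_logits(context)
    # Hard-mask every token containing the letter 'e' before the softmax.
    masked = [l if "e" not in w else -math.inf for l, w in zip(logits, vocab)]
    probs = [math.exp(l) for l in masked]     # exp(-inf) == 0.0: masked out
    total = sum(probs)
    probs = [p / total for p in probs]
    return max(zip(probs, vocab))[1]          # greedy pick, for the sketch

word = sample_without_e(["a"])
assert "e" not in word
```

Distilling a second model on transcripts produced this way (the beam-searched predictions mentioned above) would then bake the constraint into the weights instead of the decoder.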
https://static.googleusercontent.com/media/research.google.c...
See also "You won't train a better model from your desk": https://news.ycombinator.com/item?id=40155715
Consider a chess engine that plays at grandmaster level, i.e. a human grandmaster can sometimes beat it. Even though it's not the best chess engine in the world, it simulates billions of possible scenarios to decide each move. Yet the grandmaster can still beat it sometimes, even though he clearly isn't thinking about billions of possible scenarios. (On the question of whether human brains may in fact unconsciously process billions of possibilities when deciding a chess move, using some neurological process we haven't discovered, I've heard David Deutsch argue this would be thermodynamically impossible as it would require far more energy than the brain consumes.) So the human grandmaster's brain must be doing something else that we don't understand. I think a similar comparison applies with how an LLM and a human choose the next word to say. An LLM has to run a giant statistical search for candidates. Humans seem to be doing something else.
LLMs don't work this way.
Give me a Neural Net in its first epoch and I shall mold it into anything!
Isn't this exactly what Naftali Tishby has been talking about [1]?