>not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you feed it you'll still be lagging behind a transformer (trained on much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.
>on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say architecture research "does not matter" and "makes no difference". i hear people use this a lot to justify not innovating at the architecture level.
>the truth is the community stands on the shoulders of all the architecture research that has been done to push the transformer to where it is today.
>architecture research matters. many people just take it for granted these days.
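The "tokens cannot even see each other" point is easy to make concrete. Here's a tiny numpy sketch (mine, not the quoted commenter's): a position-wise MLP transforms each token independently, so perturbing one token cannot change another token's output, while a single self-attention head mixes information across all positions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # sequence length, model dim
x = rng.normal(size=(T, d))      # token embeddings

# Position-wise MLP: each row (token) is transformed independently.
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
def mlp(x):
    return np.maximum(x @ W1, 0) @ W2

x2 = x.copy()
x2[3] += 1.0                     # perturb token 3
# Token 0's MLP output is unchanged: it never "saw" token 3.
assert np.allclose(mlp(x)[0], mlp(x2)[0])

# Single self-attention head: every output row depends on every token.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
def attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = np.exp(q @ k.T / np.sqrt(d))
    a /= a.sum(axis=1, keepdims=True)
    return a @ v

# Perturbing token 3 changes token 0's attention output.
assert not np.allclose(attend(x)[0], attend(x2)[0])
```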
Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.
What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
Edit: perhaps it'd be best to give a specific example. Let's say you train two pairs of networks: (1) a Mamba SSM and a Transformer, both on the Pile; (2) two Transformers, one trained on the Pile, the other on Reddit comments. All are trained to the same MMLU performance.
I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
You, sir, are my hero.
Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer", seems a tad circular...
I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.
I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.
Some tasks are going to be easier to learn than others, and in general you can certainly have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (a combination of architecture + size) and well trained.
That said, it's notable that all the Pareto-optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change has been scaling up or minor tweaks like MoE and different types of attention.
How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) !
So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data.
No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism.
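For readers unfamiliar with induction heads: the behavior is roughly "find an earlier occurrence of the current token and copy what followed it". Here's my own toy sketch of that rule (not code from the interpretability papers), operating on plain token lists:

```python
def induction_predict(tokens):
    """Toy induction-head rule: scan backwards for the most recent
    earlier occurrence of the current token, and predict the token
    that immediately followed it."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None  # no match; a real model would fall back to other circuits

# Having seen "Mr D ursley" once, the rule completes "Mr D" -> "ursley".
print(induction_predict(["Mr", "D", "ursley", "said", "Mr", "D"]))  # -> ursley
```

A real transformer implements this softly, across two attention layers, but the copy-and-match flavor is the same, and it's a program the model learns from data rather than one baked into the architecture.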
His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".
There is an implicit assumption here that seems obviously false - that this "convergence point" of predictive performance represents the best that can be done with the data, which is to imply that these current models are perfectly modelling the generative process - the human brain.
This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?
The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.
The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.
The models eventually reach saturation in how much they can fit. There's reason to believe that current LLMs are underfit relative to what their sizes could theoretically utilize, but it could also be that the optimization algorithms are simply not capable of easily and efficiently using another 2x data to fill out that capacity. Doubling the model size on the same training data, and letting it be even more underfit, could result in a better model.
So far it doesn't seem to be panning out that way though. Companies such as OpenAI, Anthropic and Reka don't have any special internal sources of data, yet all have trained SOTA models.
Probably the main reason for this is that data type/quality matters more than quantity, which is why most of these companies are now using self-generated synthetic data.
The companies/institutes that will have a data advantage are those that have private datasets consisting of a different type (or maybe higher quality?) of data than publicly available, but this seems more likely to be in specialized domains (medical, etc), rather than what is useful for general intelligence.
I assume that, longer term, we'll have better AI architectures capable of realtime learning, and then the focus may switch to on-the-job training and learning ability, rather than data.
A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and such can be fiddled with once you know you are on the right track, and in the scheme of things are relatively minor for most purposes.
Of course, this advice gets routinely ignored as people spend countless hours fussing over how to set certain flags and grabbing as much data as possible, then carelessly throwing it all together and training it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.
It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
* https://www.unite.ai/everything-you-need-to-know-about-llama...
That being said, if you use a linear model (like lasso) vs. a tree-based model (like XGBoost) you'll definitely see differences, but once you have a flexible enough model and a lot of data, training time and inference complexity tend to become better ways to make a model choice.
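The linear-vs-tree gap is easy to demonstrate. A quick sketch (assuming scikit-learn is available; the target function is one I made up) where the signal is a pure interaction, which a linear model cannot represent at all:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(X[:, 0]) * np.sign(X[:, 1])   # interaction-only target: no linear signal

Xtr, Xte, ytr, yte = X[:1500], X[1500:], y[:1500], y[1500:]
lin = Lasso(alpha=0.01).fit(Xtr, ytr)
gbt = GradientBoostingRegressor().fit(Xtr, ytr)

print("lasso R^2:", lin.score(Xte, yte))  # typically near 0
print("gbt   R^2:", gbt.score(Xte, yte))  # much higher
```

Once both model families are flexible enough for the task (as with large NNs), this kind of gap shrinks, which is the point being made above.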
There are countless competitions, etc. on Kaggle, AICrowd, or other platforms with an enforced standardized data set. Every entrant uses the same data set and there's a huge difference between the best and worst submissions.
Are you referring to the current state of our best existing models or the potential future of ML? I find it incredibly hard to see how an LLM could implement the best “physically allowable” approximation to Solomonoff induction.
Then again, I thought it was extremely unlikely neural networks would have the abilities they currently exhibit, so who knows.
It is indeed a marvel that it works nearly as well as it does.
But then again, evolution is even dumber (in the sense that it only makes random choices that thrive or perish, and can't even take gradients into account), but evolution has still managed to produce intelligent critters.
I guess when you have enough dimensions greedy approaches to optimisation / hill climbing can work well enough, even when you have challenging problems?
Especially if you are allowed to move to some meta levels. Eg evolution doesn't build planes, it built brains that can figure out how to build planes. Similarly with back propagation perhaps.
The most notable voice refuting this opinion on Twitter was Yi Tay (founder of Reka.ai), who definitely does not belong to either of those categories!
Tay (ex-Google Brain) founded Reka.ai two years ago, and their latest multimodal language model is close to SOTA in performance.
Also arguments from authority are boring.
You would get into natural language modelling because you had a deep love of language. Because you think you're close to figuring language out in a systematic way, with just a few years more study.
There's a certain sadness, I think, in the revelation that the robots don't need the expertise of humanity's greatest experts and masters, they just need us to click all the squares that contain a motorcycle.
It makes me sad when people rediscover things (with massive compute in this case), that were already known.
It's very much spend a year in the lab to save an hour in the library.
We don’t have any dataset of dog or cat experience, right? OP probably means that the model learns what a dog or cat is, right?
I find the whole piece somewhat vague btw. No real insights if you ask me. Sure if all you put in is a dataset, that should be all you get out. What’s surprising (worth HN) here?
I think he's referring to the famous paper: "What is it like to be a bat"
https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F
Yes, "What it means to be" does appear to be meant that way and it didn't occur to me to interpret it the other way.
> Sure if all you put in is a dataset, that should be all you get out. What's surprising (worth HN) here?
You put in a particular choice of nn architecture as well as the dataset. The insight (to the extent that it is insightful, and true) is that the architecture doesn't affect the results you get much compared to the dataset.
The second still feels like "duh". It’s what these models are meant to do, right? Form an internal representation of the relations hidden in the data. It’s what complex systems are: they hold models of reality and use those to predict. That is in fact what Claude Shannon meant with his definition of information. Idk, maybe I’m getting it wrong.
In some other comment I read this. Sounds very much like a curation thing. And now I'm wondering; isn't this part already covered by a lot of human beings now interacting with ChatGPT and the like?
My uneducated guess is that a company can scrape the whole world wide web and also have all the low quality content that comes with it, but then strengthen/curate their data and/or model by having it interact with humans? You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct this by 'chatting' with it?
Eh, you can still often (!) figure out whether what the LLM says makes sense.
Just like you can often figure out whether a human is bullshitting, by fact checking with other sources, or going over their reasoning.
Start with the best data you can, and task train ("rlhf") behavior not preference.
I think that would be a really cool experiment.
There are probably some really good candidate concepts that just take a small leap of reasoning to reach.
But off the top of my head maybe multiplication? Or the concept of zero. Maybe the wheel?
Edit: if anyone is interested in doing this kind of stuff, hit me up (email in profile). I want to start doing these kinds of things as a side project.
Who's Harry Potter? Approximate Unlearning in LLMs https://arxiv.org/abs/2310.02238
See also The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported https://arxiv.org/abs/2403.12082v1
Though I'm not sure its output would make much sense, and you might have to use beam search (or something like backtracking).
I wonder how you would train a model to directly speak without e. Perhaps you use the general model like above with beam search, and then train a new model to directly predict the first model's beam-searched predictions.
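The filtering half of this is simple to sketch: mask out every candidate token containing "e" before normalizing, then decode as usual. Toy code below, with a fake scoring function standing in for a real LM's logits (everything here is hypothetical, just to show the masking step):

```python
import math, random

vocab = ["the", "a", "dog", "cat", "ran", "sat", "fast", "slowly", "jumped"]

def fake_logits(context):
    # Stand-in for a real LM's next-token scores (made up, seeded for determinism).
    random.seed(len(context))
    return [random.gauss(0, 1) for _ in vocab]

def sample_without_e(context):
    logits = fake_logits(context)
    # Hard-mask every token containing the letter 'e' before the softmax.
    masked = [l if "e" not in w else -math.inf for l, w in zip(logits, vocab)]
    probs = [math.exp(l) for l in masked]     # exp(-inf) == 0.0: masked out
    total = sum(probs)
    probs = [p / total for p in probs]
    return max(zip(probs, vocab))[1]          # greedy pick, for the sketch

word = sample_without_e(["a"])
assert "e" not in word
```

Distilling a second model on transcripts produced this way (the beam-searched predictions mentioned above) would then bake the constraint into the weights instead of the decoder.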
https://static.googleusercontent.com/media/research.google.c...
See also "You won't train a better model from your desk": https://news.ycombinator.com/item?id=40155715
Consider a chess engine that plays at grandmaster level, i.e. a human grandmaster can sometimes beat it. Even though it's not the best chess engine in the world, it simulates billions of possible scenarios to decide each move. Yet the grandmaster can still beat it sometimes, even though he clearly isn't thinking about billions of possible scenarios. (On the question of whether human brains may in fact unconsciously process billions of possibilities when deciding a chess move, using some neurological process we haven't discovered, I've heard David Deutsch argue this would be thermodynamically impossible as it would require far more energy than the brain consumes.) So the human grandmaster's brain must be doing something else that we don't understand. I think a similar comparison applies with how an LLM and a human choose the next word to say. An LLM has to run a giant statistical search for candidates. Humans seem to be doing something else.
LLMs don't work this way.
Give me a Neural Net in its first epoch and I shall mold it into anything!
Isn't this exactly what Naftali Tishby has been talking about [1]?