Better HN

0 points · wavemode · 4mo ago · 0 comments

Statistical models generalize. If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2).

Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since the author himself admits that he doesn't agree with all of the LLM's comments.)

I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.


15 comments · 7 top-level

kristiandupont · 4mo ago · 3 in thread

I had Claude help me get a program written for Linux to compile on macOS. The program is written in a programming language the author invented for the project, a pretty unusual one (for example, it allows spaces in variable names).

Claude figured out how the language worked and debugged segfaults until the compiler compiled, and then until the program did. That might not be magic, but it shows a level of sophistication where referring to “statistics” is about as meaningful as describing a person as the statistics of electrical impulses between neurons.

compass_copium · 4mo ago

But the programming language has explicitly laid out rules. It was not trained on those sets of rules, but it was trained on many trillions of lines of code. It has a map of how programs work, and an explanation of this new language. It's using training data and data it's fed to generate that result.

selridge · 4mo ago

What doesn't that explain tho?

What behavior would you need to see for that explanation to no longer hold? Because it seems like it explains too much.


orf · 4mo ago

That’s still over-general to the point of being useless.

What you wrote would apply to a human approaching this task as well, sans the “many trillion lines of code”.

c22 · 4mo ago · 2 in thread

> If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

This is an interesting claim to me. Are there any models that exist that have been trained with a (single digit) number omitted from the training data?

If such a model does exist, how does it represent the answer? (What symbol does it use for the '7'?)

wavemode (OP) · 4mo ago

When I say "model" here I'm referring to any statistical model (in this example, probably linear regression). Not specifically large language models / neural networks.
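As a rough illustration of the linear regression reading, here is a minimal sketch in plain Python. It assumes x = 0, so the two training points are (0, 5) and (1, 6); the helper name `fit_line` is made up for this example. The value 7 never appears in the training data, yet the fitted line produces it at x = 2.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training data (assumption: x = 0, so f(0) = 5 and f(1) = 6).
xs = [0.0, 1.0]
ys = [5.0, 6.0]

slope, intercept = fit_line(xs, ys)

# Extrapolate to x + 2, a point the model never saw.
prediction = slope * 2.0 + intercept
print(slope, intercept, prediction)
```

The same arithmetic works on paper: with two points the slope is (6 - 5) / (1 - 0) = 1 and the intercept is 5, so the line gives 5 + 2 = 7. Note that this only works because a straight line was chosen as the model in advance.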

c22 · 4mo ago

Gotcha, I don't think I know enough about it. What constitutes training data for a (non neural network) statistical model? Is this something I could play around with myself with pen and paper?


selridge · 4mo ago · 2 in thread

Ok cool cool. Instead of pretending you need to teach me, you could engage with what I'm saying or even the OP!

"I don't know how you get here from 'predict the next word'" is not so much a statement of ignorance, where someone needs you to step in, as a reflection that perhaps the tech is not so easily explained as that. No magic needs to be present for that to be the case.

wavemode (OP) · 4mo ago

If you disagree with someone on the internet, you can just say "I disagree, and here's why". You don't have to aggressively accuse them of "not engaging" with the text.

I engaged. You just don't like what I wrote. That's okay.

selridge · 4mo ago

Thanks but no thanks.

arkh · 4mo ago · 1 in thread

I expected (and still expect) a lot from LLMs in cross-disciplinary research.

I think they should be the perfect tool to find methods or results in one field which look like they could be used in another field.

WithinReason · 4mo ago

This might actually be a limitation of the "predict next word" approach since the network is never trained to predict a result in one field from a result in another. It might still make the connection though, but not as easily.

Kim_Bruning · 4mo ago

If you run an LLM in an autoregressive loop you can get it to emulate a Turing machine though. That sort of changes the complexity class of the system just a touch. 'Just predicts the next word' hits different when the loop is doing general computation.

Took me a bit of messing around, but try to write out each state sequentially, with a check step between each.
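To make the loop structure concrete, here is a small sketch in Python. It is not an LLM: a hard-coded transition table stands in for the model's "predict the next configuration" call, and all names (`RULES`, `predict_next`, `check`, `run`) are hypothetical. The point is only the shape: write out each state in sequence, and validate it before feeding it back in.

```python
# Transition table for a tiny machine that appends one '1' to a unary number.
# (state, symbol read) -> (symbol to write, head move, next state)
RULES = {
    ("scan", "1"): ("1", +1, "scan"),
    ("scan", "_"): ("1", 0, "halt"),
}


def predict_next(config):
    """Stand-in for one model call: map the current configuration to the next."""
    state, tape, head = config
    write, move, next_state = RULES[(state, tape[head])]
    new_tape = tape[:head] + write + tape[head + 1:]
    new_head = head + move
    if new_head == len(new_tape):
        new_tape += "_"  # extend the tape with a blank cell on demand
    return (next_state, new_tape, new_head)


def check(config):
    """The check step: reject malformed configurations before continuing."""
    state, tape, head = config
    return (
        state in {"scan", "halt"}
        and set(tape) <= {"1", "_"}
        and 0 <= head < len(tape)
    )


def run(tape, max_steps=100):
    """Feed each configuration back in until the machine halts."""
    config = ("scan", tape + "_", 0)
    trace = [config]
    for _ in range(max_steps):
        if config[0] == "halt":
            break
        config = predict_next(config)
        if not check(config):
            raise ValueError(f"invalid configuration: {config}")
        trace.append(config)
    return trace


trace = run("111")
for step in trace:
    print(step)
```

With a real model in place of `predict_next`, the check step is what catches a mis-predicted state before it propagates through the rest of the run.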

ainch · 4mo ago

Sorry, but this is famously not true! There is no guarantee that statistical models generalise. In your example, whether or not your model generalises depends entirely on which function class you fit: depending on its complexity, f(x+2) could be 7, 8, or -500.

One of the surprises of deep learning is that it can, sometimes, defy prior statistical learning theory to generalise, but this is still poorly understood. Concepts like grokking, double descent, and the implicit bias of gradient descent are driving a lot of new research into the underlying dynamics of deep learning. But I'd say it is pretty ahistoric to claim that this is obvious or trivial - decades of work studied "overfitting" and related problems where statistical models fail to generalise or even interpolate within the support of their training data.
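A quick way to see the function-class point, as a sketch: assume x = 0, so the training points are (0, 5) and (1, 6). The family g(x) = 5 + x + c·x·(x − 1) is a hypothetical choice made up for this example; every member fits both points exactly, whatever c is, but the members disagree at x = 2.

```python
def make_model(c):
    """Return g(x) = 5 + x + c * x * (x - 1); c is a free parameter."""
    def g(x):
        return 5 + x + c * x * (x - 1)
    return g


results = {}
for c in (0.0, 0.5, -253.5):
    g = make_model(c)
    # Every choice of c reproduces the training data exactly.
    assert g(0) == 5 and g(1) == 6
    # The extrapolation is 7 + 2c, so it depends entirely on c.
    results[c] = g(2)
    print(c, g(2))
```

The two data points say nothing about c, so the training data alone cannot pick between 7, 8, and -500. Restricting the model to straight lines (c = 0) is what produces 7.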

red75prime · 4mo ago

I think the relevant question is: can a statistical model (or a transformer, in particular) generalize to general reasoning ability?

