rustystump 7mo ago · 0 points

The problem is that the overwhelming majority of input it has in fact seen somewhere in the corpus it was trained on. Certainly not one for one, but easily a 98% match. This is the whole point of what the other person is trying to comment on, I think. The reality is that most of language is regurgitating the 99% to communicate an internal state in a very compressed form. That 1%, though, may be the magic that makes us human. We create net-new information unseen in the corpus.

7 comments · 2 top-level

crazygringo 7mo ago · 3 in thread

> the overwhelming majority of input it has in fact seen somewhere in the corpus it was trained on.

But it thinks just great on stuff it wasn't trained on.

I give it code I wrote that is not in its training data, using new concepts I've come up with in an academic paper I'm writing, and ask it to extend the code in a certain way in accordance with those concepts, and it does a great job.

This isn't regurgitation. Even if a lot of LLM usage is, the whole point is that it does fantastically with stuff that is brand new too. It's genuinely creating new, valuable stuff it's never seen before. Assembling it in ways that require thinking.

rustystump (OP) 7mo ago

I think you may think too highly of academic papers. More to the point, they often still only have that 1% in there.

crazygringo 7mo ago

I think you're missing the point. This is my own paper and these are my own new concepts. It doesn't matter if the definitions of the new concepts are only 1% of the paper; the point is that they are the concepts I'm asking the LLM to use, and they are not in its training data.

zeroonetwothree 7mo ago

I think it would be hard to prove that it's truly so novel that nothing similar is present in the training data. I've certainly seen in research that it's quite easy to miss related work even with extensive searching.

the_pwner224 7mo ago · 2 in thread

Except it's more than capable of solving novel problems that aren't in the training set and aren't a close match to anything in the training set. I've done it multiple times across multiple domains.

Creating complex Excel spreadsheet structures comes to mind; I just did that earlier today, and with plain GPT-5, not even -Thinking. Sure, maybe the Excel formulas themselves are a "98% match" to training data, but it takes real cognition (or whatever you want to call it) to figure out which ones to use, how to use them appropriately for a given situation, how to structure the spreadsheet, etc.

rustystump (OP) 7mo ago

I think people confuse "novel to them" with "novel to humanity". Most of our work is not so special.

the_pwner224 7mo ago

And what % of humans have ever thought things that are novel to humanity?
