But this thread is about misuse of the term as applied to the weights package. Those of us who know what open source means should not keep diluting the term by applying it to these LLMs.
It's just like how, even for truly open source software, you still need to bring your own hardware to run it on.
But we don't actually know all that much about how language really works, for all the resources we spend on linguistics - as the old IBM joke about AI goes, "quality of the product increases every time we fire a linguist" (which is to say, we consistently get better results by throwing "every written word known to man" at a blank model than we do by trying to construct things from our understanding).
All that said, just because we're taking a different, and quite possibly slower or less compute-efficient, route doesn't mean we can't get to AGI this way.
No, we can't few-shot it, and we don't get there faster (but we develop a lot of other capabilities along the way). We train on a lot more data: the human brain, unlike an LLM, keeps training on all that data during the same processes it uses for "inference," and it receives sensory data estimated on the order of a billion bits per second. By the time we start using language we've trained on an enormous amount of data (the 15 trillion tokens, at ~17 bits per token given Llama 3's ~128k vocabulary, that Llama 3 was trained on amount to something like a few days of human sense data). Humans are simply trained on, and process, vastly richer multimodal data instead of text streams.
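The "a few days" comparison above is easy to sanity-check. A minimal back-of-envelope sketch, taking the thread's own figures as assumptions (the ~1 Gbit/s sensory-bandwidth estimate, a 15T-token corpus, and a ~128k-token vocabulary giving roughly 17 bits of raw entropy per token):

```python
import math

# All figures below are the thread's assumptions, not measurements.
tokens = 15e12                       # Llama 3 pretraining corpus, ~15T tokens
vocab_size = 128_000                 # Llama 3 vocabulary, ~128k entries
bits_per_token = math.log2(vocab_size)   # ~17 bits of raw entropy per token

sensory_bits_per_sec = 1e9           # the ~1 Gbit/s human sensory estimate

corpus_bits = tokens * bits_per_token
seconds_of_sense_data = corpus_bits / sensory_bits_per_sec
days = seconds_of_sense_data / 86_400

print(f"{bits_per_token:.1f} bits/token, corpus ~ {corpus_bits:.2e} bits")
print(f"~ {days:.1f} days of sensory input at 1e9 bits/s")
```

At these numbers the whole corpus works out to roughly three days of sensory input, which is where the "a few days" claim comes from; note the raw-entropy figure ignores that text is far more compressed than raw sense data, so this is an upper bound on the comparison in tokens' favor.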