I still hold the opinion that we’re going to need to move to spiking neural network (SNN) models in the future to keep growing the networks. Spiking networks require lots of storage, but far less compute. They also propagate additional information in the _timing_ of the spikes, not just the values. There is a lot of low-hanging fruit in SNNs, and I think people are still trying to copy biological systems too much.
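To make the timing point concrete, here is a minimal leaky integrate-and-fire (LIF) neuron, the classic SNN building block. This is an illustrative sketch only; the parameter values are arbitrary choices, and real SNN work uses dedicated simulators.

```python
def lif_spike_times(input_current, threshold=1.0, leak=0.9, dt=1):
    """Simulate one LIF neuron; return the timesteps at which it spikes."""
    v = 0.0               # membrane potential
    spikes = []
    for t, i in enumerate(input_current):
        v = leak * v + i  # leaky integration of the input current
        if v >= threshold:
            spikes.append(t * dt)  # the information lives in *when* spikes occur
            v = 0.0                # reset after firing
    return spikes

# A stronger input drives earlier and more frequent spikes: the signal's
# magnitude is encoded in spike timing rather than a continuous activation.
weak = lif_spike_times([0.3] * 20)
strong = lif_spike_times([0.6] * 20)
```

Note also the compute profile: between spikes the neuron does almost nothing, which is where the "lots of storage, far less compute" trade-off comes from.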
Unfortunately, the main issue with SNNs is that no one has figured out a way to train them as effectively as ANNs.
As someone just trying to learn more about the implications of new research, I find myself resorting to /r/machinelearning, or even twitter threads, to get timely and informed discussions. That's a shame, given what HN sets out to be.
Maybe I'm expecting too much of HN, but I've seen these same two top level comments under myriad ML posts.
Sorry for the meta-discussion that's gotten us further away from this really remarkable paper.
One way or another we need a 1000x increase in efficiency to be able to run these models on edge hardware with full privacy and outside the control of the big corporations.
Funny that Gary Marcus is pleading on Twitter to get Dall-E 2 access in order to formulate his response. He isn't getting access yet. https://twitter.com/GaryMarcus/status/1513215530366234625
That kind of gate-keeping is possible because the costs of training and running inference on these models are too high today.
There is no chance that consumers or edge devices will be able to run these models locally in the near future; data is going to have to be fed back into the cloud.
Is this fundamental, or just a problem with mapping these models to our current serially-bottlenecked compute architectures? Could a move to “hyperconverged infrastructure in-the-small” — striping DRAM or NVMe and tiny RISC cores together on a die, where each CPU gets its own storage (or, you might say, where each small cluster of storage cells has its own tiny CPU attached), such that one stick has millions of independent+concurrent [+slow+memory-constrained] processors — resolve these difficulties?
I'm extremely optimistic about how transformers can recursively speed up progress in multiple areas of science. Transformers are reaching a point where they can demonstrate reasoning abilities within the ballpark of what you might expect from a human. For certain qualities, they far exceed what any human is capable of; one of those areas is depth of knowledge. Transformers (e.g. RETRO) can incorporate a library of knowledge far larger than any human can. Soon we will improve and harness this ability to the point where it may be pointless to create a scientific hypothesis without first "consulting" a large language model that is able to process the entire library of scientific publications.
GPT-3-type models are very good at selecting for arbitrary qualities from among a list of options. Generating a list of 10 potential answers, then running prompts on the candidates to select for quality, accuracy, style, and so forth resembles the cyclic formulation of ideas in humans. The process used to generate essays and articles - draft, edit, revise, simplify, repeat until satisfied - can be implemented trivially. Those processes will transfer to larger models, and approaches like RETRO reduce resource requirements by orders of magnitude.
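The draft-edit-revise loop really is trivial as control flow. A sketch, where `llm` and `satisfied` are hypothetical stand-ins for model calls (a real implementation would prompt an actual LLM at each step, and might have the model judge its own output):

```python
def llm(prompt):
    # Placeholder for an LLM call; here it just appends a revision marker
    # so the loop is runnable without a model.
    return prompt + " [revised]"

def satisfied(text, rounds_so_far):
    # Placeholder quality check; a real system might prompt the model to
    # critique the draft, or apply style rules.
    return rounds_so_far >= 3

def write_essay(topic, max_rounds=5):
    draft = llm(f"Write an essay about {topic}.")
    for round_num in range(1, max_rounds + 1):
        if satisfied(draft, round_num):
            break
        draft = llm(f"Edit and simplify the following:\n{draft}")
    return draft
```

The interesting engineering questions are all inside `satisfied` - deciding when to stop revising is itself a selection prompt.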
"Cognitive architecture" seems like an accurate descriptor for this use of multiple models plus logic layers in many-shot, many-model development.
It may not be human-level with zero-shot output, but how many humans produce human-level work in their stream of consciousness? The act of consideration - recursing over an idea and refining it - is achievable with these models in a way that humans can debug and tweak from cycle to cycle.
Multipass "consideration" and revision methodologies can capture almost any meta-cognitive processes used by humans, whether it's Socratic method or the AP style guide or an arbitrary jumble of rules derived from 4chan posters.
This type of methodology, doing meta-cognitive programming by linking together different models, is awesome. They're constructing low resolution imitations of brains - gpt-3 and BERT and the like can do things that no individual model can achieve. A predicate logic layer can document and explain decision history, and the other modules start to resemble something like the subconscious mind.
I think the next step in NLP will be a drastic departure from today's learning models.
The Socratic Models paper is not about “higher intelligence”; it’s about demonstrating useful behaviour purely by connecting several large models via language.
"Stochastic parrot" is a derogatory term and I've never seen anyone who actually understands the technology use that phrase unironically. If anything, it's a shibboleth for bias or ignorance.
Anyone who thinks this REALLY doesn't know how language models work. A properly trained LM will only parrot something back because of a lack of diversity in the training data. That does happen in some cases (e.g., the GPL license text or similar) but those are pretty rare cases.
People on HN seem to think this a lot, but they are just wrong.
It's the first thing anyone learns, and it's easy to do.
It's really unfortunate, but that's why you see so many on HN who dismiss new technologies in ML (especially in NLP, since everyone can understand the output - that's less true in, e.g., protein folding).
This is a pretty good insight.
> that's why you see so many on HN that dismiss new technologies in ML (especially in NLP, since everyone can understand the output
I think in NLP people also see output that matches some training data and conclude it is copying. It takes a little more thought to realize that if you asked 100 experts to write code to sort an array in Python, their code would also be very similar. That doesn't mean it was copied.
Overall this term says "limited to the intelligence of a parrot," which is false: models can solve math and coding problems, generate passable art, translate and converse in hundreds of languages, and beat us at board and card games. When was a parrot able to do that?
Neural networks can do math, but a lookup-and-memorized-values model is structurally very different from a calculator-style model; the difference between them is a matter of weights for any given architecture. Tokenizing properly for math would help, but bit-level tokenization would be best, because it would allow multimodal domains to integrate more readily (i.e., audio/video/text models could share learned features more easily than with parsed or domain-specific tokens). It's a great time to be alive.
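A toy illustration of why tokenization matters here. The merge table below is invented for the example and does not reflect any real tokenizer's rules; the point is that subword-style chunking can split the "same" digits inconsistently, while digit-level (or bit-level) tokenization gives every number a uniform, composable representation.

```python
def bpe_like(number_str, merges=("12", "34", "567")):
    """Greedily emit learned multi-digit chunks where they match (toy BPE)."""
    tokens, i = [], 0
    while i < len(number_str):
        for m in merges:
            if number_str.startswith(m, i):
                tokens.append(m)
                i += len(m)
                break
        else:
            tokens.append(number_str[i])  # no merge matched; single digit
            i += 1
    return tokens

def digit_level(number_str):
    return list(number_str)  # one token per digit, always

# "1234" -> ["12", "34"] but "2345" -> ["2", "34", "5"]: closely related
# numbers get unrelated token patterns, which makes place value and
# carrying harder to learn. Digit tokens keep the structure uniform.
```

Bit-level tokenization pushes the same idea further down, to a vocabulary every modality shares.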
To me, it is more evidence of "stochastic parrot" behavior: the model has seen most of the math material available on the internet, and even with significant computational power it can solve only 58% of elementary-school-level questions, probably the ones with clear examples in the training data, and can't generalize beyond them.
The process kinda goes like this -
Think of ten answers to this question: blah blah blah
From these ten answers, which are the best 3?
Of the three answers, which is the best?
Revise and edit the best answer to be simpler or more understandable.
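The steps above can be sketched as a prompt chain. `ask` and `score` are hypothetical stand-ins for single LLM calls (here they return canned answers and a trivial ranking so the control flow is runnable); a real system would send each prompt to a model and have the model do the judging.

```python
def ask(prompt, n=1):
    # Placeholder: a real system would sample n completions from an LLM.
    if n > 1:
        return [f"answer {i}" for i in range(n)]
    return prompt.splitlines()[-1] + " (revised)"

def score(answer):
    # Placeholder ranking; a real system would prompt the model to judge
    # each candidate for quality, accuracy, style, etc.
    return len(answer)

def funnel(question):
    ten = ask(f"Think of ten answers to this question: {question}", n=10)
    top3 = sorted(ten, key=score, reverse=True)[:3]  # best 3 of the 10
    best = max(top3, key=score)                      # best of the 3
    return ask(f"Revise and edit the best answer to be simpler:\n{best}")
```

Each stage is just another prompt, which is what makes the cycle cheap to debug and tweak.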
Prompt engineering is a nascent field, and we haven't seen nuanced or sophisticated use of the tool yet. Most of the metrics reported in papers are barely better than a naive Turing test. It doesn't take much introspection to know that even humans endlessly iterate and revise their output, and the best extemporaneous speech doesn't match well curated and edited material. It shouldn't surprise us that similar editing and revision processes will benefit transformer output.