Is there a model that can solve differential equations symbolically and numerically? Most of modern engineering boils down to differential equations, whether ordinary or partial. They're our current best method for reasoning about physical systems and controlling them.
The problem with tests like this is that when a model is trained on the existing big datasets (Common Crawl etc.), chances are the test is already in the training data, so the validation is not proper. It's the same thing with all the "AI beats SAT" headlines. The exercises for those very tests exist all over the internet already.
Look at it this way: humans don’t have BPE-encoded text as input to their brain. It is ALL visual input. For AGI, you would at least need to add audio input as well. And be driven by action and reward.
The learning capabilities of the brain are currently beyond the processing capabilities of current architectures. Just the notion of a model receiving only pixel data that contains a question and being able to output voice data that produces a correct answer, using no partial model trained on another corpus, is probably not tractable without significant improvements.
But the models can be very useful without being AGI!
There was recently a project called riffusion which generates spectrograms, then recovers audio from the spectrograms.
You might be tempted to apply this to predict speech. But speech isn’t like music. We’re communicating in language, using a sequence of tones. It’s why most speech codecs use linear predictive coding. Predicting the waveforms won’t get you anywhere; no semantic understanding of language.
So the next step up is to divide speech into a series of tones, and try to predict those sounds rather than raw waveforms.
Except… that’s literally tokenization. And there’s some evidence that this is precisely what our brains are doing.
Benchmark shelf lives aren't that long.
You omitted the fact that tuning bumped it to 26% vs random.
Sure, it's questionable how much effort is involved in that step, but at the same time, that hints to me that the tuned result will be the new baseline within the next 12-24 months.
And quite a lot of papers about that: https://scholar.google.com/scholar?q=%22raven%27s+progressiv...
My goalpost for AGI is when Microsoft can fire their entire engineering staff, replace them with AI, and not notice any decrease in productivity or quality of output.
This test is empirically verifiable (in principle). No need to argue over whether the AI scoring X% on Y assessment task is “truly” impressive or not.
"Although there is still a large performance gap between the current model and the average level of adults, KOSMOS-1 demonstrates the potential of MLLMs to perform zero-shot nonverbal reasoning by aligning perception with language models."
https://en.wikipedia.org/wiki/Intelligence_quotient#Validity...
Hackernews will be just robots chatting to each other nudging towards the latest product-hunt.
Captchas be damned now. Beating AI with AI. What a time to be alive.
What's interesting is how these systems will be maintained when all the junior-tier engineering work is replaced by AI. Companies don't like hiring junior engineers now, and there will be an even bigger gap before a junior engineer becomes net productive. Plus there will be people building stuff using AI without understanding how things work under the hood. Seems ripe for some 40K-tier situation where we have tech priests running systems that nobody knows how to build from scratch anymore.
https://arxiv.org/abs/2212.10554
as I'd say the most obvious limitation of today's transformers is the limited attention window. If you want ChatGPT to do a good job of summarizing a topic based on the literature the obvious thing is to feed a bunch of articles into it and ask it to summarize (how can you cite a paper you didn't read?) and that requires looking at maybe 400,000 - 4,000,000 tokens.
Similarly there is a place for a word embedding, a sentence embedding, a paragraph embedding, a chapter embedding, a book embedding, etc. but these have to be scalable and obviously the book embedding is bigger but I ought to be able to turn a query into a sentence embedding and somehow match it against larger document embeddings.
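To make the "match a sentence embedding against a larger document" idea concrete, here is a minimal sketch. It assumes a toy bag-of-words `embed` in place of a real model, and it scores a document by its best-matching sentence, which is only one of several possible ways to combine the pieces. All function names are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding (assumption): L2-normalized bag-of-words counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two already-normalized sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def score_document(query: str, sentences: List[str]) -> float:
    """Score a document by its single best-matching sentence."""
    q = embed(query)
    return max((cosine(q, embed(s)) for s in sentences), default=0.0)


def rank_documents(query: str, docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (document name, score) pairs, best match first."""
    scored = [(name, score_document(query, sents)) for name, sents in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Swapping `embed` for a real sentence encoder keeps the rest unchanged. The hard part described above, a single book-level embedding that still answers sentence-level queries, is exactly what this sketch sidesteps by keeping one vector per sentence.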
A better way (that's how humans do it) is to first summarize each article, then feed the summaries to get an overview of the topic. This way there's no need to expand the attention window.
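A rough sketch of that summarize-then-summarize approach, assuming a hypothetical `summarize` callable that stands in for the model call (the default `toy_summarize` just truncates and is not a real summarizer):

```python
from typing import Callable, List


def toy_summarize(text: str, max_words: int = 12) -> str:
    """Hypothetical stand-in for a model call: keep the first max_words words."""
    return " ".join(text.split()[:max_words])


def summarize_corpus(
    articles: List[str],
    summarize: Callable[[str], str] = toy_summarize,
    batch_size: int = 4,
) -> str:
    """Summarize each article, then repeatedly summarize batches of summaries."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each round shrinks the list")
    if not articles:
        return ""
    # First pass: one short summary per article, each call sees a single article.
    level = [summarize(article) for article in articles]
    # Later passes: merge batches of summaries until one overview remains.
    while len(level) > 1:
        level = [
            summarize(" ".join(level[i:i + batch_size]))
            for i in range(0, len(level), batch_size)
        ]
    return level[0]
```

No single call ever sees more than `batch_size` summaries, so the window never has to grow. Whatever the first pass drops is gone for good, though; with the truncating toy above, only the first article survives the merge.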
Since BERT came out there has been a considerable literature of people struggling mightily to combine transformer representations of document parts into a whole, which convinces me that one could spend a few lifetimes pushing a bubble around underneath that rug.
I think the best argument for your case is that people seem to get along just fine with a limited short term memory. I'd temper that with the observation that a person writing a summary is actually doing a multiple stage process in which their short term memory is attending to part of what they are writing, part of what they are reading, and they are building long term memory structures at the same time. So there is a lot going on.
In the sense that abstracts work well for information retrieval and that many of them would fit in the GPT attention window or only be a little bigger you could make the case that a fixed-size structure could be highly useful for IR.
On the other hand, many documents, such as scientific papers, are considerably bigger than the current attention window and direct summarization of a single document via transformer will still need a bigger window, more like 40,000 tokens.
A lot of things in the literature are complex, muddy, contradictory or all of the above. (Try a question like "What did Freud think about narcissism?" or "What is the clinical relevance of Bleuler's concept of ambivalence?" or "Tell me about cosmic inflation" or "What is the dark matter particle?")
Hard cases really do require matching up parts of document A with parts of document B and certainly having them in the same attention window would help an LLM do that in a natural way.
It might be completely impractical, not just because of computational scalability but possibly more fundamental scalability limits. (I'm not sure a person with a 10x bigger short term memory would really be able to solve problems better than the average person... There are transformers with a 500,000 token attention window today and they suck.)
There could be some procedure where you cut documents up into pieces in various ways, extract critical context from documents A and B and other literature and also put in the parts you want to critique against each other, or even match up different parts of the same document to do the same. Maybe a small attention window could still be used to decompose documents into knowledge graphs but it is by no means trivial to reason over a KG once you have it.
What I do know today is that I have documents >4096 tokens that I want to retrieve, cluster and classify right now and transformers were not up to the task in Feb 2023, and I am hoping for some progress soon that will help.
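For what it's worth, the simplest version of "cut documents up into pieces" is a sliding window with overlap. A minimal sketch, assuming the document is already a list of tokens (real BPE token counts will differ from a whitespace split) and using a made-up default of 256 overlapping tokens:

```python
from typing import List


def chunk_tokens(tokens: List[str], window: int = 4096, overlap: int = 256) -> List[List[str]]:
    """Split tokens into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens so that context cut at a
    boundary still appears whole in at least one chunk.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Each chunk can then be embedded, classified or summarized on its own. This does nothing about matching part of document A against part of document B; it only gets the pieces under the window limit.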
"Hey, why don't we call our new LLM Cosmos?" "That's taken by the Azure Cosmos DB guys." "Damn it... how about Kosmos-1?"
So now there is Cosmos, Cosmos DB, and Kosmos.
Edit: Ah, looks like this is the link to the paper: https://arxiv.org/abs/2302.14045
It was discussed yesterday: https://news.ycombinator.com/item?id=34965326
It'll be interesting to see what capabilities emerge as they grow the model's capacity.
Multimodal Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2302.00923
By building in chain of thought and multimodal learning, this model with under 1B parameters beats GPT-3.5's 175B parameter model on the ScienceQA benchmark.