Microsoft Kosmos-1: A Multimodal Large Language Model

(github.com)

228 points · solarist · 3y ago · 104 comments

ducktective3y ago· 25 in thread

It can even solve IQ tests...I mean, how much further are we moving the goal post?

Is there a model that can solve differential equations symbolically and numerically? Most of modern engineering just boils down to diff.eqs whether ordinary or partial. It's our current best method to reason about stuff and control them.

bootsmann3y ago

> It can even solve IQ tests...I mean, how much further are we moving the goal post?

The problem with tests like this is that when a model is trained on the existing big datasets (Common Crawl etc.), chances are the test is already in the input, so the validation is not proper. It's the same thing with all the "AI beats SAT" headlines. The exercises for those very tests exist all over the internet already.
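
One rough way to probe the contamination concern above is to check how many of a test item's word n-grams also show up verbatim in the training text. The sketch below is only an illustration of that idea: the function names, the 5-gram size, and the toy corpus are all assumptions, not anything from the Kosmos-1 paper.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item, corpus_docs, n=5):
    """Fraction of the test item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical toy data: one question that leaked into the corpus, one that did not.
corpus = ["practice set: which shape completes the pattern in the grid below answer b"]
leaked = "which shape completes the pattern in the grid below"
fresh = "choose the tile that continues the sequence of rotated triangles"

leaked_score = overlap_ratio(leaked, corpus)
fresh_score = overlap_ratio(fresh, corpus)
print(leaked_score, fresh_score)
```

A high ratio only flags verbatim leakage; paraphrased or translated copies of a test would slip through a check like this.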

nayroclade3y ago

It's well documented that these models can solve variations of questions that are not found anywhere in their training set, and even entirely novel problems invented by prompters. Not with 100% success, but they can do it at a rate far better than chance, so the idea that they're pulling responses from their training data is simply not correct.

2 more replies

espadrine3y ago

> I mean, how much further are we moving the goal post?

Look at it this way: humans don’t have BPE-encoded text as input to their brain. It is ALL visual input. For AGI, you would at least need to add audio input as well. And be driven by action and reward.

The learning capabilities of the brain are currently beyond the processing capabilities of current architectures. Just the notion of a model receiving only pixel data that contains a question and being able to output voice data that produces a correct answer, using no partial model trained on another corpus, is probably not tractable without significant improvements.

But the models can be very useful without being AGI!

sillysaurusx3y ago

AGI is closer to tokenization than you might think. I realized this recently when trying to do audio prediction.

There was recently a project called riffusion which generates spectrograms, then recovers audio from the spectrograms.

You might be tempted to apply this to predict speech. But speech isn’t like music. We’re communicating in language, using a sequence of tones. It’s why most speech codecs use linear predictive coding. Predicting the waveforms won’t get you anywhere; no semantic understanding of language.

So the next step up is to divide speech into a series of tones, and try to predict those sounds rather than raw waveforms.

Except… that’s literally tokenization. And there’s some evidence that this is precisely what our brains are doing.

2 more replies

pmontra3y ago

> It is ALL visual input.

And sound, taste, smell, touch.

1 more reply

didntreadarticl3y ago

It's crap at the visual Raven IQ test though: it scores 22% vs an algorithm that takes random guesses scoring 17%.

thenaturalist3y ago

I'd be cautious with such general statements given the rapid pace of development in this area.

Benchmark shelf lives aren't that long.

You omitted the fact that tuning bumped it to 26% vs random.

Sure, questionable what effort is involved in that step, but at the same time, that hints to me that tuning will be the new baseline within the next 12-24 months.
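
To put the 22% / 26% / 17% figures quoted above side by side, here is a small arithmetic sketch. It assumes the roughly 17% random baseline comes from six answer candidates per item (1/6); the helper name is made up for illustration.

```python
def lift_over_chance(accuracy, chance):
    """Return (gain over chance as a fraction, share of the headroom above chance captured)."""
    absolute = accuracy - chance
    normalized = absolute / (1.0 - chance)
    return absolute, normalized


chance = 1 / 6  # assumption: six candidate answers per item, matching the ~17% quoted
zero_shot = lift_over_chance(0.22, chance)
tuned = lift_over_chance(0.26, chance)

print(f"chance baseline: {chance:.1%}")
print(f"zero-shot: {zero_shot[0] * 100:.1f} points above chance, {zero_shot[1]:.1%} of headroom")
print(f"tuned:     {tuned[0] * 100:.1f} points above chance, {tuned[1]:.1%} of headroom")
```

Normalizing by the headroom above chance makes it easier to compare against benchmarks with a different number of answer options.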

1 more reply

nmarinov3y ago

Semi related, there's a (pretty good) course at OMSCS where the main project is building an agent to solve RPM problems: https://lucylabs.gatech.edu/kbai/spring-2023/project-overvie...

And quite a lot of papers about that: https://scholar.google.com/scholar?q=%22raven%27s+progressiv...

kalium-xyz3y ago

Bet you 5 bucks I can train one that gets 100%. Just gotta train it on the Raven's answer key.

Y_Y3y ago

It's pretty big by any standards, but you may find the work of Gradshteyn and Ryzhik solves this problem nicely.

Jevon233y ago

>how much further are we moving the goal post?

My goalpost for AGI is when Microsoft can fire their entire engineering staff, replace them with AI, and not notice any decrease in productivity or quality of output.

This test is empirically verifiable (in principle). No need to argue over whether the AI scoring X% on Y assessment task is “truly” impressive or not.

staticautomatic3y ago

You mean that isn’t the Teams origin story?

1 more reply

gfodor3y ago

You’re confusing the goal - the goal here isn’t about the finish line but the point where people all concede that the finish line is actually reachable without any major, presently unthinkable advances.

1 more reply

Tepix3y ago

It wasn't very good at the IQ test. But yes, it is promising.

"Although there is still a large performance gap between the current model and the average level of adults, KOSMOS-1 demonstrates the potential of MLLMs to perform zero-shot nonverbal reasoning by aligning perception with language models."

lumost3y ago

ChatGPT does a great job on symbolic manipulation. You have to prompt it to show derivations, however, rather than discussing the topic at a high level.

scotty793y ago

Yeah, that could be good. I think LLMs will start to be really useful when they start to do math at human level. When this happens, the sky is the limit.

p1esk3y ago

What is human level for math? Terence Tao? Average American?

1 more reply

trobertson3y ago

There's no goalpost to move. Psychologists have been saying that IQ tests are of limited scope and utility for decades. Specifically, it is widely agreed that an IQ test is not a valid way "to assess intelligence in a broader sense".

https://en.wikipedia.org/wiki/Intelligence_quotient#Validity...

seydor3y ago

I prefer not focusing on games and benchmarks. Hopefully we'll get to robotics soon.

brookst3y ago

Arnold was great in that documentary!

moffkalast3y ago

Roger roger.

RivieraKid3y ago

What goalpost specifically?

ducktective3y ago

Solving IQ tests which measure quantitative reasoning.

1 more reply

mhh__3y ago

Writing down the equations is the task for the AI.

pillarofkiller3y ago

yeah there is a model that can do differential equations

josalhor3y ago· 10 in thread

The examples in the paper are pretty impressive. There is an example of a Windows 11 dialog image. The computer can figure out which button to press given the desired outcome of the user. If one were to take this model and scale it, I can see an advanced bot in <5 years navigating the web and doing work based on a text input of a human purely by visual means. Interesting times.

reset-password3y ago

I've been following tech long enough to know that as soon as the computer can figure out which button to press it's only going to click on ads, I guarantee it.

sebzim45003y ago

Isn't it trivial to make a computer click on ads though? Just run selenium, apply the filtering rules from adblock and then click on a random element which would be blocked.

1 more reply

joshspankit3y ago

and then that’s going to be met with MS making it “impossible” for bots to automate clicking on ads which will have the unintended consequence of making it harder to use for power users.

duringwork123y ago

That explains Google's UX-understanding AI. It can tell in words what the next step on a form is.

IanCal3y ago

You might be interested in this: https://www.adept.ai/blog/act-1

moritonal3y ago

We can't stop it, but giving an AI unbridled access to the Internet is a terrible idea. Whether it's a misphrased question or a clever prompt hack, entire sites will be crushed by the sheer superhuman performance of it.

Hackernews will be just robots chatting to each other nudging towards the latest product-hunt.

2 more replies

freakynit3y ago

Holy tckin' cow!!! Is this real?

Captchas be damned now. Beating AI with AI. What a time to be alive.

ren_engineer3y ago

this seems like the future to me, a huge chunk of work will be able to be done by just talking to your computer and then automating the task. Society is really going to need to adapt, knowledge workers being replaced will be as big a change as the industrial revolution replacing many manual laborers

what's interesting is how will these systems be maintained when all the junior tier engineering work is replaced by AI? Companies don't like hiring junior engineers now, will be an even bigger gap before a junior engineer becomes net productive now. Plus people building stuff using AI without understanding how things work under the hood. Seems ripe for some 40K tier situation where we have tech priests running systems that nobody knows how to build from scratch anymore

novaRom3y ago

Talking to your computer? That won't last long before another disruptive change happens. What about computers taking over the decisions you think you are free to make right now? It is all moving fast, accelerating actually.

ivanvanderbyl3y ago

I worked at a company a few years back that used YOLO trained on billions of UI screenshots to navigate the UI of any desktop application based on plain English instructions. It already exists.

PaulHoule3y ago· 3 in thread

I like this feature they are working on

https://arxiv.org/abs/2212.10554

as I'd say the most obvious limitation of today's transformers is the limited attention window. If you want ChatGPT to do a good job of summarizing a topic based on the literature the obvious thing is to feed a bunch of articles into it and ask it to summarize (how can you cite a paper you didn't read?) and that requires looking at maybe 400,000 - 4,000,000 tokens.

Similarly there is a place for a word embedding, a sentence embedding, a paragraph embedding, a chapter embedding, a book embedding, etc. but these have to be scalable and obviously the book embedding is bigger but I ought to be able to turn a query into a sentence embedding and somehow match it against larger document embeddings.
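
A minimal sketch of the "match a query embedding against larger document embeddings" idea above. It uses the simplest possible assumption: a document embedding is the mean of its sentence embeddings, so it stays the same size as a sentence embedding (which sidesteps rather than answers the point that a book embedding ought to be bigger). The 3-dimensional vectors and document names are made up; a real encoder would produce them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Average equal-length vectors into one fixed-size embedding."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical sentence embeddings per document.
documents = {
    "doc_search": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "doc_cooking": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
doc_embeddings = {name: mean_pool(sents) for name, sents in documents.items()}

query = [1.0, 0.0, 0.0]  # hypothetical embedding of the query sentence
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine(query, doc_embeddings[name]),
    reverse=True,
)
print(ranked)
```

Mean pooling throws away a lot of structure as documents get longer, which is part of why this does not scale from sentences to books without something smarter.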

lupire3y ago

That seems an infinite road. Humans don't need to memorize every token in every context in order to learn. They extract patterns online as they go.

p1esk3y ago

> feed a bunch of articles into it and ask it to summarize

A better way (that's how humans do it) is to first summarize each article, then feed the summaries to get an overview of the topic. This way there's no need to expand the attention window.
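
The two-stage approach described above can be sketched as follows. The `summarize` function here is a stand-in that just truncates to the first few words; that is an assumption made so the example runs on its own, and a real pipeline would call a language model at that point.

```python
def summarize(text, max_words=12):
    """Placeholder summarizer: keep the first `max_words` words.
    Assumption: a real system would call a language model here."""
    return " ".join(text.split()[:max_words])


def summarize_corpus(articles, max_words=12):
    """Stage 1: summarize each article. Stage 2: summarize the joined summaries."""
    per_article = [summarize(a, max_words) for a in articles]
    overview = summarize(" ".join(per_article), max_words)
    return per_article, overview


articles = [
    "Transformers attend over a fixed window of tokens so long inputs must be truncated or split before processing",
    "BM25 ranks documents by term frequency and inverse document frequency with length normalization applied to each document",
]
per_article, overview = summarize_corpus(articles, max_words=8)
print(per_article)
print(overview)
```

Only the control flow is meaningful here: each call sees a bounded amount of text, so no single step needs a larger attention window, but anything lost in stage 1 cannot be recovered in stage 2.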

PaulHoule3y ago

I've thought about that one for a long time. A long time ago I was reading proceedings of TREC trying to understand why Google was so much better than the search engines I knew how to build. TREC is pretty depressing because you find that 95% of the things you might think would improve search rankings do not. Particularly before BM25 was developed people tried indexing sub documents and consolidating them and consistently struck out.

Since BERT came out there is a considerable literature of people struggling mightily to combine transformer representations of document parts into a whole that convinces me that one could spend a few lifetimes pushing a bubble around underneath that rug.

I think the best argument for your case is that people seem to get along just fine with a limited short term memory. I'd temper that with the observation that a person writing a summary is actually doing a multiple stage process in which their short term memory is attending to part of what they are writing, part of what they are reading, and they are building long term memory structures at the same time. So there is a lot going on.

In the sense that abstracts work well for information retrieval and that many of them would fit in the GPT attention window or only be a little bigger you could make the case that a fixed-size structure could be highly useful for IR.

On the other hand, many documents, such as scientific papers, are considerably bigger than the current attention window and direct summarization of a single document via transformer will still need a bigger window, more like 40,000 tokens.

A lot of things in the literature are complex, muddy, contradictory or all of the above. (Try a question like "What did Freud think about narcissism?" or "What is the clinical relevance of Bleuler's concept of ambivalence?" or "Tell me about cosmic inflation" or "What is the dark matter particle?")

Hard cases really do require matching up parts of document A with parts of document B and certainly having them in the same attention window would help an LLM do that in a natural way.

It might be completely impractical, not just because of computational scalability but possibly more fundamental scalability limits. (I'm not sure a person with a 10x bigger short term memory would really be able to solve problems better than the average person... There are transformers with a 500,000 token attention window today and they suck.)

There could be some procedure where you cut documents up into pieces in various ways, extract critical context from documents A and B and other literature and also put in the parts you want to critique against each other, or even match up different parts of the same document to do the same. Maybe a small attention window could still be used to decompose documents into knowledge graphs but it is by no means trivial to reason over a KG once you have it.

What I do know today is that I have documents >4096 tokens that I want to retrieve, cluster and classify right now and transformers were not up to the task in Feb 2023, and I am hoping for some progress soon that will help.
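
For documents longer than 4096 tokens, the usual workaround hinted at above is to cut them into overlapping windows and process each window separately. A minimal sketch, assuming the document is already tokenized into a list; the 256-token overlap is an arbitrary choice, not a recommendation.

```python
def chunk_tokens(tokens, window=4096, overlap=256):
    """Split a token list into windows of at most `window` tokens,
    with consecutive windows sharing `overlap` tokens."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Hypothetical 10,000-token document.
tokens = [f"tok{i}" for i in range(10_000)]
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

This makes each piece fit, but it does nothing for the hard cases described earlier, where part of one document has to be compared against part of another.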

2 more replies

drKarl3y ago· 3 in thread

At Microsoft:

"Hey, why don't we call our new LLM Cosmos?"

"That's taken by the Azure Cosmos DB guys."

"Damn it... how about Kosmos-1?"

Bajeezus3y ago

"Fun" fact: There is another common internal service at Microsoft called Cosmos, and it is also a database.

So now there is Cosmos, Cosmos DB, and Kosmos.

outside12343y ago

Isn't there also a batch processing system named... you guessed it... Cosmos?

5-3y ago

https://en.wikipedia.org/wiki/Kosmos_1

aegistudio3y ago· 2 in thread

Hmm... LLMs / MLLMs might be truly a unified input / output interface of a would-be AGI, I think.

Tepix3y ago

Yeah, check out the Lex Fridman podcast episode #333 around minute 52 where Andrej Karpathy talks about the OpenAI project "World of Bits" that did this.

https://youtu.be/cdiD-9MMpb0?t=3013

mkmk33y ago

And the work he's talking about: https://paperswithcode.com/paper/world-of-bits-an-open-domai...

tomp3y ago· 1 in thread

Is there a better page to link to? I cannot even see "Kosmos" on this page!

Edit: Ah, looks like this is the link to the paper: https://arxiv.org/abs/2302.14045

It was discussed yesterday: https://news.ycombinator.com/item?id=34965326

thenaturalist3y ago

Better link would have been the tweet, it includes the paper & GH repo: https://twitter.com/alphasignalai/status/1630651280019292161

RcouF1uZ4gsC3y ago· 1 in thread

I don’t trust any report of model performance from papers, unless there is a publicly accessible demo. It is way too easy to test things the model has trained on and for the model to then completely fall flat when used by people in the real world.

nwoli3y ago

The FB Galactica model is a good example of this. Sounded really promising, impressive paper, lots of weights. But when you actually tried it, it mostly produced garbage.

nl3y ago· 1 in thread

It's worth noting that this is a comparatively small model (1.6B params from memory).

It'll be interesting what capabilities emerge as they grow that model capacity.

kenjackson3y ago

That’s a good point. There’s a paper that talks about the non-linear nature of these models. That is, at some very large size they seem to show a leap in ability.

solaristOP3y ago

Paper: https://arXiv.org/abs/2302.14045

Examples: https://twitter.com/alphasignalai/status/1630651280019292161

naasking3y ago

Another one that looks even more compelling:

Multimodal Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2302.00923

By building in chain of thought and multimodal learning, this 1B parameter model beats GPT-3.5's 175B parameter model.

Karellen3y ago

Did anyone else initially read that as `Kosmos~1`, and wonder what the full name of the project was?

xfalcox3y ago

Anyone know if this will be an openly available model?
