
ValdikSS 11h ago

That's why LLMs will eventually be used only for the initial interaction with the user in their own language, to prepare the data for a specialized model.

Imagine face recognition working like a text chat, where the PC gets a frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".


FeepingCreature 8h ago

That's actually how vision language models already work, pretty much.

wongarsu 5h ago

And there's a reason nobody uses them for face recognition.

Vision language models are an incredible achievement in generality and usability, but they pay a hefty price in fidelity and speed.

stingraycharles 7h ago

Huh? The images are tokenized in the same way language is, and everything is just fed into one single model, not multiple smaller expert models.

The image gets split into small patches (e.g. 4x4 pixels) and each of those is assigned a token, similar to how text is broken up into tokens. The whole thing is then fed into a single model.
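
A minimal sketch of the patch-splitting step described above, in plain Python. The function name `patchify`, the 8x8 test image, and the 4x4 patch size are illustrative assumptions, not taken from any particular model. In many vision transformer designs each flattened patch is then linearly projected into an embedding vector rather than looked up as a discrete token ID; that step is left out here.

```python
def patchify(image, patch):
    """Split an H x W grid of pixels into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened into a list of pixels.
    (Hypothetical helper, for illustration only.)
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ]
            patches.append(tile)
    return patches


# Made-up 8x8 RGB image: each pixel is an (r, g, b) tuple encoding its position.
image = [[(x, y, 0) for x in range(8)] for y in range(8)]
patches = patchify(image, 4)  # 4 patches of 16 pixels each
```

Each entry of `patches` plays the role of one "image token" in the sequence that is fed to the model alongside the text tokens.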

FeepingCreature 6h ago

Yes, I'm saying

> Imagine face recognition working like a text chat, where the PC gets a frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".

that's pretty much how it works.


stingraycharles 8h ago

Do you know that MoE is a thing?

jampekka 7h ago

The experts in MoEs aren't specialized in any meaningful task sense. At the level of what we would think of as tasks, the experts are selected essentially arbitrarily, per token and per block.
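
A rough sketch of what "per token" selection can look like, assuming a top-k router that renormalizes a softmax over the chosen experts (one possible variant; others exist). The helper `top_k_route` and the logits are made up for illustration. In a real model the logits come from a learned router, and each MoE block has its own router, so the same token can be sent to different experts in different blocks.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs, where the weights are a
    softmax over the selected experts only. (Hypothetical helper.)
    """
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    peak = max(router_logits[i] for i in ranked)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


# Made-up router scores for two tokens over four experts.
token_logits = [
    [2.0, 0.1, -1.0, 0.5],
    [-0.3, 1.5, 1.4, 0.0],
]
routes = [top_k_route(logits, 2) for logits in token_logits]
```

Here the first token is routed to experts 0 and 3 and the second to experts 1 and 2: the choice depends on each token's router scores, not on any notion of the overall task.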

stingraycharles 7h ago

It’s unsupervised, yes, but “unspecialized in any meaningful task sense” is incorrect; that’s the whole point. It’s just not in the sense of “this is a legal expert, this is a software developer”.

