Nvidia Launches a 100kb text-to-image model called Perfusion

(research.nvidia.com)

85 points · enamya · 2y ago · 14 comments

eis · 2y ago

It's not a 100kb model. It's 100kb config files for a several GB model. A small trained layer to stick on top of the real model for fine tuning.

echelon · 2y ago

This looks like something between fine tuning a top layer and a zero shot approach.

This is probably what future voice models will begin to look like as they begin to capture prosody and other fine characteristics in a few hundred kb.

bogtog · 2y ago

Yes, although it is decently interesting that a model can be fine tuned by just tweaking a small number of weights and training for just a few minutes

eis · 2y ago

There is some meat to the story, I agree. But it's not surprising. The fine-tuning model will of course be small in file size and quick to train, because by definition it applies changes to a small subset of the main model and is trained on only a small amount of input data. You can't use the small tuning model for "Teddies" with a query that has nothing to do with Teddies. You could see these small tuning models as diff files for the main model. Depending on the user query, one can choose an appropriate diff to apply to improve the result for that specific query.

When you train a model with new inputs to fine tune you can save the weights that got changed to a separate file instead of the main file.

In other words, one can see the small tuning models as updates/patches to be applied selectively.
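The diff-file analogy can be sketched in a few lines of Python. This is only an illustration of the idea of saving changed weights separately, not how Perfusion itself stores its personalization; the names `make_patch`, `apply_patch`, `teddy_patch`, and the toy layer names are all hypothetical, and real models hold tensors rather than lists of floats.

```python
# Hypothetical sketch: keep a fine-tune as a sparse "diff" against base weights.
# Weights are plain dicts of name -> list of floats for illustration only.


def make_patch(base, tuned, tol=0.0):
    """Return only the entries of `tuned` that differ from `base`."""
    patch = {}
    for name, base_values in base.items():
        changed = {
            i: new
            for i, (old, new) in enumerate(zip(base_values, tuned[name]))
            if abs(new - old) > tol
        }
        if changed:
            patch[name] = changed
    return patch


def apply_patch(base, patch):
    """Return a copy of `base` with the patched entries overwritten."""
    result = {name: list(values) for name, values in base.items()}
    for name, changed in patch.items():
        for i, new in changed.items():
            result[name][i] = new
    return result


# Toy "main model" and a fine-tuned copy in which one weight moved.
base = {"layer1": [0.1, 0.2, 0.3], "layer2": [1.0, 2.0]}
tuned = {"layer1": [0.1, 0.25, 0.3], "layer2": [1.0, 2.0]}

teddy_patch = make_patch(base, tuned)      # small file, saved separately
restored = apply_patch(base, teddy_patch)  # applied only when the query calls for it
```

The patch holds a single changed value, while the base weights stay untouched and can be shared by any number of such patches.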

brianjking · 2y ago

Isn't this just another method like a LoRA, as we've already seen in Stable Diffusion?
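For context on why low-rank edits stay small: a rank-r update to a weight matrix with `rows` by `cols` entries can be stored as two thin factors, so it needs r * (rows + cols) values instead of rows * cols. The sketch below shows the rank-one case in plain Python. It is a generic illustration, not Perfusion's or LoRA's actual implementation; the helper names and the 768 by 320 example dimensions are assumptions chosen for the arithmetic.

```python
# Hypothetical sketch of a rank-one weight edit: W' = W + outer(u, v).
# Only u and v need to be stored alongside the unchanged base matrix W.


def rank_one_update(weights, u, v):
    """Return weights + outer(u, v) without modifying the input."""
    return [
        [w + u_i * v_j for w, v_j in zip(row, v)]
        for row, u_i in zip(weights, u)
    ]


def stored_values(rows, cols, rank=1):
    """Number of values needed to store a rank-`rank` update as two factors."""
    return rank * (rows + cols)


W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
u = [1.0, 2.0]        # one value per output row
v = [0.5, 0.0, -1.0]  # one value per input column

W_edited = rank_one_update(W, u, v)

# Assumed dimensions, for scale only: full matrix vs. rank-one factors.
full_size = 768 * 320
patch_size = stored_values(768, 320)
```

With those assumed dimensions, the full matrix has 245,760 entries while the rank-one factors hold 1,088.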

GaggiX · 2y ago

@dang very misleading title and editorialized

Of course there is no 100kb text-to-image model

Terretta · 2y ago

From the white paper on arXiv:

“This allows runtime-efficient balancing of visual-fidelity and textual-alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art.”

https://arxiv.org/abs/2305.01644

It's one of those sentences that if you know what it means, you know what it means. That said, the title needs the word "personalization" inserted before the word model, e.g.:

Nvidia intros 100kb text-to-image personalization model called Perfusion

codemiscreant · 2y ago

* a several-GB model, on top of which a small amount of subject-specific training can be performed on an archetype, augmenting that large model

There don't seem to be any code or runtime examples.

thebruce87m · 2y ago

Yes, you need the pretrained model. BUT: for embedded applications, you could put that into metal and have the 100kb in flash which could open up some possibilities.

underdeserver · 2y ago

Maybe the title should be "Key-Locked Rank One Editing for Text-to-Image Personalization", per HN guidelines.

redox99 · 2y ago

Very misleading title

glass-z13 · 2y ago

Very misleading title
