Efficient high-resolution image synthesis with linear diffusion transformer

(nvlabs.github.io)

221 points · Vt71fcAqt7 · 1y ago · 44 comments

23 comments · 9 top-level

cube2222 · 1y ago · 9 in thread

This looks like quite a huge breakthrough, unless I'm missing something?

~25x faster performance than Flux-dev, while offering comparable quality in benchmarks. And visually the examples (surely cherry-picked, but still) look great!

Especially since with GenAI the best way to get good results is to just generate a large number of them and pick the best (imo). Performance like this will make that much easier/faster/cheaper.

Code is unfortunately "(Coming soon)" for now. Can't wait to play with it!

godelski · 1y ago

  > surely cherry-picked

As someone who works in generative vision, this is one of the most frustrating aspects (especially for those with fewer GPU resources). There's been a silent competition for picking the best images and not showing random results (even when there are random results they may be a selected batch). So it is hard to judge actual quality until you can play around.

Also, I'm not sure what laptop that is, but they say 0.37s to generate a 1024x1024 image on a 4090. They also mention that it requires 16GB of VRAM. That laptop looks like an MSI Titan, which has a 4090, and correct me if I'm wrong, but I think the 4090 is the only mobile card with 16GB?[0] (I know most desktop cards have 16.) The laptop demo takes 4s to generate a 1024x1024 image, but the mobile chips are cut down quite a bit compared to the desktop cards.[1]

I wonder if that's with or without TensorRT

[0] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

[1] https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...

bemmu · 1y ago

0.37s is only 11x away from realtime 30fps. I wonder if that will enable some cool new popular application for it besides batch image generation.
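The 11x figure follows directly from the numbers quoted in the thread. A quick sketch of the arithmetic, taking the 0.37s latency and the 30fps target at face value (variable names are illustrative):

```python
# Back-of-the-envelope check of the "11x away from realtime" figure.
seconds_per_image = 0.37   # reported 1024x1024 latency on a 4090 (from the thread)
target_fps = 30            # "realtime" target named in the comment

frame_budget = 1 / target_fps                        # seconds available per frame
speedup_needed = seconds_per_image / frame_budget    # how much faster it must get
current_rate = 1 / seconds_per_image                 # images per second today

print(f"frame budget:   {frame_budget * 1000:.1f} ms")
print(f"speedup needed: {speedup_needed:.1f}x")
print(f"current rate:   {current_rate:.2f} images/s")
```

This only covers raw generation latency; it says nothing about whether consecutive frames would look consistent.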

noduerme · 1y ago

Truthfully, I've had astonishing results from Stable Diffusion 1.4 on an M1 Mac, given the right inputs ...enough to throw my hands up and declare it a sort of magic (except for the presence of Getty Images watermarks randomly scattered around my results).

Nonetheless, as an art director, nothing I'd put into production. I guess that's because what I'm focused on is tickling the client base with something original.

zamadatix · 1y ago

The GeForce RTX 3080 Mobile and GeForce RTX 3080 Ti Mobile also have 16 GB versions as noted directly above the linked section on [0].

Lerc · 1y ago

>This looks like quite a huge breakthrough, unless I'm missing something?

Looking at their methodology, it seems like it's more of an accumulation of existing good ideas into one model.

If it performs as well as they say, perhaps you can say the breakthrough is discovering just how much can be gained by combining recent advances.

It's sitting on just the edge of sounding too good to be true to me. I will certainly be pleased if it holds up to scrutiny.

liuliu · 1y ago

If you read the benchmark more closely, it seems to be slightly worse than FLUX [dev] on prompt adherence and quality. However, the best approach is to evaluate the results oneself, and the track record of PixArt Sigma (from the same author?) is pretty good!

Archit3ch · 1y ago

If you generate 25x more images, you can afford to cherry-pick.

Lerc · 1y ago

That transfers computer time to user time. It's great when you want variations, less so when you want precision and consistency. Picking the best image tires the brain quite quickly: you have to take the at-a-glance quality into account without letting it override the quality of the details.

I'd be curious to see how a vision model would do if it were fine-tuned to select the image that best matches given criteria.

It's possible that you could do o1-style training to build a final-stage auto-cherry-picker.

cube2222 · 1y ago

It would be interesting to have benchmarks that take this into account (maybe they already do or I’m misunderstanding how those benchmarks work). I.e. when comparing quality between two different models of vastly different performance, you could be doing best-of-n in the faster model.
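The best-of-n comparison suggested here can be sketched as a toy simulation. Everything numeric below is a made-up assumption for illustration: image quality is modelled as a single Gaussian score, the slow model is given a higher mean than the fast one, and the 25x speed ratio is taken from the claim at the top of the thread. A real benchmark would need an actual scorer, which is the hard part.

```python
import random

def best_of_n(sample_score, n):
    """Draw n quality scores and keep the best one."""
    return max(sample_score() for _ in range(n))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical score distributions (not measured values).
def slow_model():
    return rng.gauss(0.70, 0.10)   # higher average quality

def fast_model():
    return rng.gauss(0.65, 0.10)   # slightly lower average quality

speed_ratio = 25   # fast model makes ~25 images in the time the slow one makes 1
trials = 2000

# Equal compute budget: one slow sample versus the best of 25 fast samples.
slow_avg = sum(best_of_n(slow_model, 1) for _ in range(trials)) / trials
fast_avg = sum(best_of_n(fast_model, speed_ratio) for _ in range(trials)) / trials

print(f"slow model, best-of-1:  {slow_avg:.3f}")
print(f"fast model, best-of-{speed_ratio}: {fast_avg:.3f}")
```

Under these assumptions the faster model comes out ahead at equal compute, but only because the score used for picking is the same one used for judging. If a human does the picking, the selection cost moves to the user.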

cynicalpeace · 1y ago · 3 in thread

None of this means much to me unless I can actually use it. Sorta like how Sora has been totally overshadowed by Kling, Runway, Minimax.

You have to release your model in some fashion for it to be impressive.

Agentlien · 1y ago

On the subject of such high quality video synthesis: have there been any such models which are actually available online? It strikes me that for image synthesis there have been a lot of amazing local models, but I can't remember seeing anything impressive for video which can be run offline.

bick_nyers · 1y ago

CogVideoX seems to be the best offline model so far

cynicalpeace · 1y ago

The highest quality ones I mentioned are available via API or web client, but that's enough for me to be happy.

cpldcpu · 1y ago · 1 in thread

>We introduce a new Autoencoder (AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens,

Basically they compress/decompress the images more, which means they need less computation during generation. But on the flip side this should mean less variability.

Isn't this more of a design trade-off than an optimization?
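The "16× fewer latent tokens" figure in the quote is a matter of grid arithmetic. A small sketch, assuming a 1024×1024 input and one token per latent position (the quote does not say whether any extra patching is applied):

```python
def latent_tokens(image_side: int, downsample: int) -> int:
    """Number of latent positions for a square image, one token per position."""
    side = image_side // downsample
    return side * side

image_side = 1024  # assumed resolution, matching the examples in the thread

tokens_f8 = latent_tokens(image_side, 8)     # AE-F8:  128 x 128 grid
tokens_f32 = latent_tokens(image_side, 32)   # AE-F32: 32 x 32 grid
reduction = tokens_f8 // tokens_f32

print(tokens_f8, tokens_f32, reduction)      # 16384 1024 16

# How much the attention cost shrinks depends on how it scales with N.
print("quadratic attention:", reduction ** 2, "x cheaper")   # 256
print("linear attention:   ", reduction, "x cheaper")        # 16
```

Whether this counts as more compression depends on how many channels each latent position carries, which this calculation ignores.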

Lerc · 1y ago

It might not be compressing more (haven't yet looked at the paper). You can have fewer but larger tokens for the same amount of data.

It would decrease the workload by having fewer things to compare, balanced against more work per comparison. For normal N² attention that makes sense, but the page says:

> We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N) Mix-FFN

So not sure what's up there.
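One way to read the two statements together: linear attention removes the N² term, and having fewer tokens still shrinks whatever cost remains. The sketch below shows the general linear-attention trick of reordering the matrix products so that no N×N matrix is ever built. It assumes a ReLU feature map, which is one common choice; the quoted text does not say which kernel the model uses, so this is not a description of the authors' implementation.

```python
import random

def relu(m):
    return [[max(0.0, x) for x in row] for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

EPS = 1e-6  # avoids division by zero when a query row is all zeros after ReLU

def attention_quadratic(q, k, v):
    """(phi(Q) phi(K)^T) V: builds an N x N score matrix, O(N^2) work."""
    fq, fk = relu(q), relu(k)
    scores = matmul(fq, transpose(fk))                   # N x N
    out = matmul(scores, v)                              # N x d
    return [[x / (sum(s) + EPS) for x in o] for s, o in zip(scores, out)]

def attention_linear(q, k, v):
    """phi(Q) (phi(K)^T V): the same result via a d x d summary, O(N) work."""
    fq, fk = relu(q), relu(k)
    kv = matmul(transpose(fk), v)                        # d x d, independent of N
    k_sum = [sum(col) for col in zip(*fk)]               # length d
    out = matmul(fq, kv)                                 # N x d
    norms = [sum(a * b for a, b in zip(row, k_sum)) + EPS for row in fq]
    return [[x / z for x in o] for o, z in zip(out, norms)]

rng = random.Random(0)
n, d = 6, 4  # tiny sizes, just to compare the two orderings
q, k, v = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)] for _ in range(3))

a = attention_quadratic(q, k, v)
b = attention_linear(q, k, v)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(f"max difference between the two orderings: {max_diff:.2e}")
```

Both functions compute the same values; only the order of multiplication differs. In the second form the cost grows with N rather than N², so a 16× token reduction would give roughly a 16× saving in attention instead of 256×.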

amelius · 1y ago · 1 in thread

Does this finally solve the class of "6 fingers/hand" problems?

ttul · 1y ago

That problem can be fixed through careful fine-tuning, at the cost of losing some generality because the model is punished for drawing bad fingers. This new method outlined in the paper operates in a highly spatially-compressed latent space, but with more channels than previous models, so each latent pixel has 2x the information content of Flux and 8x the content of SDXL. I do wonder whether the high spatial compression means that high-resolution features like fingers will be messed up. On the other hand, the higher channel count in the latent space gives the model more detail per pixel to work with… I guess we’ll just have to see.
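The 2x and 8x figures are consistent with a simple channel-count comparison. The sketch below assumes 32 latent channels for the new autoencoder, 16 for Flux and 4 for SDXL, with Flux and SDXL both using 8x spatial downsampling. None of these numbers appear in the thread, so treat them as assumptions to check against the respective papers. Counting values is also only a rough proxy for information content.

```python
# Assumed autoencoder configs: (spatial downsample factor, latent channels).
# These values are assumptions for illustration, not taken from the thread.
configs = {
    "SDXL": (8, 4),
    "Flux": (8, 16),
    "Sana": (32, 32),
}

image_side = 1024  # assumed resolution

def latent_size(downsample: int, channels: int) -> int:
    """Total number of values in the latent for a square image."""
    side = image_side // downsample
    return side * side * channels

totals = {name: latent_size(f, c) for name, (f, c) in configs.items()}
for name, (f, c) in configs.items():
    print(f"{name}: {c} values per latent pixel, {totals[name]} values in total")

sana_channels = configs["Sana"][1]
ratio_vs_flux = sana_channels / configs["Flux"][1]
ratio_vs_sdxl = sana_channels / configs["SDXL"][1]
print(f"per-pixel ratio vs Flux: {ratio_vs_flux}, vs SDXL: {ratio_vs_sdxl}")
```

Under these assumptions each latent pixel carries more values, but the latent as a whole is smaller than either of the other two, which bears on the question of whether fine detail such as fingers survives.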

ttul · 1y ago

There really are some “free lunches” in generative models. Really impressive work by this group. Ultimately, their model may not be the winner, because so much of what makes a good image gen model is the images and captioning that go into it, and the fine-tuning for aesthetic quality — something Midjourney and Flux both excel at. But the architecture here certainly will get into the hands of the people who can make the next great model.

Looking forward to it. This space just keeps getting more interesting.

lpasselin · 1y ago

This comes from the same group as the EfficientViT model. A few months ago, their EfficientViT model was the only modern and small ViT-style model I could find that had raw PyTorch code available. No dependencies on the shitty frameworks and libraries that other ViTs are using.

echelon · 1y ago

Image models are going to be widely available. They'll probably be a dime a dozen soon. It's great that an increasing number of models are going open, because these are the ecosystems that will grow.

3D models (sculpts, texture, retopo, etc.) are following a similar trend and trajectory.

Open video models are lagging behind by several years. While CogVideo and Pyramid are promising, video models are petabyte scale and so much more costly to build and train.

I'm hoping video becomes free and cheap, but it's looking like we might be waiting a while.

Major kudos to all of the teams building and training open source models!

wiradikusuma · 1y ago

In my opinion, what's missing in this "image GenAI" tech is the ability to generate subsequent images consistently.

That would be useful for e.g. book illustration, comic strips, icon sets. Otherwise, people would think you picked those images from all over the internet and not from one source/theme.

smusamashah · 1y ago

> (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image.
