Diffusion Is Spectral Autoregression

(sander.ai)

223 points · ackbar03 · 1y ago · 62 comments

andersbthuesen 1y ago

This post reminded me of a conversation I had with my cousins about language and learning. It’s interesting how (most?) languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure, with a “base frequency” communicating the basic idea and higher frequency overtones adding the nuances. I wonder what implications this might have in teaching current LLMs to reason?

vanderZwan 1y ago

> It’s interesting how (most?) languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure

Spoken and written languages are presented in a sequential medium. They still represent hierarchical trees in their structure though.

(A notable semi-exception to the linearity is sign languages, which are kinematic three-dimensional languages involving two hands, an entire upper body and facial expressions. While I don't speak one, I've read a bit about them, and apparently the most common error for non-deaf people who learn one is to make so-called "split verb" errors. That is to say: to sign in a linear fashion like one would with a spoken language, instead of making use of all the parallel communication options available.)

wiz21c 1y ago

In the movie Arrival, the aliens use a non sequential language.

actionfromafar 1y ago

Hm, Italian speakers look like what you describe. :-)

euroderf 1y ago

Statements can have high internal branching & nesting (clauses, referents, etc.) but it seems to hit the limits of the brain's pushdown stack pretty quickly.

vanderZwan 1y ago

Now you're making me curious why people with ADHD (me included) tend to write longer run-on sentences with commas that, on top of that, use more parentheses than average. Often nesting them, even. According to research our working memory is a little lower on average than that of neurotypicals, which seems to contradict this.

xtiansimon 1y ago

> “… languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure…”

Careful with your musings, or you might start thinking semiotically!

Diachrony and synchrony

https://en.wikipedia.org/wiki/Diachrony_and_synchrony

nyanpasu64 1y ago

> I won’t speculate about why images exhibit this behaviour and sound seemingly doesn’t, but it is certainly interesting (feel free to speculate away in the comments!).

Images have a large near-DC component (solid colors) and useful time-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from roughly 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).

What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively what if you used something other than noise (like a direct blur) for the model to reverse?

benanne 1y ago

Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397

fjkdlsjflkds 1y ago

The lack of semantics associated to DC (and near-DC) components in audio data is important, and a big difference compared to image data, no doubt.

I'm not sure this changes if you look at a cepstral representation (as suggested in the article). In this case, the DC component represents the (white) noise level in the raw audio space (i.e., the spectrum averaged over all frequencies), so it doesn't have strong semantics either (other than... "how noisy is the waveform?").

wrs 1y ago

All four audio examples are human-made, so it makes sense they emphasize the frequency range that humans distinguish best. It would be interesting to compare with natural audio to see if there’s a distinction like that found in natural vs. manmade scenes in images. (Unfortunately there are increasingly few places on Earth you can find truly natural audio with no manmade sounds audible…)

jiggawatts 1y ago

You could just generate the audio in frequency space, much like how MP3 style codecs encode the raw signal. This converts the purely 1D audio waveform into a 2D grid of values, which is more amenable to this type of diffusion-based generation.

psyq123 1y ago

It is not really 1D - to perform any T/F transform (FFT, (M)DCT, etc.) you need a number of samples in the time domain, so you are essentially transforming 2D (intensity over time) to another 2D representation (magnitude or magnitude+phase over frequency) - this is why MP3 style codecs usually have multiple frame (or "window") lengths, usually a longer one for high frequency resolution and a shorter one for high temporal resolution.
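To make the window-length trade-off concrete, here is a minimal NumPy sketch. The helper `stft_magnitude`, the 16 kHz sample rate, the 440 Hz test tone and the two window lengths are all assumptions chosen for illustration, not what any particular codec does.

```python
import numpy as np

def stft_magnitude(signal, window_length, hop):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = np.hanning(window_length)
    starts = range(0, len(signal) - window_length + 1, hop)
    frames = np.stack([signal[s : s + window_length] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 16_000  # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone

for window_length in (256, 2048):
    spec = stft_magnitude(signal, window_length, hop=window_length // 2)
    print(
        f"window={window_length}: {spec.shape[0]} frames x {spec.shape[1]} bins, "
        f"{sample_rate / window_length:.1f} Hz per bin, "
        f"{1000 * window_length / sample_rate:.0f} ms per frame"
    )
```

The short window gives many frames with coarse frequency bins, and the long window gives few frames with fine bins, which is the trade-off the two frame lengths are balancing.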

thho23i4234343 1y ago

I don't mean to be mean, but: what is surprising about any of this?

Joseph Fourier's solution to the heat equation (linear diffusion) was in fact the origin of the FT. The high-frequency coefficients decay fastest there (mode k decays as exp(-k^2 t), IIRC); the reverse is also known to be "unstable" (numerically, and it is singular from the equilibrium).

Moreover, the reformulation doesn't immediately reveal some computational speedup or a better alternative formulation (which is usually a measure of how valuable it is epistemically).

(Edit: note that the heat equation is more akin to the Fokker-Planck equation, not the actual diffusion SDE used in diffusion models.)
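For reference, the decay and the instability of the reverse direction both fall out of the textbook Fourier-series solution of the 1D heat equation on a periodic domain. The diffusivity \alpha and the hat notation are assumptions introduced here, not symbols from the post:

```latex
\frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2},
\qquad
u(x,t) = \sum_k \hat{u}_k(t)\, e^{ikx}
\;\Longrightarrow\;
\frac{d\hat{u}_k}{dt} = -\alpha k^2\, \hat{u}_k
\;\Longrightarrow\;
\hat{u}_k(t) = \hat{u}_k(0)\, e^{-\alpha k^2 t}.
```

Each mode decays at a rate proportional to k^2, so fine detail disappears first. Running time backwards multiplies mode k by e^{+\alpha k^2 t}, which amplifies any error in the high-frequency coefficients without bound as k grows.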

aDyslecticCrow 1y ago

> What is surprising about any of this?

Connections between fields drive new ideas. And this has especially been the case for recent AI progress. With the speed at which the field is moving, ideas that are obvious to some still have a significant chance of not being tried yet.

Just look at the connection between the Kalman filter and RNN models, or the significant similarities between back-propagation and the whole field of control theory. If it's truly not surprising, then that's just another reason to try it out if nobody else has.

Does everything always need to be immediately "useful"?

ackbar03 (OP) 1y ago

I think what's interesting about it is the inter-relation between different disciplines and how the ideas are connected. The connection between the heat equation and the generative diffusion models we see today, and its relation to the Fourier transform, would not have been immediately obvious to me.

joaogui1 1y ago

I mean, you didn't mention autoregressive models anywhere in your comment, whereas the post is about the connection between diffusion and autoregressive modelling. Also, it's a blog post; if it had figured out a speed-up or an improved method it would probably have been a paper.

magicalhippo 1y ago

Not my area, enjoyed the read. It reminded me of how you can decode a scaled-down version of a JPEG image by simply ignoring the higher-order DCT coefficients.

As such it seems the statement is that stable diffusion is like an autoregressive model which predicts the next set of higher-order FT coefficients from the lower-order ones.

Seems like this is something one could do with a "regular" autoregressive model, has this been tried? Seems obvious so I assume so, but curious how it compares.
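A rough sketch of the JPEG trick, for the curious. It works on a single 8x8 block, ignores quantization and entropy coding entirely, and the helper names `dct_matrix` and `downscale_block` are made up for this example, so treat it as an illustration of the idea and not as how any real decoder is written.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ x is the DCT of a length-n vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def downscale_block(block, m):
    """Decode an n x n block at m x m using only its m x m lowest DCT coefficients."""
    n = block.shape[0]
    c_n, c_m = dct_matrix(n), dct_matrix(m)
    coeffs = c_n @ block @ c_n.T    # forward 2D DCT of the full block
    low = coeffs[:m, :m] * (m / n)  # keep low frequencies, rescale for the smaller size
    return c_m.T @ low @ c_m        # inverse 2D DCT at the reduced size

block = np.outer(np.linspace(0, 255, 8), np.ones(8))  # vertical gradient
small = downscale_block(block, 4)
print(small.shape, block.mean(), small.mean())
```

The higher-order coefficients are never touched, yet the 4x4 result keeps the block's average brightness and coarse structure. That is the "coarse first, detail later" ordering the spectral autoregression reading relies on.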

benanne 1y ago

Thanks for reading! Absolutely, I included a few references that explore that approach at the bottom of section 4 (last two paragraphs).

magicalhippo 1y ago

Excellent, thanks, will check them out.

Had just finished watching the Physics of Language Models[1] talk, where they show how GPT2 models could learn non-trivial context-free grammars, as well as effectively do dynamic programming to an extent, so I thought it would be interesting to see how they performed in the spectral fine-graining task.

[1]: https://physics.allen-zhu.com/home

magicalhippo 1y ago

> I included a few references that explore that approach at the bottom of section 4

Man, reading on a mobile phone just ain't the same. Somehow managed to not catch the end of that section. The first reference, "Generating Images with Sparse Representations", is very close to what I had in mind.

shaunregenbaum 1y ago

This was a fascinating read. I wonder if anyone has done an analysis on the FT structures of various types of data from molecular structures to time series data. Are all domains different, or do they share patterns?

ackbar03 (OP) 1y ago

I guess the idea will be somewhat similar, going from coarse to fine details, such as for 3D structures.

Maybe the original author benanne could give his insight.

benanne 1y ago

I'm not sure if frequency decomposition makes sense for anything that's not grid-structured, but there is certainly evidence that there is positive "transfer" between generative modelling tasks in vastly different domains, implying that there are some underlying universal statistics which occur in almost all data modalities that we care about.

That said, the gap between perceptual modalities (image, video, sound) and language is quite large in this regard, and probably also partially explains why we currently use different modelling paradigms for them.

riemannzeta 1y ago

I can't tell if this is tongue in cheek or not...

jmmcd 1y ago

I was struck by the comparison between audio spectra and image spectra. Image spectra have a strong power law effect, but audio spectra have more power in middle bands. Why? One part of the issue is that the visual spectrum is very narrow (just 1 order of magnitude from red to blue) compared to audio (4 orders of magnitude from 20Hz to 20kHz).

But another issue not mentioned in the article is that in images we can zoom in/out arbitrarily. So the width of a pixel can change: it might be 1mm in one image, or 1cm in another, or 1m or 1km. Whereas in audio, the “width of a pixel” (the time between two audio samples) is a fixed amount of time, usually 1/44.1 kHz, and even if a clip has a different sample rate, we would convert all audio to the same sample rate before training an NN. The equivalent of this for images would be rescaling all images so that a picture of a cat is say 100x100 pixels, while a picture of a tiger is 300x300.

Which, come to think of it, would be potentially an interesting thing to do.

seiferteric 1y ago

> it might be 1mm in one image, or 1cm in another, or 1m or 1km.

Hmm, how does depth affect this? The further away something is in a picture, the more width the pixel represents since it's angular right?

jmmcd 1y ago

Right. Or to put it the other way around, the same leaf might be 1 pixel wide in one image, and 100 pixels wide in another image.

jmmcd 1y ago

> that the visual spectrum is very narrow (just 1 order of magnitude from red to blue) compared to audio (4 orders of magnitude from 20Hz to 20kHz)

I was talking nonsense here - confusing the visual spectrum of light from red to blue with the visual spectrum of images, as in "how quickly the image changes as you move across the image". The article illustrates the latter concept well.

nowayno583 1y ago

Intuitively, audio is way more sensitive to phase and persistence because of the time domain. So maybe audio models look more like video models instead of image models?

I'm not really sure how current video generating models work, but maybe we could get some insight into them by looking at how current audio models work?

I think we are looking at an autoregression of autoregressions of sorts, where each PSD + phase is used to output the next, right? Probably with different sized windows of persistence as "tokens". But I'm way out of my depth here!

bartwr 1y ago

It's the other way around - in hearing, phase is almost irrelevant. At medium frequencies, moving your head by a few centimeters changes the phase and the phase relationships of all frequencies - and we don't perceive it at all! Most audio synthesis methods work on variants of spectrograms and phase is approximated only later (mattering mostly for transients and rapid frequency content changes).

In images, scrambling phase yields a completely different image. A single edge will have the same spectral content as pink/brown~ish noise, but they look completely unlike one another.
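The image half of this is easy to check numerically. A minimal sketch, assuming a 64x64 step-edge test image and a made-up helper `scramble_phase` that borrows the phase of real-valued noise so the result stays real:

```python
import numpy as np

def scramble_phase(image, rng):
    """Keep the Fourier magnitudes of `image` but replace its phases with random ones."""
    magnitude = np.abs(np.fft.fft2(image))
    # Borrowing the phase of real-valued noise keeps the conjugate symmetry
    # needed for the inverse transform to be real.
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    return np.fft.ifft2(magnitude * np.exp(1j * noise_phase)).real

rng = np.random.default_rng(0)
edge = np.zeros((64, 64))
edge[:, 32:] = 1.0  # a vertical step edge
scrambled = scramble_phase(edge, rng)

same_spectrum = np.allclose(np.abs(np.fft.fft2(scrambled)), np.abs(np.fft.fft2(edge)))
same_pixels = np.allclose(scrambled, edge)
print(same_spectrum, same_pixels)
```

The two arrays share the same magnitude spectrum but not the same pixels: the clean edge turns into a noise-like stripe pattern once the phases no longer line up.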

nowayno583 1y ago

Makes sense! My impression that phase matters for audio comes from editing audio in a DAW or anything like that. We are very sensitive to sudden phase changes (which would be kind of like teleporting very fast from one point to another, from our head's point of view). Our ears kind of pick them up like sudden bursts of white noise (which also makes sense, given that they kind of look like an impulse when zoomed in a lot).

So when generating audio I think the next chunk needs to be continuous in phase to the last chunk, where in images a small discontinuity in phase would just result in a noisy patch in the image. That's why I think it should be somewhat like video models, where sudden, small phase changes from one frame to the next give that "AI graininess" that is so common in the current models

catgary 1y ago

I feel like Song et al. characterized diffusion models as SDEs pretty unambiguously, and that connects to Optimal Transport in a pretty unambiguous manner. I understand the desire to give different perspectives, but once you start using multiple hedge words/qualifiers like:

> basically an approximate version of the Fourier transform!

You should take a step back and ask “am I actually muddying the water right now?”

benanne 1y ago

Oof, you're not going to like this other blog post I wrote then :D https://sander.ai/2023/07/20/perspectives.html

catgary 1y ago

Well, yeah, I don’t know what you expect me to say, it’s sloppy work.

WithinReason 1y ago

To me this means that you could significantly speed up image generation by using a lower resolution at the beginning of the generation process and gradually transitioning to higher resolutions. This would also help with the attention mechanism not getting overwhelmed when generating a high resolution image from scratch.

Also, you should probably enforce some kind of frequency cutoff later when you're generating the high frequencies so that you don't destroy low frequency details later in the process.
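A back-of-the-envelope sketch of why the low-resolution start is plausible. It assumes an idealized power-law image spectrum (signal power A * f**-alpha) and flat white noise of variance sigma**2; the function name `resolved_cutoff`, the exponent, and the noise schedule below are hypothetical values picked for illustration.

```python
import numpy as np

def resolved_cutoff(sigma, alpha=2.0, amplitude=1.0):
    """Highest frequency at which signal power A * f**-alpha still exceeds noise power sigma**2."""
    return (amplitude / sigma**2) ** (1.0 / alpha)

sigmas = np.array([1.0, 0.3, 0.1, 0.03, 0.01])  # hypothetical noise schedule, high to low
cutoffs = resolved_cutoff(sigmas)
for sigma, cutoff in zip(sigmas, cutoffs):
    print(f"sigma={sigma:<5} signal dominates below f ~ {cutoff:.1f}")
```

Under these assumptions, frequencies above the cutoff are buried in noise at that step, so nothing is lost by representing early steps on a grid whose Nyquist frequency sits just above it, and the grid can grow as sigma shrinks.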

benanne 1y ago

Thanks for reading! Check out subspace diffusion: https://arxiv.org/abs/2205.01490

HarHarVeryFunny 1y ago

The high and low frequency components of speech are produced and perceived in different ways.

The lower frequencies (roughly below 4 kHz) are created by the vocal cords opening and closing at the fundamental frequency, plus harmonics of this fundamental (e.g. 100 Hz with harmonics at 200, 300, 400 Hz, etc.), with this frequency spectrum then being modulated by the resonances of the vocal tract, which change during pronunciation. What we perceive as speech is primarily the changes to these resonances (aka formants) due to articulation/pronunciation.

The higher frequencies present in speech mostly come from "white noise" created by the turbulence of forcing air out through closed teeth etc. (e.g. the "S" sound), and our perception of these "fricative" speech sounds is based on the onset/offset of energy in these higher 4-8 kHz frequencies. Frequencies above 8 kHz are not very perceptually relevant, and may be filtered out (e.g. not present in analog telephone speech).

theptip 1y ago

> The RAPSD of Gaussian noise is also a straight line on a log-log plot; but a horizontal one, rather than one that slopes down. This reflects the fact that Gaussian noise contains all frequencies in equal measure

Huh. Does this mean that pink noise would be a better prior for diffusion models than Gaussian noise, as your denoiser doesn’t need to learn to adjust the overall distribution? Or is this shift in practice not a hard thing to learn in the scale of a training run?
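For anyone who wants to poke at this, here is a small NumPy sketch that measures the log-log slope of the RAPSD for white Gaussian noise and for noise shaped to a 1/f amplitude spectrum. The helper names (`rapsd`, `shaped_noise`, `loglog_slope`), the 128x128 size and the integer-radius binning are my own assumptions and may differ from how the post computes its plots.

```python
import numpy as np

def rapsd(image):
    """Radially averaged power spectral density of a square 2D array."""
    n = image.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    # Integer radius (distance from the zero-frequency bin) for every bin.
    ky, kx = np.indices(power.shape) - n // 2
    radius = np.rint(np.hypot(kx, ky)).astype(int)
    # Mean power within each ring of constant radius.
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    return (sums / counts)[1 : n // 2]  # drop DC, keep radii below Nyquist

def shaped_noise(n, exponent, rng):
    """Gaussian noise whose amplitude spectrum falls off as 1 / f**exponent."""
    white = rng.standard_normal((n, n))
    f = np.hypot(np.fft.fftfreq(n)[:, None], np.fft.fftfreq(n)[None, :])
    f[0, 0] = 1.0  # avoid dividing by zero at DC
    return np.fft.ifft2(np.fft.fft2(white) / f**exponent).real

def loglog_slope(spectrum):
    """Least-squares slope of log(power) against log(radial frequency index)."""
    freqs = np.arange(1, len(spectrum) + 1)
    return np.polyfit(np.log(freqs), np.log(spectrum), 1)[0]

rng = np.random.default_rng(0)
white_slope = loglog_slope(rapsd(rng.standard_normal((128, 128))))
pink_slope = loglog_slope(rapsd(shaped_noise(128, 1.0, rng)))
print(f"white: {white_slope:.2f}, 1/f-shaped: {pink_slope:.2f}")
```

The white-noise slope should come out roughly flat and the shaped noise roughly at -2 (amplitude 1/f means power 1/f^2). It only shows how the prior's spectrum could be matched to image statistics; whether that helps training is exactly the open question.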

slashdave 1y ago

This has little to do with diffusion. The aspects described relate to images (and sound) and are true for VAE models, for example. I mean, what else is a UNet?

theo1996 1y ago

Well yes, econometrics and time series analysis had already described all the methods and functions for "AI", but marketing idiots decided to create new names for 30-year-old knowledge.
