Highres Spectrograms with the DFT Shift Theorem

(soundshader.github.io)

124 points · ssgh · 5y ago · 69 comments

LeegleechN · 5y ago

It's unfortunate that the article doesn't get into the fundamental limits of spectrogram resolution, which follow from the famous uncertainty principle (https://en.wikipedia.org/wiki/Fourier_transform#Uncertainty_...). In particular, there is a fundamental tradeoff between frequency resolution and time resolution, similar to the position/momentum tradeoff in quantum mechanics. The Continuous Wavelet Transform, which is alluded to in the article, is a way to tune that tradeoff per frequency bin to best align with human sound perception.
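
To put numbers on that tradeoff for a plain windowed DFT: a window of N samples at sample rate fs covers N/fs seconds and spaces its bins fs/N Hz apart, so the product of the two is always 1. A minimal sketch (the function name and the window sizes are illustrative, not taken from the article):

```python
def stft_resolution(sample_rate_hz, window_samples):
    """Return (bin spacing in Hz, window duration in seconds) for one DFT frame."""
    bin_hz = sample_rate_hz / window_samples
    window_s = window_samples / sample_rate_hz
    return bin_hz, window_s


# Longer windows give finer frequency bins but blur events in time.
for n in (256, 1024, 4096):
    bin_hz, window_s = stft_resolution(48_000, n)
    print(f"N={n:5d}  bin spacing={bin_hz:8.3f} Hz  window={window_s * 1000:7.2f} ms")
```

For the 1024-sample window at 48 kHz this gives 46.875 Hz per bin and a window of about 21.3 ms.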

crazygringo · 5y ago

> there is a fundamental tradeoff between frequency resolution and time resolution

I've always found it interesting that while that's fundamentally true in terms of information, my understanding is that we perceive things with far more resolution than the uncertainty principle would allow. Specifically, we're able to judge frequencies with far more accuracy than a fuzzy spectrogram would suggest.

From what I understand, our brain essentially performs a kind of "deconvolution" on the fuzzy frequency data to identify a far "sharper" and defined frequency, which is relatively straightforward since the frequency "spread" is a known quantity.

This works well most of the time because we correctly assume we're dealing with relatively isolated sound sources emanating a distinct fundamental with a distinct series of overtones.

Our perception can become inaccurate when that assumption fails to hold, and so sounds merge or become indistinguishable, we hear beat tones that don't technically exist, our brain gives up trying to hear frequencies and classifies it all as noise, etc.

I've never come across audio spectrogram software that attempted to perform a frequency deconvolution in a way that roughly simulates what our own ears do, but I'd love to know if anyone else has and could point me to it.

zihotki · 5y ago

Our brain and ears perceive sound not as separate frequencies but as bands of frequencies. Because of that, humans are not able to differentiate frequencies that are very close to each other. We can identify separate musical notes because they fall into separate bands.

Amir from AudioScienceReview did a good introductory video about psychoacoustics as well as frequency response in general: https://www.youtube.com/watch?v=TwGd0aMn1wE

Scene_Cast2 · 5y ago

Another part is that the human hearing is probably closer to Empirical Mode Decomposition (EMD) than a Fourier variant.

With EMD, a phantom "beat frequency" would actually show up in the transform space.

LeegleechN · 5y ago

The purely algorithmic way to do it is the Wigner-Ville distribution, but it isn't practical for complex sounds due to the quadratic explosion of interactions between all time-frequency components. For a small number of well-separated 'chirp' signals it can give you exact localization.

I think the software you are looking for would have to be based on machine learning rather than a purely theory-based approach if it's intended for use with natural sound signals.

bradrn · 5y ago

I’m pretty sure I heard somewhere that the ear does autocorrelation rather than a Fourier transform, but I’m not sure how correct that is.

EDIT: Scene_Cast2’s comment below says it’s Empirical Mode Decomposition, not autocorrelation.

Anyway, I see no reason why spectrograms have to be fuzzy… a wide window size can locate frequencies very precisely while smoothing out fast variations in amplitude, which sounds pretty similar to how we hear things.

(Interestingly, when analysing the voice, linguists tend to use the opposite: a narrow window size, which smears out frequencies making the resonance bands more obvious, while allowing visualisation of fast glottal vibrations.)

carlosf · 5y ago

Search for voice coding.

That's how many voice coding algorithms work: you try to find a digital filter that generates a sound as close as possible to the original according to a perception-based metric, then transmit the filter coefficients.

I don't remember the exact details, but if I'm not mistaken generating this sort of metric is really time consuming.

gbh444g · 5y ago

I did experiment with CWT in the past [1] and was disappointed, to be honest. Not only is it grossly slow and complicated, it hardly gives more fidelity than a plain FFT. It has the "window problem", which makes the low freqs too blurred and the high freqs too sharp, and it has the "wrapping ends" problem, which makes it necessary to pad the input (about 1 million samples at least) with sufficient zeros on both ends, as otherwise the two ends will interfere with each other.

Below is my GPU-based CWT that's 50x slower than the JS-only version in the post above.

[1] https://soundshader.github.io/?s=cwt

andai · 5y ago

I've been wondering about the apparent contradiction between the limitations of spectrograms and the remarkable fidelity of MP3 files, which I thought operated along similar lines.

When you convert a spectrogram back into sound it sounds like crap, but then how does MP3 store the frequency information (and why can't we use that for visualizations)?

The math is beyond my understanding, can anyone give some kind of analogy maybe?

jcelerier · 5y ago

> When you convert a spectrogram back into sound it sounds like crap

fft gives you the spectrum + the phase. if you only use the spectrum to resynthesise you're missing half the information. temporal domain <-> spectral domain is a 99.9999999% lossless (not 100% I believe because of floating-point shenanigans, but enough to not matter at all) transform in both directions.

electriccello · 5y ago

I think the trouble you're running into is that a spectrogram discards phase information so it's not informationally complete, and impossible to perfectly invert. Basically, a Fourier Transform represents a sound as a series of many sound waves at different frequencies added together. In order to make a pretty picture, the phase is thrown away, and only the magnitude of each wave is shown. The trouble is, to go back to a pleasant/accurate sound, we need that phase information that is missing.
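
The point about phase is easy to check numerically. The sketch below uses a naive O(N²) DFT from the standard library only (the helper names `dft` and `idft` and the two-partial test signal are made up for illustration): inverting the full complex spectrum recovers the signal to within rounding error, while inverting the magnitudes alone does not.

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (fine for small N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


n = 64
# Two partials with different starting phases (an arbitrary test signal).
signal = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.cos(2 * math.pi * 7 * t / n + 1.0)
          for t in range(n)]

spectrum = dft(signal)
with_phase = [v.real for v in idft(spectrum)]                       # magnitude and phase
magnitude_only = [v.real for v in idft([abs(c) for c in spectrum])]  # phase discarded

err_with_phase = max(abs(a - b) for a, b in zip(signal, with_phase))
err_magnitude_only = max(abs(a - b) for a, b in zip(signal, magnitude_only))
print(f"max error with phase:    {err_with_phase:.2e}")
print(f"max error without phase: {err_magnitude_only:.2e}")
```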

bad_username · 5y ago

I implemented a simple clone of MP3 and it was not that hard. If you do a discrete Fourier transform of the audio (in small overlapping windows), quantize the resulting coefficients, and compress them losslessly using Huffman codes, you will end up with something not that far from MP3. The human ear is quite forgiving of the effects of quantization in the frequency domain.

MP3 does not have remarkable fidelity though. MP3, and my clone of it, suffers from time domain artifacts. Quantization in the frequency domain causes distortion in the time domain as well, negatively affecting high frequency transient sounds like cymbals. That is more noticeable. Newer generation codecs like AAC handle transients much better, but they are considerably more advanced, and often use different transforms like wavelet transform.
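
A toy version of the quantization step, and of the time-domain artifact it causes, might look like the sketch below. It follows the comment's description (a DFT of one block, then uniform quantization), not the actual MP3 format, and it skips the overlapping windows and the Huffman stage; the block length, step size, and burst signal are arbitrary assumptions. The block is silent until `onset`, yet the decoded block is not, because quantization error in the frequency domain spreads over the whole block in time.

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (fine for small N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def quantize(c, step):
    """Round the real and imaginary parts to the nearest multiple of step."""
    return complex(round(c.real / step) * step, round(c.imag / step) * step)


n, onset = 64, 40
# Silence, then a short decaying burst standing in for a transient.
block = [0.0] * onset + [0.8 ** (t - onset) * math.cos(2.1 * (t - onset))
                         for t in range(onset, n)]

coarse = [quantize(c, step=0.5) for c in dft(block)]
decoded = [v.real for v in idft(coarse)]  # keep the real part of the result

pre_echo = max(abs(v) for v in decoded[:onset])
print(f"largest decoded sample before the onset: {pre_echo:.4f} (original: 0)")
```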

gugagore · 5y ago

The general concepts are described here: https://en.m.wikipedia.org/wiki/Psychoacoustics

I'm not sure what you mean by converting the spectrogram to sound, but my guess is that the windowing done on the short-time Fourier transform is causing artifacts.

achillesheels · 5y ago

My hypothesis: it is stored magnetically (after all magnetic sinusoidals exist) and converted electrically once the mp3 is activated in time.

gbh444g · 5y ago

Hello HN! Author here. I was thinking of calling the post "The underappreciated complexity of musical sounds" but decided to stick with the DFT one as it would probably get more attention. This is a small discovery I came across this weekend. FFT-based spectrograms of musical instruments aren't a novel thing to do, but I thought: what if I do a super highres spectrogram with a continuum of frequencies, instead of the N fixed ones FFT gives? Turns out, FFT "supports" such frequency shifting by multiplying the input by a specially constructed complex exponential. As a result, I've found out that musical instruments produce sophisticated ornaments in between the harmonic levels.
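
The shifting trick can be sketched in a few lines. This is an illustration of the DFT shift theorem with a naive stdlib DFT, not the article's actual code; the helper names are made up, and the sign of the exponent depends on the DFT convention (here the forward transform uses exp(-2πi·k·t/N), so multiplying the input by exp(-2πi·a·t/N) moves every bin from k to k + a).

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (fine for small N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def shifted_dft(x, shift_bins):
    """DFT sampled at bins k + shift_bins, obtained by modulating the input first."""
    n = len(x)
    modulated = [x[t] * cmath.exp(-2j * math.pi * shift_bins * t / n) for t in range(n)]
    return dft(modulated)


def spectrum_at(x, bin_position):
    """Direct evaluation at one (possibly fractional) bin position, for comparison."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * bin_position * t / n) for t in range(n))


n = 32
x = [math.cos(2 * math.pi * 5.3 * t / n) for t in range(n)]  # tone between bins 5 and 6

half_bins = shifted_dft(x, 0.5)
worst = max(abs(half_bins[k] - spectrum_at(x, k + 0.5)) for k in range(n))
print(f"largest mismatch against direct evaluation: {worst:.2e}")

# Interleaving the plain and half-shifted transforms doubles the frequency sampling.
interleaved = [v for pair in zip(dft(x), half_bins) for v in pair]
```

Repeating this with shifts of 1/4, 1/2, and 3/4 of a bin (and so on) samples the same underlying spectrum on an ever finer frequency grid.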

cviilgan · 5y ago

Did I understand this correctly, what you are doing is essentially:

X[n] = F[x[k]][n/2] if (n even) else F[x'[k]][(n+1)/2]

With F[x[k]] the DFT of the time-domain signal x[k], x'[k] = x[k]·exp(2·pi·i·k·alpha) and this alpha some constant which yields a frequency-domain shift by 25Hz.

If so: How does this method compare to zero-padding the time-domain signal (i.e. sinc-interpolating the frequency domain)? It is an interesting concept, but alas it's not immediately clear to me how to analyze this...

gbh444g · 5y ago

This sounds about right. I assume your (n+1)/2 is really n+1/2. The idea, like you've said, is to get Y[k+1/2] values where Y = FFT[X].

Whether this is mathematically sound is another question. I presume that it is, for two reasons. First, FFT essentially convolves X with a bunch of sinusoids with frequencies from a fixed set: 0 Hz, 50 Hz, 100 Hz and so on. There's nothing wrong with manually convolving X with a 57.3 Hz sinusoid, it's just FFT isn't designed for this (it's designed for rapid computation). The other reason is that combining such shifted FFTs we get what looks almost exactly like a CWT (i.e. wavelet transform).

As for sinc-interpolation, I think it's mathematically equivalent. Say we shift the input X with Z[k] = exp(ik/N...) and get XZ. Then we transform it to FFT[XZ] = FFT[X] conv FFT[Z], so it's convolving FFT[X] with FFT[Z] where FFT[Z] is probably that sinc kernel. I certainly know from experiments that FFT of exp(2·pi·i·k·alpha) where alpha doesn't precisely align with the 1024 grid produces a fuzzy function with a max around alpha and a bell-shaped curve around it; the width of the curve depends on how precisely alpha fits into one of the 1024 grid points.
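
The equivalence with zero-padding can be checked directly for the half-bin case. In the sketch below (naive stdlib DFT, arbitrary test signal, illustrative names), the 2N-point DFT of the zero-padded input has the plain N-point DFT at its even indices and the half-bin-shifted DFT at its odd indices.

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (fine for small N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


n = 32
x = [math.cos(2 * math.pi * 5.3 * t / n) + 0.3 * math.sin(2 * math.pi * 11 * t / n)
     for t in range(n)]

plain = dft(x)
# Half-bin shift: multiply by exp(-2*pi*i*0.5*t/n) = exp(-i*pi*t/n) before the DFT.
half_shifted = dft([x[t] * cmath.exp(-1j * math.pi * t / n) for t in range(n)])
# Zero-pad to twice the length and take one longer DFT.
padded = dft(x + [0.0] * n)

diff_even = max(abs(padded[2 * k] - plain[k]) for k in range(n))
diff_odd = max(abs(padded[2 * k + 1] - half_shifted[k]) for k in range(n))
print(f"even bins vs plain DFT:        {diff_even:.2e}")
print(f"odd bins vs half-shifted DFT:  {diff_odd:.2e}")
```

So for a fixed grid of shifts the two routes give the same numbers; the difference is only in how the work is organised (several N-point transforms versus one longer transform).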

CyberRabbi · 5y ago

My mind is kind of blown that birdsong virtually does not include higher harmonics. I didn’t even think that was possible for a physical resonator. Great post

akomtu · 5y ago

I think the mystery has a simple explanation: when a bird sings at 7 kHz and the mp3 file captures only the first 20 kHz, there isn't much room for harmonics. Maybe birds do have interesting harmonics at 56 kHz, we just don't know.

ttoinou · 5y ago

Maybe they were not captured by the bandwidth-limited microphone?

Prcmaker · 5y ago

Thanks for writing this up. I'm always on the lookout for alternative methods for DFTs and the like, currently concentrating on interpolation of low frequencies (after DC, but still within the first 5% of wave numbers). I'll see if this fits my use case soon, hopefully today.

stainforth · 5y ago

What is an ornament?

efnx · 5y ago

I love this and have been looking for a program that's like Photoshop for sound.

dkarras · 5y ago

>looking for a program that's like Photoshop for sound.

Look at Izotope RX.

Especially the Spectral Repair module might be what you are imagining, but it has a lot of interesting tools. This is from an older version: https://www.youtube.com/watch?v=vNtxg28wx_M

onirom · 5y ago

There is Photosounder, which has excellent sound quality (all editing is done in the frequency domain and then converted back).

For free there is also Virtual ANS and https://www.fsynth.com on the more experimental side (conversion is done raw using additive synthesis, phase information is lost so sound quality is affected)

layoutIfNeeded · 5y ago

You can try interpreting images as spectrograms, but the result will be a cacophonic mess.

There’s a reason why nobody does this. (Other than avant-garde experimental composers maybe, but they are looking for cacophony.)

danwills · 5y ago

It most often yields a cacophonous mess, it's true, but it really depends on how carefully the image is made. If the image was made using a fully detailed FFT of a sound, then you could in theory get the exact same sound back out that you put in, I reckon! :) (Admittedly, negative color values would be required (or a clever mapping), along with spreading time-in-the-sound over space-in-the-image.)

crazygringo · 5y ago

> Smoothness in the time direction is easier to achieve: the 1024 bins window can be advanced by arbitrarily small time steps.

It appears you're doing just that, but the time "width" is still readily apparent in many of the spectrograms, most obviously on the birdsong ones -- almost like a horizontal motion blur.

Would a deconvolution filter be able to meaningfully horizontally "deblur" the spectrograms? So the birdsongs didn't appear to be drawn with a wide-tip marker, but rather a ballpoint pen? So not just hi-res, but hi-focus.

malka · 5y ago

This might be relevant: https://ccrma.stanford.edu/~juhan/super_spec.html

crazygringo · 5y ago

Thank you for that, that is fascinating!

Lichtso · 5y ago

On that note, also check out wavelets to generate spectrograms: https://en.wikipedia.org/wiki/Wavelet

I have some implementations here: https://github.com/Lichtso/CCWT https://github.com/Lichtso/WebSpectrogram

kortex · 5y ago

This is fantastic! About 5 years ago (just before this repo was made it seems) I was doing a ton of stuff with EEG analysis with python. Used CWTs a ton but it was slooow, even with lots of numpy tricks. This would have been super handy.

I also learned in that time that while you can extract a complex signal from a real one using the Hilbert transform, it's not quite the same, and I've always wondered if we could achieve better fidelity/encoding/compression by starting with quadrature signals. Never figured out quite why, since Shannon-Nyquist says you should be able to encode all information of a bandwidth f signal with 2f sample rate, but I suspect it has to do with the difference between ideal real number math and nonlinear, quantizing, imperfect ADCs.

Not sure how you'd actually get quadrature signals from sound waves or any wideband scalar signal (maybe record at far higher sampling rate to get more phase information, then downsample), but it's a fun thought experiment.

krapht · 5y ago

You can convert to quadrature by either sampling a signal at >= 2 * Nyquist and using the Hilbert transform, or using two ADCs 90 degrees out of phase.

When you run two ADCs 90 degrees out of phase that introduces another source of error, due to timing jitter. There's no reason to bother doing this for audio signals because modern ADCs are more than capable of accurately sampling at audio rates.
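
For the Hilbert-transform route, a common FFT-based construction is to zero the negative-frequency bins and double the positive ones; the inverse transform is then a complex "analytic" signal whose real part is the original (I) and whose imaginary part is the 90-degree-shifted copy (Q). A minimal sketch with a naive stdlib DFT, assuming an even block length (`analytic_signal` is an illustrative name, not a library call):

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (fine for small N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def analytic_signal(x):
    """Keep DC and Nyquist, double positive frequencies, zero negative ones."""
    n = len(x)  # assumed even
    spectrum = dft(x)
    for k in range(n):
        if 0 < k < n // 2:
            spectrum[k] *= 2
        elif k > n // 2:
            spectrum[k] = 0
    return idft(spectrum)


n = 64
x = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
z = analytic_signal(x)

# For a cosine input, I should be the cosine itself and Q the matching sine.
err_i = max(abs(z[t].real - x[t]) for t in range(n))
err_q = max(abs(z[t].imag - math.sin(2 * math.pi * 4 * t / n)) for t in range(n))
print(f"I error: {err_i:.2e}, Q error: {err_q:.2e}")
```

The result is exact here only because the test tone sits on a bin; for general signals the block edges introduce errors, which is one reason this is a sketch rather than a recipe.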

crazygringo · 5y ago

This looks cool! But really needs "before" and "after" comparison images -- lo-res vs hi-res.

Seeing the hi-res images only gives me no idea what kind of improvement this is showing...

@gbh444g Hope you could maybe add some lo-res versions :)

(Would also be cool to have audio clips next to each image as well, but that's less important.)

swiley · 5y ago

It had a bit of that for the bird songs.

bobowzki · 5y ago

The spectrograms on this site have a lot of spectral leakage. This can be improved a lot by applying a window function (blackman, hanning etc). It doesn't seem like the author does this.

ssgh (OP) · 5y ago

Applying the Hann window function eliminates all the spectral leakage, but it also makes the image rather dull and precise, very similar to CWT. You've made me realise that the intricate patterns seen on violin spectrograms are the result of interference of the spectral leakage from main harmonics. It doesn't mean the patterns are fake. It means the patterns emerge only when the input sound is transformed a certain way (FFT with the rectangular window).
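
The effect of the window is easy to see on a single tone that falls halfway between two bins. The sketch below (naive stdlib DFT; the tone position and the range of "far away" bins are arbitrary choices) compares the largest magnitude well away from the tone with a rectangular window and with a Hann window. In this setup the Hann window cuts the distant leakage by a large factor rather than removing it entirely, and it does so at the cost of a wider peak.

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (fine for small N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


n = 64
tone = [math.cos(2 * math.pi * 10.5 * t / n) for t in range(n)]  # halfway between bins
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]

rect_mag = [abs(c) for c in dft(tone)]  # no window = rectangular window
hann_mag = [abs(c) for c in dft([s * w for s, w in zip(tone, hann)])]

far_bins = range(20, 31)  # roughly 10 to 20 bins above the tone
rect_leak = max(rect_mag[k] for k in far_bins)
hann_leak = max(hann_mag[k] for k in far_bins)
print(f"rectangular window, far leakage: {rect_leak:.4f}")
print(f"Hann window, far leakage:        {hann_leak:.4f}")
```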

kazinator · 5y ago

> A typical FFT-based spectrogram uses 1024 bins on a 48 kHz audio, with about 50 Hz step per pixel. Most of the interesting audio activity happens below 3 kHz, so 50 Hz per pixel gives only 60 pixels for that area.

That seems misleading. First of all, how often do you take a 1024 sample FFT? In theory, you could calculate it every sample, in which case you have 60 pixels, but 48,000 times per second.

Secondly, you can make use of frame-over-frame phase information. If you are looking at signals with mostly periodic content in that 3 kHz band, the phase information can indicate how much the signal in a given band deviates from that band's center frequency.

If the signal is dead on the frequency, then the phase component is stable frame-over-frame; the value does not move. If the signal is off, the phase angle shifts, kind of like a CRT television that is out of vertical sync. Each frame catches the signal in a different phase compared to the previous frame due to the frequency drift. The farther the signal is from the FFT band's frequency, the faster the phase angle rotates.

If you analyze the movement of phase of the same bin between successive frames, you can get a higher resolution estimate of the frequency than what you might think is possible from the 50 Hz resolution of that bin.

What you can't resolve is the situation when multiple independent signals clash in the same frequency bin. The assumption has to hold that the bin has caught one periodic signal.
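
The frame-over-frame phase idea can be sketched as follows. This is one common way to do it (the same principle a phase vocoder uses), written with the standard library only; the function names, the Hann window, the hop size, and the 1000 Hz test tone are assumptions made for the example. Only the single bin of interest is evaluated, for two frames one hop apart, and the phase advance beyond what a bin-centre tone would produce is converted back to a frequency offset.

```python
import cmath
import math


def bin_value(signal, start, size, k):
    """Hann-windowed DFT of one frame, evaluated at bin k only."""
    total = 0j
    for t in range(size):
        window = 0.5 - 0.5 * math.cos(2 * math.pi * t / size)
        total += signal[start + t] * window * cmath.exp(-2j * math.pi * k * t / size)
    return total


def refine_frequency(signal, sample_rate, size, hop, k):
    """Estimate the frequency caught in bin k from its phase advance over one hop."""
    phase_a = cmath.phase(bin_value(signal, 0, size, k))
    phase_b = cmath.phase(bin_value(signal, hop, size, k))
    expected = 2 * math.pi * k * hop / size  # advance of a tone exactly at bin centre
    # Wrap the extra phase advance into [-pi, pi).
    deviation = (phase_b - phase_a - expected + math.pi) % (2 * math.pi) - math.pi
    return k * sample_rate / size + deviation * sample_rate / (2 * math.pi * hop)


sample_rate, size, hop = 48_000, 1024, 256
true_hz = 1000.0
signal = [math.sin(2 * math.pi * true_hz * t / sample_rate) for t in range(size + hop)]

k = round(true_hz * size / sample_rate)  # nearest bin
bin_centre_hz = k * sample_rate / size
estimate_hz = refine_frequency(signal, sample_rate, size, hop, k)
print(f"bin centre: {bin_centre_hz:.3f} Hz, refined estimate: {estimate_hz:.3f} Hz")
```

The estimate is only unambiguous while the extra phase advance stays within half a turn, i.e. for offsets smaller than sample_rate / (2 · hop), and, as the comment says, only while a single periodic component dominates the bin.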

zihotki · 5y ago

I wonder how we can make assumptions about bird songs without taking into account how birds perceive sound.

For humans it's easier: plenty of studies have been done in that regard, and there is even a separate field of science for studying human sound perception, psychoacoustics. Humans perceive sound in bands (a band is a range of frequencies), not separate frequencies. And the size of the bands varies with frequency, so that in the voice range they are narrower than, for example, at high frequencies. The FFT fits very nicely into that picture, and codecs were designed with human perception in mind.

As for animals, I don't know of any studies in that regard. I would assume that their perception is very similar to that of humans, at least at the mechanical level. As for the sensitivity and the size of the bands, as well as the dynamic range, it's hard to say. I'd love to see some studies that dig into the details, but it seems they are very hard to do. Animals don't give you direct feedback.

neogodless · 5y ago

Some of my family and I have been enjoying playing with the BirdNET[0] app which seems to use the ideas presented here to identify birds from recordings, utilizing machine learning.

[0] https://play.google.com/store/apps/details?id=de.tu_chemnitz..., https://apps.apple.com/us/app/birdnet/id1541842885

andai · 5y ago

Just a heads up, you have to click the images to see the full resolution version! I spent a good while confused about not being able to see the details mentioned in the images.

tantalor · 5y ago

> as if birds “draw” with sound something that’s flying backwards in time

huh?

jmpeax · 5y ago

> Despite this CWT implementation runs on GPU and this “advanced” FFT runs on JS, CWT is about 50-100x slower.

Sounds like a really crappy implementation of CWT. Besides this, the mother wavelet used was not specified, so maybe the author doesn't really know much about CWT.
