I've always found it interesting that while that's fundamentally true in terms of information, my understanding is that we perceive things with far more resolution than the uncertainty principle would allow. Specifically, we're able to judge frequencies with far more accuracy than a fuzzy spectrogram would suggest.
From what I understand, our brain essentially performs a kind of "deconvolution" on the fuzzy frequency data to identify a far "sharper" and defined frequency, which is relatively straightforward since the frequency "spread" is a known quantity.
This works well most of the time because we correctly assume we're dealing with relatively isolated sound sources emanating a distinct fundamental with a distinct series of overtones.
Our perception can become inaccurate when that assumption fails to hold: sounds merge or become indistinguishable, we hear beat tones that don't technically exist, our brain gives up trying to hear frequencies and classifies it all as noise, etc.
I've never come across audio spectrogram software that attempted to perform a frequency deconvolution in a way that roughly simulates what our own ears do, but I'd love to know if anyone else has and could point me to it.
Amir from AudioScienceReview did a good introductory video about psychoacoustics as well as frequency response in general: https://www.youtube.com/watch?v=TwGd0aMn1wE
With EMD, a phantom "beat frequency" would actually show up in the transform space.
I think the software you are looking for would have to be based on machine learning rather than a purely theory-based approach if it's intended for use with natural sound signals.
EDIT: Scene_Cast2’s comment below says it’s Empirical Mode Decomposition, not autocorrelation.
Anyway, I see no reason why spectrograms have to be fuzzy… a wide window size can locate frequencies very precisely while smoothing out fast variations in amplitude, which sounds pretty similar to how we hear things.
(Interestingly, when analysing the voice, linguists tend to use the opposite: a narrow window size, which smears out frequencies making the resonance bands more obvious, while allowing visualisation of fast glottal vibrations.)
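To put rough numbers on the wide-versus-narrow window trade-off, here is a small sketch. The 48,000 Hz sample rate and the two window sizes are assumptions picked for illustration, and `window_tradeoff` is a hypothetical helper. It reports only the raw FFT bin spacing; the spacing you can actually resolve also depends on the window function.

```python
def window_tradeoff(sample_rate, window_size):
    """Return (bin spacing in Hz, window duration in ms) for one FFT frame."""
    bin_spacing_hz = sample_rate / window_size
    duration_ms = 1000.0 * window_size / sample_rate
    return bin_spacing_hz, duration_ms

SAMPLE_RATE = 48_000  # assumed sample rate, samples per second

# Wide window: fine frequency grid, but each frame averages over a long time span.
wide = window_tradeoff(SAMPLE_RATE, 8192)
# Narrow window: coarse frequency grid, but fast events stay visible.
narrow = window_tradeoff(SAMPLE_RATE, 256)

for name, (spacing, duration) in (("wide", wide), ("narrow", narrow)):
    print(f"{name}: {spacing:.2f} Hz per bin, {duration:.2f} ms per frame")
```

The product of the two numbers is the same for every window size, so sharpening one axis always blurs the other by the same factor.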
That's how many voice coding algorithms work: you try to find a digital filter that generates a sound as close as possible to the original according to a perception-based metric, then transmit the filter coefficients.
I don't remember the exact details, but if I'm not mistaken generating this sort of metric is really time consuming.
Below is my GPU-based CWT that's 50x slower than the JS-only version in the post above.
When you convert a spectrogram back into sound it sounds like crap, but then how does MP3 store the frequency information (and why can't we use that for visualizations)?
The math is beyond my understanding, can anyone give some kind of analogy maybe?
The FFT gives you the magnitude spectrum plus the phase. If you only use the magnitudes to resynthesise, you're missing half the information. Temporal domain <-> spectral domain is a 99.9999999% lossless transform in both directions (not 100%, I believe, because of floating-point shenanigans, but enough to not matter at all).
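A quick way to see this for yourself, as a dependency-free sketch: the naive `dft`/`idft` helpers and the two-tone test signal below are made up for illustration and are not taken from any codec. The full complex spectrum round-trips to within floating-point error, while throwing away the phase gives a visibly different waveform.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for tiny inputs)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

n = 64
# Two sinusoids with non-zero starting phases (arbitrary test signal).
signal = [math.sin(2 * math.pi * 3 * k / n + 0.7)
          + 0.5 * math.cos(2 * math.pi * 10 * k / n + 2.0) for k in range(n)]

spectrum = dft(signal)

# Resynthesis from magnitude AND phase: recovers the signal.
full = [v.real for v in idft(spectrum)]
# Resynthesis from magnitudes only: every component is reset to zero phase.
magnitude_only = [v.real for v in idft([abs(c) for c in spectrum])]

err_full = max(abs(a - b) for a, b in zip(signal, full))
err_mag = max(abs(a - b) for a, b in zip(signal, magnitude_only))
print("max error with phase:   ", err_full)
print("max error without phase:", err_mag)
```

A spectrogram image normally stores only the magnitudes, which is why turning one back into audio requires guessing the phase.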
MP3 does not have remarkable fidelity though. MP3, and my clone of it, suffer from time-domain artifacts. Quantization in the frequency domain causes distortion in the time domain as well, negatively affecting high-frequency transient sounds like cymbals. That is more noticeable. Newer generation codecs like AAC handle transients much better, but they are considerably more advanced, and often use different transforms like wavelet transform.
I'm not sure what you mean by converting the spectrogram to sound, but my guess is that the windowing done on the short-time Fourier transform is causing artifacts.
X[n] = F[x[k]][n/2] if (n even) else F[x'[k]][(n+1)/2]
With F[x[k]] the DFT of the time-domain signal x[k], x'[k] = x[k]·exp(2·pi·i·k·alpha) and this alpha some constant which yields a frequency-domain shift by 25Hz.
If so: How does this method compare to zero-padding the time-domain signal (i.e. sinc-interpolating the frequency domain)? It is an interesting concept, but alas it's not immediately clear to me how to analyze this...
Whether this is mathematically sound is another question. I presume that it is, for two reasons. First, FFT essentially convolves X with a bunch of sinusoids with frequencies from a fixed set: 0 Hz, 50 Hz, 100 Hz and so on. There's nothing wrong with manually convolving X with a 57.3 Hz sinusoid, it's just FFT isn't designed for this (it's designed for rapid computation). The other reason is that combining such shifted FFTs we get what looks almost exactly like a CWT (i.e. wavelet transform).
As for sinc-interpolation, I think it's mathematically equivalent. Say we shift the input X with Z[k] = exp(ik/N...) and get XZ. Then we transform it to FFT[XZ] = FFT[X] conv FFT[Z], so it's convolving FFT[X] with FFT[Z], where FFT[Z] is probably that sinc kernel. I certainly know from experiments that the FFT of exp(2·pi·i·k·alpha), where alpha doesn't precisely align with the 1024 grid, produces a fuzzy function with a max around alpha and a bell-shaped curve around it. The width of the curve depends on how precisely alpha fits into one of the 1024 grid points.
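The equivalence can be checked numerically. The sketch below assumes the shift is exactly half a bin (alpha = 0.5/N in cycles per sample) and uses the interleaving rule from the formula earlier in the thread: even outputs from the plain DFT, odd outputs from the DFT of the modulated signal. The naive `dft` helper, the 16-sample test signal and the wrap-around of the last index are illustrative choices, not anyone's actual implementation.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for tiny inputs)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

n = 16
# Arbitrary test signal: an off-grid sinusoid plus a ramp.
x = [math.sin(2 * math.pi * 2.3 * k / n) + 0.3 * k / n for k in range(n)]

alpha = 0.5 / n  # assumed shift: half a bin, in cycles per sample
shifted = [x[k] * cmath.exp(2j * math.pi * k * alpha) for k in range(n)]

plain = dft(x)          # samples the spectrum at bins 0, 1, 2, ...
half = dft(shifted)     # samples the spectrum at bins -0.5, 0.5, 1.5, ...

# X[j] = plain[j/2] for even j, half[(j+1)/2] for odd j (index wraps at the end).
interleaved = [plain[j // 2] if j % 2 == 0 else half[((j + 1) // 2) % n]
               for j in range(2 * n)]

# Zero-padding the time-domain signal to twice its length.
padded = dft(x + [0.0] * n)

max_diff = max(abs(a - b) for a, b in zip(interleaved, padded))
print("largest difference between the two methods:", max_diff)
```

For this half-bin case the two methods agree to within rounding error, so the shifted-FFT trick adds grid points but no information that zero-padding would not also give.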
Look at Izotope RX.
Especially the Spectral Repair module might be what you are imagining, but it has a lot of interesting tools. This is from an older version: https://www.youtube.com/watch?v=vNtxg28wx_M
For free there is also Virtual ANS and https://www.fsynth.com on the more experimental side (conversion is done raw using additive synthesis, phase information is lost so sound quality is affected)
There’s a reason why nobody does this. (Other than avant-garde experimental composers maybe, but they are looking for cacophony.)
It appears you're doing just that, but the time "width" is still readily apparent in many of the spectrograms, most obviously on the birdsong ones -- almost like a horizontal motion blur.
Would a deconvolution filter be able to meaningfully horizontally "deblur" the spectrograms? So the birdsongs didn't appear to be drawn with a wide-tip marker, but rather a ballpoint pen? So not just hi-res, but hi-focus.
I have some implementations here: https://github.com/Lichtso/CCWT https://github.com/Lichtso/WebSpectrogram
I also learned in that time that while you can extract a complex signal from a real one using the Hilbert transform, it's not quite the same, and I've always wondered if we could achieve better fidelity/encoding/compression by starting with quadrature signals. Never figured out quite why, since Shannon-Nyquist says you should be able to encode all information of a bandwidth f signal with 2f sample rate, but I suspect it has to do with the difference between ideal real number math and nonlinear, quantizing, imperfect ADCs.
Not sure how you'd actually get quadrature signals from sound waves or any wideband scalar signal (maybe record at far higher sampling rate to get more phase information, then downsample), but it's a fun thought experiment.
When you run two ADCs 90 degrees out of phase that introduces another source of error, due to timing jitter. There's no reason to bother doing this for audio signals because modern ADCs are more than capable of accurately sampling at audio rates.
Seeing only the hi-res images gives me no idea what kind of improvement this is showing...
@gbh444g Hope you could maybe add some lo-res versions :)
(Would also be cool to have audio clips next to each image as well, but that's less important.)
That seems misleading. First of all, how often do you take a 1024 sample FFT? In theory, you could calculate it every sample, in which case you have 60 pixels, but 48,000 times per second.
Secondly, you can make use of frame-over-frame phase information. If you are looking at signals with mostly periodic content in that 3 kHz band, the phase information can indicate how much the signal in a given band deviates from that band's center frequency.
If the signal is dead on the frequency, then the phase component is stable frame-over-frame; the value does not move. If the signal is off, the phase angle shifts, kind of like a CRT television that is out of vertical sync. Each frame catches the signal in a different phase compared to the previous frame due to the frequency drift. The farther the signal is from the FFT band's frequency, the faster the phase angle rotates.
If you analyze the movement of phase of the same bin between successive frames, you can get a higher resolution estimate of the frequency than what you might think is possible from the 50 Hz resolution of that bin.
What you can't resolve is the situation when multiple independent signals clash into that same frequency bin. The assumption has to hold that the bin has caught one periodic signal.
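Here is a minimal sketch of that phase trick, under stated assumptions: a 48,000 Hz sample rate, 1024-sample frames, a 256-sample hop, a Hann window, and a single clean 1010 Hz sine in the bin. `bin_value` and `refine_frequency` are hypothetical helper names, and only one DFT bin is computed directly to keep it dependency-free.

```python
import cmath
import math

SAMPLE_RATE = 48_000  # assumed sample rate (Hz)
FRAME = 1024          # frame length in samples
HOP = 256             # assumed hop between the two frames, in samples

def bin_value(samples, start, bin_index):
    """Hann-windowed DFT of one frame, evaluated at a single bin."""
    total = 0j
    for k in range(FRAME):
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * k / FRAME)
        total += hann * samples[start + k] * cmath.exp(-2j * math.pi * bin_index * k / FRAME)
    return total

def refine_frequency(samples, bin_index):
    """Estimate the frequency in a bin from its phase advance between two frames."""
    phase_a = cmath.phase(bin_value(samples, 0, bin_index))
    phase_b = cmath.phase(bin_value(samples, HOP, bin_index))
    # Phase advance a signal exactly at the bin centre would show over one hop.
    expected = 2 * math.pi * bin_index * HOP / FRAME
    # Wrap the extra rotation into [-pi, pi).
    deviation = (phase_b - phase_a - expected + math.pi) % (2 * math.pi) - math.pi
    bin_center = bin_index * SAMPLE_RATE / FRAME
    return bin_center + deviation * SAMPLE_RATE / (2 * math.pi * HOP)

true_freq = 1010.0
samples = [math.sin(2 * math.pi * true_freq * n / SAMPLE_RATE) for n in range(FRAME + HOP)]

bin_index = round(true_freq * FRAME / SAMPLE_RATE)
estimate = refine_frequency(samples, bin_index)
print("bin centre:", bin_index * SAMPLE_RATE / FRAME, "Hz")
print("refined estimate:", estimate, "Hz")
```

The hop has to be short enough that the extra rotation stays within half a turn, otherwise the wrapped phase becomes ambiguous. With two independent tones in the same bin, the estimate is no longer meaningful, as noted above.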
For humans it's easier: plenty of studies have been done in that regard, and there is even a separate field of science for studying human sound perception, psychoacoustics. Humans perceive sound in bands (a band is a range of frequencies), not separate frequencies. The width of the bands varies with frequency, so that in the voice range they are narrower than at, for example, high frequencies. The FFT fits very nicely into that picture, and codecs were designed with human perception in mind.
As for animals, I don't know of any studies in that regard. I would assume that their perception works in a very similar way to a human's, at least at the level of mechanics. As for the sensitivity and the size of the bands, as well as the dynamic range, it's hard to say. I'd love to see some studies that dig into the details there, but it seems they are very hard to do. Animals don't give you direct feedback.
[0] https://play.google.com/store/apps/details?id=de.tu_chemnitz..., https://apps.apple.com/us/app/birdnet/id1541842885
huh?
Sounds like a really crappy implementation of CWT. Besides this, the mother wavelet used was not specified, so maybe the author doesn't really know much about CWT.