I really wish someone would make a header-only C++ audio library; that would be soooo nice.
I couldn’t find anything about C++ in that article on a quick scan; feel free to correct me.
The point of the article isn't C++ or why it's a good language for this sort of thing, but I'm a real-time graphics and game engine programmer, so it's my language of choice.
It's probably debatable, but I don't agree with the statement that shortening the "sound" changes pitch. It depends on your representation of the sound. If you represent it as a function of amplitude vs time, then scaling the time axis does change pitch.
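To make the "scaling the time axis changes pitch" case concrete, here is a minimal C++ sketch. The helper names (`makeSine`, `scaleTime`, `estimateFreq`) are made up for illustration and do not come from the article. It assumes mono float samples, uses plain linear interpolation, and estimates frequency crudely by counting rising zero crossings.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kTwoPi = 6.283185307179586;

// Generate `seconds` of a sine tone at `freqHz` (hypothetical helper).
std::vector<float> makeSine(double freqHz, double seconds, double sampleRate) {
    const std::size_t n = static_cast<std::size_t>(seconds * sampleRate);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<float>(std::sin(kTwoPi * freqHz * i / sampleRate));
    return out;
}

// Scale the time axis: speed = 2 reads the input twice as fast,
// so the result is half as long. Uses linear interpolation.
std::vector<float> scaleTime(const std::vector<float>& in, double speed) {
    assert(speed > 0.0);
    if (in.size() < 2) return in;
    const std::size_t n = static_cast<std::size_t>((in.size() - 1) / speed) + 1;
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double pos = i * speed;  // fractional read position in the input
        const std::size_t i0 = std::min(static_cast<std::size_t>(pos), in.size() - 1);
        const std::size_t i1 = std::min(i0 + 1, in.size() - 1);
        const float frac = static_cast<float>(pos - i0);
        out[i] = in[i0] + frac * (in[i1] - in[i0]);
    }
    return out;
}

// Crude pitch estimate: rising zero crossings per second.
double estimateFreq(const std::vector<float>& x, double sampleRate) {
    if (x.empty()) return 0.0;
    std::size_t crossings = 0;
    for (std::size_t i = 1; i < x.size(); ++i)
        if (x[i - 1] < 0.0f && x[i] >= 0.0f) ++crossings;
    return crossings * sampleRate / x.size();
}
```

Calling `scaleTime(makeSine(440.0, 1.0, 44100.0), 2.0)` gives half as many samples, and `estimateFreq` on the result lands near 880 Hz, an octave up. That is the "play it faster" behavior the article starts from.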
The article takes a sensational tone about what is really a fallacy. No instrument plays sound faster or slower to make it shorter or longer.... It just stops playing it or doesn't. If one thinks about the phenomenon this way, it becomes natural why you cannot compress time to play shorter sounds.
> I don't agree with the statement that shortening the "sound" changes pitch. It depends on your representation of the sound. If you represent it as a function of amplitude vs time, then scaling the time axis does change pitch.
The only relevant "representation" is digital audio, which by definition is encoded as amplitude over time regardless of encoding technique. Changing duration without changing pitch, or pitch without changing duration, requires manipulating the audio data. That manipulation is done either by granular synthesis or by using a Fast Fourier Transform to decompose the audio into its component frequencies, changing the frequencies or shortening the components, and recombining them into a composite waveform. This article is about granular synthesis, which requires far less computation than the FFT approach.
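For anyone who wants to see the granular side in code, here is a rough sketch of one way to do it: overlap-add of Hann-windowed grains. This is not the article's code. The function name `granularStretch`, the fixed 50% overlap, and the mono float buffer are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kTwoPi = 6.283185307179586;

// Stretch `in` by `stretch` (2.0 = twice as long) without resampling.
// Grains are read from the input at a smaller hop than they are written
// to the output, so the grains themselves play at their original speed.
std::vector<float> granularStretch(const std::vector<float>& in,
                                   double stretch, std::size_t grainSize) {
    assert(stretch > 0.0);
    assert(grainSize >= 2 && grainSize % 2 == 0);
    if (in.size() < grainSize) return in;

    const std::size_t hopOut = grainSize / 2;  // 50% overlap on the output side
    const double hopIn = hopOut / stretch;     // matching step through the input
    const std::size_t outLen = static_cast<std::size_t>(in.size() * stretch);
    std::vector<float> out(outLen + grainSize, 0.0f);

    for (std::size_t g = 0;; ++g) {
        const std::size_t src = static_cast<std::size_t>(g * hopIn);
        const std::size_t dst = g * hopOut;
        if (src + grainSize > in.size() || dst >= outLen) break;
        for (std::size_t i = 0; i < grainSize; ++i) {
            // Periodic Hann window: overlapping copies at 50% sum to 1.
            const double w = 0.5 - 0.5 * std::cos(kTwoPi * i / grainSize);
            out[dst + i] += static_cast<float>(w * in[src + i]);
        }
    }
    out.resize(outLen);
    return out;
}
```

With `stretch = 2.0` the output is twice as long and the pitch stays in the neighborhood of the original rather than dropping an octave. The grains are not phase-aligned here, so on a pure tone you can get amplitude wobble and slight detuning, which is the kind of artifact the fancier methods try to hide.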
> No instrument plays sound faster or slower to make it shorter or longer....
Irrelevant. We aren't dealing with physical instruments, but with digital audio.
There is nothing in the least fallacious or sensational about this article.
One rough area for curve-fitting is white-noise-esque sounds (WNES) like the letter "s" or "h" and tambourines. The processor can perhaps detect if WNES exceed a threshold, and use other techniques such as SC instead.
It's roughly comparable to JPEG versus GIF images. JPEG is better (more faithful) at gradual shades while GIF is better at edges. A better compression algorithm would perhaps use each where it does best for a given image, though at the cost of algorithm complexity and compression/decompression processing time.
I'm surprised I don't see it mentioned here, but there's a rather interesting extension to this technique made by Paul Nasca[0], which mitigates these artifacts by (1) carefully choosing the size and placement of grains and (2) randomly changing the phase of each grain before recombining. You can see the algorithm here[1].
The results are absolutely incredible. You can end up slowing a sample down by 800% or more with no artifacts. For example, here[2] is the Windows 95 startup sound extended to be a little over 6 minutes long. The reverb you hear isn't added, that's just what it sounds like.
Also, if you didn't notice from the page, it's one of the default plug-ins in Audacity.
[0]: http://www.paulnasca.com/
[1]: http://www.paulnasca.com/algorithms-created-by-me#TOC-PaulSt...
[2]: https://www.youtube.com/watch?v=FsJdplLB1Bs
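Here is a small sketch of step (2) only, the phase randomization of a single grain. It is not Paul Nasca's code. The names `dft` and `randomizePhases` are invented for the example, and a naive O(N^2) DFT stands in for a real FFT to keep it self-contained. Windowing, grain placement, and overlap-add are left out.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

constexpr double kTwoPi = 6.283185307179586;
using Spectrum = std::vector<std::complex<double>>;

// Naive DFT. sign = -1 is the forward transform, +1 the inverse
// (unnormalized; the caller divides by N after the inverse).
Spectrum dft(const Spectrum& x, int sign) {
    const std::size_t n = x.size();
    Spectrum y(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * kTwoPi * ((k * t) % n) / n;
            sum += x[t] * std::polar(1.0, angle);
        }
        y[k] = sum;
    }
    return y;
}

// Keep every frequency bin's magnitude but give it a random phase.
std::vector<double> randomizePhases(const std::vector<double>& grain,
                                    std::mt19937& rng) {
    const std::size_t n = grain.size();
    assert(n >= 2 && n % 2 == 0);
    Spectrum spec = dft(Spectrum(grain.begin(), grain.end()), -1);

    std::uniform_real_distribution<double> phase(0.0, kTwoPi);
    for (std::size_t k = 1; k < n / 2; ++k) {
        spec[k] = std::polar(std::abs(spec[k]), phase(rng));
        spec[n - k] = std::conj(spec[k]);  // mirror bin keeps the output real
    }
    // Bins 0 (DC) and n/2 (Nyquist) are left untouched.

    const Spectrum time = dft(spec, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = time[i].real() / static_cast<double>(n);
    return out;
}
```

The grain that comes back has the same magnitude spectrum as the one that went in, so the frequency content is unchanged, but the waveform itself looks different. Presumably that is why the seams between grains stop lining up into audible periodic artifacts, at the cost of smearing transients.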
The very stretched waveform did contain some audible artifacts, but I think other methods like FFT would introduce some as well.
This kind of trick works because our hearing is frequency-based. So the crucial thing is to preserve the frequencies; if you do, it is going to sound essentially the same.
Spatial mapping of frequencies in the human ear here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2394499/ (see Fig. 5).
Trying this with an image for example wouldn't work, because our vision is sample-based. Imagine splitting an image in tiny fragments and repeating/interpolating them on top of one another.
It's not just about continuity. It also removes an entire set of concerns from the process.
For example, suppose someone analyzes an audio recording, splits it into grains, then does some fancy re-organization based on the timbral content of the recording/grains.
Now suppose they are subjectively unhappy with the result. Perhaps it sounds "wimpy," "fluttery," or some other such vague complaint. Is that sound due to a) their process of re-organizing the grains, b) the quality of the original recording, c) the envelopes they used, or d) something else entirely?
If instead one uses grains which begin and end at zero, the answer can't be c) because there are no envelopes. I can say that the quality sounds fine in the few examples I've heard that use this technique.
I'd imagine the reason the latter isn't used as often is that it's simply more difficult to program if each grain can be an arbitrary size (or at least not quantized).
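For what it's worth, the basic version of the zero-crossing idea is not much code. Below is a rough sketch, not taken from any particular implementation: `Grain`, `splitAtZeroCrossings`, and `repeatGrains` are hypothetical names, and cutting only at rising crossings is my own assumption so that every seam has the same slope direction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open sample range [begin, end) into the source buffer.
struct Grain {
    std::size_t begin;
    std::size_t end;
};

// Cut the signal at rising zero crossings, skipping crossings that would
// make a grain shorter than minLen samples. Grains have arbitrary lengths.
// The first grain starts at sample 0 and the last one ends at the end of
// the buffer, so those two outer edges are not guaranteed to sit at zero.
std::vector<Grain> splitAtZeroCrossings(const std::vector<float>& x,
                                        std::size_t minLen) {
    assert(minLen > 0);
    std::vector<Grain> grains;
    std::size_t start = 0;
    for (std::size_t i = 1; i < x.size(); ++i) {
        const bool rising = x[i - 1] < 0.0f && x[i] >= 0.0f;
        if (rising && i - start >= minLen) {
            grains.push_back({start, i});
            start = i;
        }
    }
    if (start < x.size()) grains.push_back({start, x.size()});
    return grains;
}

// Naive stretch: play every grain `repeats` times in a row.
// No envelope is applied because each seam is already near zero.
std::vector<float> repeatGrains(const std::vector<float>& x,
                                const std::vector<Grain>& grains,
                                std::size_t repeats) {
    std::vector<float> out;
    out.reserve(x.size() * repeats);
    for (const Grain& g : grains) {
        const auto first = x.begin() + static_cast<std::ptrdiff_t>(g.begin);
        const auto last = x.begin() + static_cast<std::ptrdiff_t>(g.end);
        for (std::size_t r = 0; r < repeats; ++r)
            out.insert(out.end(), first, last);
    }
    return out;
}
```

The awkward part shows up right away: grain lengths depend on the signal, so you cannot index grains by a fixed stride, and hitting an exact stretch ratio means choosing which grains to repeat rather than repeating all of them equally.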
Some more famous algorithms that work this way and are similarly easy to implement are TDHS and PSOLA. They all work in the time domain but find different ways to smooth out the discontinuities and to make more extreme shifts sound better.
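As an illustration of "smoothing out the discontinuities" in the time domain, here is one generic ingredient: search for the offset where the next grain best matches the tail of what has already been written, then splice there. To be clear, this is neither TDHS nor PSOLA. It is closer to the waveform-similarity search used in SOLA-style methods, and `bestAlignment` is a made-up name for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Return the shift in [0, maxShift] at which `next` lines up best with
// `tail`, scored by the dot product normalized by the energy of the
// candidate segment of `next`.
std::size_t bestAlignment(const std::vector<float>& tail,
                          const std::vector<float>& next,
                          std::size_t maxShift) {
    assert(!tail.empty());
    assert(next.size() >= tail.size() + maxShift);
    std::size_t best = 0;
    double bestScore = std::numeric_limits<double>::lowest();
    for (std::size_t shift = 0; shift <= maxShift; ++shift) {
        double dot = 0.0;
        double energy = 0.0;
        for (std::size_t i = 0; i < tail.size(); ++i) {
            const double a = tail[i];
            const double b = next[shift + i];
            dot += a * b;
            energy += b * b;
        }
        const double score = energy > 0.0 ? dot / std::sqrt(energy) : 0.0;
        if (score > bestScore) {  // strict: the earliest shift wins ties
            bestScore = score;
            best = shift;
        }
    }
    return best;
}
```

A grain would then be overlapped starting at `next[best]` instead of `next[0]`, so the two waveforms are roughly in phase where they crossfade.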