Don't ask me how they did that, it was close to magic to me at that time, but I'm sure it wasn't neural networks. It probably involved convolution, though, as that is the main tool for building audio filters.
If anyone has more info on the fundamental differences between the neural network approach and the "traditional" one, I'd be thankful.
Say we have mixed three sources, let's call them L (panned hard left), C (center) and R (hard right). Then the left channel has +L+C, and the right channel has +R+C.
Now we phase invert one of them, say the right channel, and combine them. The new mono file is +L+C-R-C. +C-C cancels out and we're left with +L-R.
Since +R and -R essentially sound the same, it sounds as if we had originally done a mono mix of L and R (+L+R).
But we can't combine this with a straight mono conversion (+L+C+R+C) in any way that will remove both L and R. All we can do is reproduce +L+C or +R+C.
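The arithmetic above is easy to check numerically. This is a minimal sketch, assuming three made-up sine tones stand in for the L, C and R sources; the variable names and frequencies are illustrative only.

```python
import numpy as np

sr = 8000                      # sample rate in Hz (arbitrary for this sketch)
t = np.arange(sr) / sr         # one second of sample times

# Hypothetical sources: L hard left, C center, R hard right.
L = np.sin(2 * np.pi * 220 * t)
C = np.sin(2 * np.pi * 440 * t)
R = np.sin(2 * np.pi * 660 * t)

left = L + C                   # left channel:  +L +C
right = R + C                  # right channel: +R +C

side = left - right            # invert right and sum: +L -R, C cancels
mono = left + right            # straight mono sum:    +L +R +2C

def combine(a, b):
    """Any linear mix of the two channels equals a*L + b*R + (a+b)*C."""
    return a * left + b * right
```

Because every combination has the form a*L + b*R + (a+b)*C, removing both L and R forces a = b = 0, which removes C too. That is the limitation the comment describes.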
This is a more sophisticated generalisation of the idea of inverting one channel and averaging in order to isolate the centre-panned vocal (or remove it for karaoke purposes).
It works well with mixes in which stereo placement is entirely by pan pot, adjusting the left/right volume levels of each instrument individually in order to place it on the stereo image. It doesn't work so well with real room recordings, where stereo placement is determined also by timing information (read e.g. https://www.audiocheck.net/audiotests_stereophonicsound.php)
This is a better-specified problem than monaural source separation, which I think is what the original article here is about.
I like thinking of the "stereo image" of a track as having 3 dimensions in physical space. If you imagine looking forward at a stereo system, left to right is the "pan" (how much of a signal is originating from the left vs. right speakers, with 50% each being dead center), volume of each track (or instrument) is how distant (or close) the sound is (you can imagine an individual instrument moving forwards toward you if louder, and vice versa), and the frequency (or pitch) is how high the track is (with lower pitch or frequency being at the floor).
When a mixing engineer "mixes" a track, each track (or instrument) tends to be: 1) adjusted to an appropriate volume relative to other tracks (forward/backward), 2) panned or spread left/right (usually more stereo "width" on the higher frequency sounds), and 3) "equalized" to narrow and manipulate the band of frequencies coming through to the master mix from that track (it's common to cut out some of the harmonics either side of the fundamental of one instrument to leave "space" for another instrument competing for the same frequency range).
So now, even with a final, mastered track, we could apply various filters to de-mix fairly easily if the mixing engineer has left a good amount of "space" in between the elements in this stereo image. If our bass is always under 150 Hz and every other element is above 200 Hz, a simple low pass will grab the bass (and likely also the kick drum, if there is one). But we could also use a "gate" to only allow a signal of a certain magnitude (volume) through to isolate that kick hit, or the inverse to exclude the kick and just get the bass sound. A band-pass could do a decent job on a guitar or vocal track, but will also grab background noise from other tracks sharing the same frequency range. More complicated techniques could be used to isolate things that have been panned left or right of center based on comparison between the left and right stereo channels of the mix.
These techniques won't be as universally useful as the approach in this article, but for certain tracks or sections of tracks, can give very good results very quickly. More tracks and effects like reverb and distortion make all of this more difficult to do with simpler techniques.
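Here is a rough sketch of the low-pass and gate ideas from this comment, using only NumPy. The helper names `fft_lowpass` and `gate` are made up for illustration. The low pass is a crude brick-wall filter that zeroes FFT bins, and the gate works per sample rather than on a smoothed envelope as a real gate would. The "bass" and "guitar" are pure sine tones, which is the idealized case where the sources share no frequencies.

```python
import numpy as np

def fft_lowpass(x, sr, cutoff_hz):
    """Zero every FFT bin above cutoff_hz (a crude brick-wall low pass)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def gate(x, threshold):
    """Pass samples whose absolute value reaches threshold, mute the rest."""
    return np.where(np.abs(x) >= threshold, x, 0.0)

sr = 8000
t = np.arange(sr) / sr
bass = 0.8 * np.sin(2 * np.pi * 80 * t)      # stand-in for a bass under 150 Hz
guitar = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for everything above 200 Hz
mix = bass + guitar

bass_estimate = fft_lowpass(mix, sr, cutoff_hz=150)
guitar_estimate = mix - bass_estimate        # what the low pass left behind
```

With sources this clean the low pass recovers the bass almost exactly. On real material the result depends entirely on how much spectral "space" the mix actually has.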
The reality is, even instruments like kick drum or bass guitar have significant tonal content above 200 Hz, much of it overlapping with the guitar and vocal ranges.
From the description it looks like they used FastICA. https://en.wikipedia.org/wiki/FastICA
The traditional approach is faster and less resource intensive. This approach is just showing that it can be done at this point.
For example, the 2009 work on musical sound separation based on binary time-frequency masking.
Or more recent stuff using deep learning. Also, the field generally prefers ratio masks because they lead to better-sounding output.
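To make the binary vs. ratio mask distinction concrete, here is a toy sketch. The 2x2 "spectrograms" are invented numbers, and the masks are computed from the known sources (so-called oracle masks); in practice a model has to estimate them from the mixture alone. It also assumes magnitudes add in the mixture, which is only approximately true for real audio.

```python
import numpy as np

# Toy magnitude spectrograms: rows are frequency bins, columns are frames.
vocals = np.array([[0.9, 0.1],
                   [0.2, 0.6]])
backing = np.array([[0.1, 0.3],
                    [0.8, 0.6]])
mixture = vocals + backing          # assumption: magnitudes simply add

eps = 1e-12                         # avoids division by zero in silent bins

# Binary mask: each time-frequency cell goes wholly to the louder source.
binary_mask = (vocals > backing).astype(float)

# Ratio mask: each cell is shared in proportion to the source magnitudes.
ratio_mask = vocals / (vocals + backing + eps)

vocals_from_binary = binary_mask * mixture
vocals_from_ratio = ratio_mask * mixture
```

The binary mask throws away every cell where the vocal is not dominant and keeps the full mixture where it is, which is one source of the harsh artifacts. The ratio mask degrades more gracefully in cells where both sources are present.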
I want to say things were different back when we relied more on human librarians in searching for literature, but unfortunately history is full of cases where people independently discovered the same things as well.
https://academia.stackexchange.com/questions/9602/rediscover...
That said, if you are still at the point of inventing new terms for things people have been doing for decades, you are probably being fairly superficial in the area as well.
Research areas like CNN are especially prone to this because it is so much easier to apply the techniques than to understand the problem domain, and it generates a lot of low quality research papers. See also "when all you have is a hammer".
We figured somebody else had for sure done this and kept searching for different keywords until we picked up a bunch of papers with the correct terminology.
There are many, many historical recordings (and modern ones made in less-than-ideal circumstances) that suffer badly from reverb. Seems like a valuable use-case that -ought- to be in reach today.
Reverb elimination can be done without losses, just with distortions depending on the implementation. To do that, one has to recover cepstral[1] coefficients (with NN) and feed them to spectral filters (no NN needed).
This is feasible, provided somebody prepares a training data set consisting of lots of pairs (sound, same_sound_with_reverb), where sound would be a voice, instrument, applause, etc., each with different reverb settings. Very likely you'll have to use enormous sample rates, way beyond 44100, because you're supposed to deal with infinitesimal impulse response... Adds up to hardware requirements.
I feel like I've oversimplified something, but it can be done, just lots of fidgeting with all the training sets and a training process itself.
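One reason the cepstral domain is attractive here: convolution with a room response becomes addition in the cepstrum, so the reverb shows up as a separate term. The sketch below is not the NN approach described above. It only illustrates that property with a single circular echo standing in for reverb, and it cheats by using the dry signal, which a real system would not have. The function name `real_cepstrum` and all parameters are illustrative.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)   # small offset avoids log(0)
    return np.fft.ifft(log_mag).real

rng = np.random.default_rng(0)
n, delay, gain = 4096, 200, 0.5

dry = rng.standard_normal(n)                 # stand-in for a dry recording
wet = dry + gain * np.roll(dry, delay)       # one circular echo as toy "reverb"

# The echo contributes an additive term, so subtracting cepstra isolates it.
echo_term = real_cepstrum(wet) - real_cepstrum(dry)

# Strongest component in the first half sits at the echo delay (in samples).
peak = int(np.argmax(echo_term[1 : n // 2])) + 1
```

Here `peak` comes out at the echo delay, with a height of about gain / 2. A learned model would have to estimate that additive term from the wet signal alone, which is the hard part.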
For instance, Eric Humphrey at Spotify Music Understanding Group describes using a U-Net architecture here: https://medium.com/this-week-in-machine-learning-ai/separati... - paper at http://openaccess.city.ac.uk/19289/1/7bb8d1600fba70dd7940877...
They compare their performance to the widely-cited state of the art Chimera model (Luo 2017): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791533/#R24 with examples at http://naplab.ee.columbia.edu/ivs.html - from the examples, there's significantly less distortion than OP.
Not to discourage OP from doing first-principles research at all! But it's often useful to engage with the larger community and know what's succeeded and failed in the past. This is a problem domain where progress could change the entire creative landscape around derivative works ("mashups" and the like), and interested researchers could do well to look towards collaboration rather than reinventing each others' wheels.
EDIT: The SANE conference has talks by Humphrey and many others available online: https://www.youtube.com/channel/UCsdxfneC1EdPorDUq9_XUJA/vid...
For example, if you Google, FASST is one of the ones that come up, but it's a whole framework, and in order to use it you'd have to learn the research yourself; much of this software is not geared for end users.
Sums up Towards Data Science pretty well.
(Or to put it another way - there is commercial music software released in the last year that lets you do this yourself now.)
https://www.youtube.com/watch?v=kEauVQv2Quc
https://www.izotope.com/en/products/repair-and-edit/rx/music...
That said, I think a deep learning approach will likely do a lot better (and be a lot easier to develop, imo)
Also, check out Google's Magenta project; it aims to use ML in various music / creativity projects.
I personally plan on doing a project that will involve audio source separation as well as sample classification; a good trick for analyzing audio data is to convert it into images (maybe with some additional pre-transformations applied, such as passing it through an audio filter that exaggerates human-perceived properties of sound) and then just use your run-of-the-mill, bog-standard, state-of-the-art image classifiers on the resulting audio spectrogram with some well-chosen training/validation sets.
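The "audio to image" step can be sketched like this. It is a bare-bones log-magnitude STFT in NumPy, not a perceptual transform such as a mel scale, and the function names, frame length and hop size are arbitrary choices for illustration.

```python
import numpy as np

def log_spectrogram(x, frame_len=256, hop=128):
    """Return a (freq_bins, frames) array of log-magnitude STFT values."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return np.log1p(magnitude)              # compress the dynamic range

def to_uint8_image(spec):
    """Scale to 0..255 so the array can be saved or fed to an image model."""
    lo, hi = spec.min(), spec.max()
    return np.round(255 * (spec - lo) / (hi - lo + 1e-12)).astype(np.uint8)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone

spec = log_spectrogram(tone)
image = to_uint8_image(spec)                 # 2-D grayscale "picture" of the audio
```

Each column of `image` is one time frame and each row one frequency bin, so the 1 kHz tone appears as a horizontal line (row 32 with these settings, since each bin spans 31.25 Hz).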
[1] https://github.com/tzutalin/labelImg [2] https://en.wikipedia.org/wiki/Gated_recurrent_unit
Not gonna do that.
I'm not involved with them in any way, but I've been amazed with its ability to cancel out coffee-shop style noise.
Check out https://krisp.ai/technology/ - Mac/Windows. I wish they had Linux support!
Edit: Appears they don't have Windows support yet.
This is both fascinating and horrifying at the same time! I wonder if/when it would be possible to rewrite whole words in real time using a voice that sounds just like you.
This means they're expensive ($300 headphones from Bose, etc.).
Do neural networks make this simpler?
And do you think they can be applied cheaply enough, say for $99 headphones?
I assume this will sell really well, and justify creating a dedicated chip, with time.
Which will give entirely new meaning to 'lip synching'.
Seriously, of all instruments, things like vocals are one of the most heartbreaking to work on and learn. Born male and want to sound like a female singer? You’ll never be able to do that. Same applies for women who wish they had male singing voices. It simply isn’t possible (with the rare exception, I guess).
Or maybe you just don’t like your voice's timbre. You can take lessons for years and learn to sing on pitch, and you can alter your vocal tone, but you can’t control every aspect that gives your voice its unique sound.
I guess sometimes you just have to be happy with what you have.
If you want to see how it's done, it's shared source: https://github.com/GistNoesis/Wisteria/
Thanks
https://ccrma.stanford.edu/~jos/EE201/More_Recent_PhD_EEs_CC...
``Instantaneous and Frequency-Warped Signal Processing Techniques for Auditory Source Separation'' (1994)
A CNN should be able to back that out too, and do other things like regenerate a 3D space. In the right high-fidelity acoustic tracks, there could be enough spatial information to reconstruct a stage and a performance. It would be neat/beautiful/(possibly very powerful) to back video out of audio in that way.
Basically a giant equalizer that allows you to dim or brighten each channel from multiple sources.
Alas so many projects, too little time!