Audio AI: isolating vocals from stereo music using Convolutional Neural Networks

(towardsdatascience.com)

275 points · turbohz · 7y ago · 72 comments

bsaul · 7y ago · 15 in thread

I used to work in an audio processing research center back in 2003, and colleagues next to me were able to isolate each instrument in a stereo mix live using the fact that they were "placed" in different spots in the stereo plane.

Don't ask me how they did that, it was close to magic to me at that time, but I'm sure it wasn't neural networks. Although it probably involved convolution, as it is the main tool for producing audio filters.

If anyone has more info on the fundamental differences of the neural network approach compared to the "traditional" one, I'd be thankful.

derekp7 · 7y ago

Back when I was a teen, I used to strip out the vocal tracks on stereo music by disconnecting the speaker ground wires from the back of the amplifier and connecting them to each other (so each speaker would ground against the other one). Since the vocal was "center", it had the same waveform on both speakers, so the speakers couldn't make a sound if they couldn't dump the ground to something. And the instrumentals weren't distorted too much. At least it worked well enough for a cheap karaoke setup.

robbrown451 · 7y ago

I discovered this in the 80s when the wires to my car stereo speakers came loose and suddenly the vocal or lead instrument was missing from the music. Really puzzled me for a bit until I found the problem.

redsky17 · 7y ago

You can generally do this pretty closely digitally by taking the left and right channels of a stereo track and inverting the phase on one of them. Since the vocals are usually panned center (as you said), the inverted phase ends up destructively cancelling them. Depending on how the instrumentation was mixed, the instruments are usually left pretty much intact.

heywire · 7y ago

I discovered something similar by accident, only with the line audio cables between my tape/CD and the amplifier. I would pull the plugs out just enough so the tip made connection but not the ring. It would give the same effect.

tomc1985 · 7y ago

There's a trick you can do to isolate vocals from some music by essentially flipping one of the stereo channels and combining their waveforms. All the stereo data cancels out and you're left with anything not panned hard center. Recombine that with the original stereo file converted to mono and you then get the vocals, usually a bunch of cruft from the reverb, and anything else panned hard center.

urvader · 7y ago

Most smartphones shift phases on the left/right channels to make it sound better in headphones (simulating bigger room), this means you can combine the two channels (by connecting the positive wires or just wiggle a little in the headphone jack half way in) to extract the vocal track. Not as fancy as a CNN but I’m guessing it would be better to do something with the stereo information in preprocessing before training.

tobr · 7y ago

Removing the center is possible, but how exactly would you "recombine" it with the original to keep the center? Correct me if I'm wrong, but I don't think the math works out like that.

Say we have mixed three sources, let's call them L (panned hard left), C (center) and R (hard right). Then the left channel has +L+C, and the right channel has +R+C.

Now we phase invert one of them, say the right channel, and combine them. The new mono file is +L+C-R-C. +C-C cancels out and we're left with +L-R.

Since +R and -R essentially sound the same, it sounds as if we had originally done a mono mix of L and R (+L+R).

But we can't combine this with a straight mono conversion (+L+C+R+C) in any way that will remove both L and R. All we can do is reproduce +L+C or +R+C.
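The algebra above can be checked numerically. This is only a sketch: the three sources are short, made-up sample lists, and `combine` is a hypothetical helper standing in for "any way of adding the two channels together".

```python
# Three hypothetical sources as short sample lists (values are made up).
L = [0.5, -0.2, 0.1, 0.4]    # panned hard left
C = [0.3, 0.3, -0.6, 0.2]    # panned center
R = [-0.1, 0.7, 0.2, -0.5]   # panned hard right

left = [l + c for l, c in zip(L, C)]    # left channel:  +L+C
right = [r + c for r, c in zip(R, C)]   # right channel: +R+C

def combine(a, b):
    """Any linear mix of the two channels: a*left + b*right = a*L + b*R + (a+b)*C."""
    return [a * x + b * y for x, y in zip(left, right)]

def close(xs, ys, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(xs, ys))

side = combine(1, -1)   # invert right and sum: +L+C-R-C = +L-R, center is gone
mono = combine(1, 1)    # plain mono conversion: +L+R+2C
```

Because every combination has the form a*L + b*R + (a+b)*C, removing both L and R requires a = b = 0, which removes C as well. Adding `side` to `mono`, for example, only gives back twice the left channel.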

skykooler · 7y ago

Interestingly, with a Nexus 5 and a certain pair of earbuds, this seems to happen automatically (minus the last step) - with most songs, I can hear the backing tracks, but no vocals. The earbuds are passive components, so I assume there's something going on electrically that's combining the stereo channels somehow.

bsmith · 7y ago

This works especially well for pop tunes, where the vocal track is by far the most prominent and is almost universally panned dead-center in the mix.

cannam · 7y ago

Almost certainly something like "Real-time Sound Source Separation: Azimuth Discrimination and Resynthesis" (Barry et al, https://arrow.dit.ie/argcon/35/)

This is a more sophisticated generalisation of the idea of inverting one channel and averaging in order to isolate the centre-panned vocal (or remove it for karaoke purposes).

It works well with mixes in which stereo placement is entirely by pan pot, adjusting the left/right volume levels of each instrument individually in order to place it on the stereo image. It doesn't work so well with real room recordings, where stereo placement is determined also by timing information (read e.g. https://www.audiocheck.net/audiotests_stereophonicsound.php)

This is a better-specified problem than monaural source separation, which I think is what the original article here is about.
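To make the "generalisation of inverting one channel" idea concrete, here is a minimal sketch of cancelling a source at an arbitrary pan position by scaling the channels before subtracting. It is not the method from the cited paper, only the pan-pot cancellation idea; the signals, the gain pairs, and the `null_source` helper are all assumptions made up for illustration.

```python
# Hypothetical two-source mix; each source has a (left_gain, right_gain) pair
# set by a pan pot. All numbers are made up.
vocal = [0.2, -0.4, 0.6, 0.1]
guitar = [0.5, 0.5, -0.3, -0.7]
vocal_gains = (0.5, 0.5)     # centre
guitar_gains = (0.8, 0.2)    # panned towards the left

def mix(channel):
    """Build one output channel (0 = left, 1 = right) from both sources."""
    return [vocal_gains[channel] * v + guitar_gains[channel] * g
            for v, g in zip(vocal, guitar)]

left, right = mix(0), mix(1)

def null_source(left, right, gains):
    """Cancel the source with pan gains `gains` by cross-scaling the channels."""
    g_left, g_right = gains
    return [g_right * l - g_left * r for l, r in zip(left, right)]

# The guitar terms cancel (0.2*0.8 - 0.8*0.2 = 0), leaving a scaled vocal.
no_guitar = null_source(left, right, guitar_gains)
```

With gains of (0.5, 0.5) this reduces to the plain "invert one channel and sum" trick. With more than two sources, nulling one still leaves a mixture of the others, and none of this helps when placement comes from timing differences rather than level differences.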

bsmith · 7y ago

I'm a fairly n00b-level hobbyist music producer, but I can take a stab.

I like thinking of the "stereo image" of a track as having 3 dimensions in physical space. If you imagine looking forward at a stereo system, left to right is the "pan" (how much of a signal is originating from the left vs. right speakers, with 50% each being dead center), volume of each track (or instrument) is how distant (or close) the sound is (you can imagine an individual instrument moving forwards toward you if louder, and vice versa), and the frequency (or pitch) is how high the track is (with lower pitch or frequency being at the floor).

When a mixing engineer "mixes" a track, each track (or instrument) tends to be: 1) adjusted to an appropriate volume relative to other tracks (forward/backward), 2) panned or spread left/right (usually more stereo "width" on the higher frequency sounds), and 3) "equalized" to narrow and manipulate the band of frequencies coming through to the master mix from that track (it's common to cut out some of the harmonics either side of the fundamental of one instrument to leave "space" for another instrument competing for the same frequency range).

So now, even with a final, mastered track, we could apply various filters to de-mix fairly easily if the mixing engineer has left a good amount of "space" in between the elements in this stereo image. If our bass is always under 150 Hz and every other element is above 200 Hz, a simple low pass will grab the bass (and likely also the kick drum, if there is one). But we could also use a "gate" to only allow a signal of a certain magnitude (volume) through to isolate that kick hit, or the inverse to exclude the kick and just get the bass sound. A band-pass could do a decent job on a guitar or vocal track, but will also grab background noise from other tracks sharing the same frequency range. More complicated techniques could be used to isolate things that have been panned left or right of center based on comparison between the left and right stereo channels of the mix.

These techniques won't be as universally useful as the approach in this article, but for certain tracks or sections of tracks, can give very good results very quickly. More tracks and effects like reverb and distortion make all of this more difficult to do with simpler techniques.
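The low-pass and gate ideas above can be sketched in a few lines. This is a toy under stated assumptions: an 8 kHz sample rate, pure sine tones standing in for "bass" and "lead", a one-pole filter (far gentler than a real EQ), and a hard gate with no attack or release. All function names are made up for the example.

```python
import math

RATE = 8000  # samples per second (assumed)

def sine(freq_hz, seconds=0.5, amp=1.0):
    n = int(RATE * seconds)
    return [amp * math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n)]

def low_pass(samples, cutoff_hz):
    """One-pole low-pass filter: a gentle 6 dB/octave roll-off above cutoff_hz."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / RATE)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)   # move the output a fraction of the way to the input
        out.append(y)
    return out

def gate(samples, threshold):
    """Hard gate: keep samples whose magnitude reaches the threshold, mute the rest."""
    return [x if abs(x) >= threshold else 0.0 for x in samples]

def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))

bass = sine(80)      # stand-in for a bass part below 150 Hz
lead = sine(2000)    # stand-in for everything else
mix = [b + l for b, l in zip(bass, lead)]
mostly_bass = low_pass(mix, cutoff_hz=150)
```

On these idealized tones the 80 Hz part passes nearly untouched while the 2 kHz part is strongly attenuated. Real instruments have harmonics that spill across such boundaries, so results on actual mixes will be much less clean.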

vonseel · 7y ago

I think you’re overestimating how independent each instrument's section of the frequency spectrum is, even after mixing engineers cut EQ to make space like you said.

The reality is, even instruments like kick drum or bass guitar have significant tonal content above 200 Hz, much of it overlapping with the guitar and vocal ranges.

nabla9 · 7y ago

>using the fact that they were "placed" in different spots in the stereo plane.

From the description it looks like they used FastICA. https://en.wikipedia.org/wiki/FastICA

The traditional approach is faster and less resource-intensive. This approach is just showing that it can be done at this point.

hooloovoo_zoo · 7y ago

Historically, one approach has been independent component analysis (https://en.wikipedia.org/wiki/Independent_component_analysis).

meatsock · 7y ago

you can compare the L and R channels to each other to separate signals out of a stereo field. https://cycling74.com/forums/separating-stereo-segments-poss...

emcq · 7y ago · 7 in thread

What motivates people to invent phrases like "perceptual binarization" when googling "audio binary mask" literally gives you citations in the field that have been doing this for years?

For example, the 2009 paper "Musical Sound Separation Based on Binary Time-Frequency Masking".

Or more recent stuff using deep learning. Also the field generally prefers ratio masks because they lead to better-sounding output.
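For readers unfamiliar with the two kinds of mask, here is a minimal sketch of the difference. The magnitudes are made-up numbers for four time-frequency bins, the mixture is assumed to be the plain sum of the two magnitudes, and the masks are computed from the known sources (in practice a model has to estimate them from the mixture alone).

```python
# Hypothetical magnitudes for four time-frequency bins (made-up numbers).
vocal_mag = [0.9, 0.1, 0.5, 0.0]
backing_mag = [0.1, 0.7, 0.5, 0.3]
mixture_mag = [v + b for v, b in zip(vocal_mag, backing_mag)]  # assumes magnitudes add

def binary_mask(target, other):
    """1 where the target strictly dominates the bin, else 0."""
    return [1.0 if t > o else 0.0 for t, o in zip(target, other)]

def ratio_mask(target, other, eps=1e-12):
    """Fraction of each bin's magnitude attributed to the target, between 0 and 1."""
    return [t / (t + o + eps) for t, o in zip(target, other)]

def apply_mask(mask, mixture):
    return [m * x for m, x in zip(mask, mixture)]

hard = apply_mask(binary_mask(vocal_mag, backing_mag), mixture_mag)
soft = apply_mask(ratio_mask(vocal_mag, backing_mag), mixture_mag)
```

The binary mask keeps or discards whole bins, so a bin shared equally by both sources is lost entirely. The ratio mask splits shared bins proportionally, which under the additive assumption here recovers the vocal magnitudes.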

avian · 7y ago

I know from my own experience that it's possible to dig yourself quite deep into some niche research field without realizing that there's an existing body of knowledge about it. If neither you nor anybody else in your research circle knows the right keywords to enter into search fields, it's really easy to overlook piles of published papers.

I want to say things were different back when we relied more on human librarians in searching for literature, but unfortunately history is full of cases where people independently discovered the same things as well.

brianberns · 7y ago

One extremely amusing/alarming example of this: “A Mathematical Model for the Determination of Total Area Under Glucose Tolerance and Other Metabolic Curves”

https://academia.stackexchange.com/questions/9602/rediscover...

ska · 7y ago

My approach to avoid this is always to try and find a recent and well-written Master's or Ph.D. thesis in the area. You can't always find them of course, but if you do they tend to have pretty good context and a more detailed bibliography than you'll find elsewhere.

That said, if you are still at the point of inventing new terms for things people have been doing for decades, you are probably being fairly superficial in the area as well.

Research areas like CNNs are especially prone to this because it is so much easier to apply the techniques than to understand the problem domain, and it generates a lot of low-quality research papers. See also "when all you have is a hammer".

evanweaver · 7y ago

Early on building https://faunadb.com/, it took us about a year to discover that the literature referred to "historical" or "time travel" queries as temporality. Other startups in our space were also using homemade jargon.

We figured somebody else had for sure done this and kept searching for different keywords until we picked up a bunch of papers with the correct terminology.

sorryforthethro · 7y ago

Also, the time investment to learn a congruent field's jargon is often much greater than just making up words and letting the internet peanut gallery sort out the synonyms for you.

coredog64 · 7y ago

Maybe this is a Cunningham’s Law type attempt at finding prior art for a new patent.

zxcvvcxz · 7y ago

If you look smart and inventive you appear to have higher social status. Helps nerds impress women I reckon.

8bitsrule · 7y ago · 4 in thread

Question: has any progress been made in removing reverb?

There are many, many historical recordings (and modern ones made in less-than-ideal circumstances) that suffer badly from reverb. Seems like a valuable use-case that -ought- to be in reach today.

marzell · 7y ago

I don't have an answer for this. However, since they do have effective blur reduction/elimination techniques for visual images, I imagine that with enough resources we are not far from reverb/echo reduction in audio.

Eli_P · 7y ago

Blur elimination is usually done with an unsharp mask, which works by blurring the raster even more and comparing it to the original. The output makes the edges sharper, but some information is lost anyway.
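The unsharp mask described above can be sketched in one dimension, treating a list of numbers as one row of pixels. The box blur, the `amount` parameter, and the sample row are assumptions chosen to keep the example small; image editors typically use a Gaussian blur and work in 2D.

```python
def box_blur(signal, radius=1):
    """Average each sample with its neighbours (edges use the samples available)."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def unsharp_mask(signal, amount=1.0, radius=1):
    """Add back the difference between the signal and its blurred copy."""
    blurred = box_blur(signal, radius)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]

edge = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # one row of pixels crossing an edge
sharpened = unsharp_mask(edge)
```

The result undershoots just before the edge and overshoots just after it, which is what makes the edge look crisper. Flat regions are unchanged, and nothing that the original blur removed is actually restored.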

Reverb elimination can be done without losses, just with distortions depending on the implementation. To do that, one has to recover cepstral[1] coefficients (with NN) and feed them to spectral filters (no NN needed).

This is feasible, provided somebody prepares a training data set consisting of lots of pairs (sound, same_sound_with_reverb), where sound would be a voice, instrument, applause, etc., each with different reverb settings. Very likely you'll have to use enormous sample rates, way beyond 44100, because you're supposed to deal with infinitesimal impulse response... Adds up to hardware requirements.

I feel like I've oversimplified something, but it can be done, just lots of fidgeting with all the training sets and a training process itself.

[1] https://en.wikipedia.org/wiki/Cepstrum

grav · 7y ago

iZotope RX 7, mentioned in a comment above, has something called de-reverb. I didn't try that particular feature, but I did try the music rebalance feature, and it's pretty impressive.

8bitsrule · 7y ago

Thanks. I'd call that progress. (Wasn't aware of this 'industry standard audio repair tool'. From WPedia, looks like that just arrived last Sept.)

btown · 7y ago · 3 in thread

This is an awesome project, but it seems it was done without reference to academic literature on source separation. In fact, people have been doing audio source separation for years with neural networks.

For instance, Eric Humphrey at Spotify Music Understanding Group describes using a U-Net architecture here: https://medium.com/this-week-in-machine-learning-ai/separati... - paper at http://openaccess.city.ac.uk/19289/1/7bb8d1600fba70dd7940877...

They compare their performance to the widely-cited state of the art Chimera model (Luo 2017): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791533/#R24 with examples at http://naplab.ee.columbia.edu/ivs.html - from the examples, there's significantly less distortion than OP.

Not to discourage OP from doing first-principles research at all! But it's often useful to engage with the larger community and know what's succeeded and failed in the past. This is a problem domain where progress could change the entire creative landscape around derivative works ("mashups" and the like), and interested researchers could do well to look towards collaboration rather than reinventing each others' wheels.

EDIT: The SANE conference has talks by Humphrey and many others available online: https://www.youtube.com/channel/UCsdxfneC1EdPorDUq9_XUJA/vid...

musicale · 7y ago

People have also been doing audio source separation effectively for years without neural networks.

calf · 7y ago

It's interesting because I have a recording of human voices plus a background TV show that was too loud; I've looked around for something that would be able to separate the two but I haven't found a straightforward solution.

For example, if you Google, then FASST is one of the ones that come up, but it's a whole framework and in order to use it you'd have to learn the research yourself; much of this software is not geared for end users.

wodenokoto · 7y ago

> but it seems it was done without reference to academic literature on source separation

Sums up Towards Data Science pretty well.

SyneRyder · 7y ago · 3 in thread

Does anyone know if this is related to the new iZotope RX 7 vocal isolation & stemming tools? It does seem to be talking about something similar, especially when it mentions using the same technique to split a song into instrument stems.

(Or to put it another way - there is commercial music software released in the last year that lets you do this yourself now.)

https://www.youtube.com/watch?v=kEauVQv2Quc

https://www.izotope.com/en/products/repair-and-edit/rx/music...

pizza · 7y ago

Going back further, X-Tracks did this ~5 years ago https://vimeo.com/107971872

That said, I think a deep learning approach will likely do a lot better (and be a lot easier to develop, imo)

Also, check out Google's Magenta project; it aims to use ML in various music / creativity projects.

I personally plan on doing a project that will involve audio source separation as well as sample classification; a good trick for analyzing audio data is to convert it into images (maybe with some additional pre-transformations applied, such as passing it through an audio filter that exaggerates human-perceived properties of sound) and then just use your run-of-the-mill, bog-standard, state-of-the-art image classifiers on the resulting audio spectrogram with some well-chosen training/validation sets.
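As a rough illustration of the "convert audio into an image" step, here is a minimal magnitude spectrogram built with a naive DFT. The frame size, hop size, Hann window, and test tone are all assumptions chosen to keep the example tiny; a real pipeline would use an FFT library and often a log or mel scaling before handing the result to an image model.

```python
import cmath
import math

def spectrogram(samples, frame_size=32, hop=16):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]            # Hann window
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(frame_size // 2 + 1):         # non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))    # naive DFT, O(N^2) per frame
            row.append(abs(acc))
        rows.append(row)
    return rows

# A test tone that sits exactly on bin 4 of a 32-point frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
image = spectrogram(tone)
```

Each row of `image` is one time slice and each column one frequency bin, so the 2D list can be treated like a grayscale picture. For the steady test tone, every row has its brightest "pixel" in the same column.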

Eli_P · 7y ago

That's interesting, are you going to make something like LabelImg[1]? I've been looking for something like that for audio, yet I'm not sure about treating audio as images. I've heard of this trick, but I'd expect NNs for audio to work better with an RNN, GRU[2], or maybe LSTM, whereas images are processed with CNNs.

[1] https://github.com/tzutalin/labelImg [2] https://en.wikipedia.org/wiki/Gated_recurrent_unit

ambicapter · 7y ago

Do you have to convert it into an image? What is it about classifiers that requires image input? I've always found it very cool that audio compression and image compression end up using similar frequency-space techniques sometimes.

sytelus · 7y ago · 1 in thread

Please stop publishing on Medium. I'm getting the error "You read a lot. We like that. You’ve reached the end of your free member preview for this month. Become a member now for $5/month to read this story".

Not gonna do that.

computerex · 7y ago

Share this sparingly lest it gets "fixed" ;) https://outline.com/SauXTY

switchbak · 7y ago · 1 in thread

Just wanted to mention there are some folks doing realtime source separation (not sure exactly how they've implemented it) with a DNN for reduction of background noise in, e.g., Skype conversations.

I'm not involved with them in any way, but I've been amazed with its ability to cancel out coffee-shop style noise.

Check out https://krisp.ai/technology/ - Mac/Windows. I wish they had Linux support!

Edit: Appears they don't have Windows support yet.

brucemoose · 7y ago

> Uninterrupted Voice: The same krispNet DNN, trained on hundreds of hours of customized data, is able to perform Packet Loss Concealment (predicting lost network packets) for audio and fill out missing voice chunks by eliminating "chopping" in voice calls.

This is both fascinating and horrifying at the same time! I wonder if/when it would be possible to rewrite whole words in real time using a voice that sounds just like you.

petra · 7y ago · 1 in thread

Question: Currently, building earphones with great active noise cancellation is a secret kept within a few companies.

This means they're expensive ($300 headphones from Bose etc).

Do neural networks make this simpler?

And do you think they can be applied cheaply enough, say for $99 headphones?

I assume this will sell really well, and justify creating a dedicated chip, with time.

sonnyblarney · 7y ago

Doing any kind of neural net anything in realtime is usually not possible due to processing power requirements.

sonnyblarney · 7y ago · 1 in thread

Soon enough there will be an AI filter that will take any old hacky, coughing, wheezing singer running around on stage, singing out of tune - and turn it into virtuoso chops. Maybe even derived from their own voice.

Which will give entirely new meaning to 'lip synching'.

vonseel · 7y ago

I can’t wait.

Seriously, of all instruments, vocals are one of the most heartbreaking to work on and learn. Born male and want to sound like a female singer? You’ll never be able to do that. The same applies to women who wish they had male singing voices. It simply isn’t possible (with the rare exception, I guess).

Or maybe you just don’t like your voice’s timbre. You can take lessons for years and learn to sing on pitch, and you can alter your vocal tone, but you can’t control every aspect that gives your voice its unique sound.

I guess sometimes you just have to be happy with what you have.

GistNoesis · 7y ago

Hello, a little self-promotion: you can see our experiment with some deep neural networks doing real-time audio processing in the browser, using tensorflow.js

http://gistnoesis.github.io/

If you want to see how it's done, it's shared source: https://github.com/GistNoesis/Wisteria/

Thanks

tasty_freeze · 7y ago

Trivia: Avery Wang, the guy who invented the Shazam algorithm and was their CTO, did his PhD thesis on this topic:

https://ccrma.stanford.edu/~jos/EE201/More_Recent_PhD_EEs_CC...

"Instantaneous and Frequency-Warped Signal Processing Techniques for Auditory Source Separation" (1994)

En_gr_Student · 7y ago

There is a lagged autoregressive technique used in forensic analysis that allows 3d reconstruction using 1d (mic) sound.

A CNN should be able to back that out too, and do other things like regenerate a 3d space. In the right high-fidelity acoustic tracks there could be enough spatial information to reconstruct a stage and a performance. It would be neat/beautiful/(possibly very powerful) to back video out of audio in that way.

plaidfuji · 7y ago

The presentation of this project alone is a visual tour de force, to say nothing of the technical quality. Beautiful and easily digestible post. As with any interesting, non-toy applied ML problem, the dataset generation is really where the innovation is. It gets a neat little graphic at the end. As far as how the author characterizes the problem, I think the term he's looking for is "semantic segmentation": he's trying to classify each pixel of the spectrogram as vocal/non-vocal. I'd be curious if he could drop the dataset into pix2pix-style networks and achieve the same results.

syntaxing · 7y ago

Clicked into the article because I was curious how the training set was created. Using the acapella version is an amazing idea! I wish the article went more in-depth about this section.

Animats · 7y ago

Is it possible yet to take a recording of singing and generate a model of the singer for synthesis, like a Vocaloid?

samstave · 7y ago

A fun thing to do with this would be to slurp the lyrics from one song - the beats from another, some other stream from another and remix the “threads” together into something new.

Basically a giant equalizer that allows you to dim or brighten each channel from multiple sources.

smrtinsert · 7y ago

I've found this project to be very useful if you want access to something like what the article describes. http://isse.sourceforge.net

canada_dry · 7y ago

I'd like to try using this kinda thing to build an automated Beat Saber map. The ability to orchestrate the beats very specifically would make for excellent mappings.

Alas so many projects, too little time!

dharma1 · 7y ago

Sounds pretty good but exhibits the same artifacts/phasing that I've heard with other source separation. Good for forensics etc., but I wouldn't use this for music production.

jtbayly · 7y ago

There was a similar demo (I think from Google) here on HN sometime last year that was far more impressive. I can't seem to find it though. Anybody know what it was?

exabrial · 7y ago

Are there any hearing aid manufacturers taking this approach? Quite incredible.
