WaveNet: A Generative Model for Raw Audio

(deepmind.com)

627 points · benanne · 9y ago · 145 comments

104 comments · 29 top-level

erichocean9y ago· 16 in thread

This can be used to implement seamless voice performance transfer from one speaker to another:

1. Train a WaveNet with the source speaker.

2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.

3. Record raw audio from the source speaker.

Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.

To do that, usually you convert all numbers in the forward renderer into Dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and what not).

4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options.)

5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.

Result: The resulting raw audio will have the same overall performance and speech as the source speaker, but rendered completely naturally in the target speaker's voice.
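A minimal sketch of the "recover the inputs" idea in steps 3 and 4, under loud assumptions: `render` below is a hypothetical one-input toy function standing in for a forward pass, not WaveNet, and the `Dual` class only supports the few operations the toy needs. Dual numbers carry the derivative of the output with respect to the input, and plain gradient descent uses that derivative to search for an input that reproduces a target output. For a realistic model this kind of search can stall in local minima and need not return the original input.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """Dual number val + der*eps with eps**2 == 0; `der` carries the derivative."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def render(x):
    """Hypothetical toy 'renderer': x**3 + 2x, smooth and monotonic."""
    return x * x * x + 2 * x


def recover_input(target, x0=0.0, lr=0.005, steps=2000):
    """Gradient descent on the squared error (render(x) - target)**2."""
    x = x0
    for _ in range(steps):
        err = render(Dual(x, 1.0)) - target  # seed dx/dx = 1
        loss = err * err                     # loss.der is d(loss)/dx
        x -= lr * loss.der
    return x


# "Rendered output" produced by an input we pretend not to know.
recovered = recover_input(target=render(1.5))
```

The toy is deliberately monotonic so that exactly one input matches the target; with a many-to-one renderer the search would return some matching input, not necessarily the one that was used.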

itcrowd9y ago

Another fun fact: this actually happens with (cell) phone calls.

You don't send your speech over the line, instead you send some parameters over the line which are then, at the receiving end, fed into a white(-ish) noise generator to recover the speech.

Edit: not by using a neural net or deep learning, of course.

JonnieCache9y ago

In case anyone is wondering, the technique is called linear predictive coding.
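A rough sketch of the idea, assuming the textbook autocorrelation method with the Levinson-Durbin recursion (real phone codecs add framing, quantization and pitch/excitation modelling, none of which is shown). The function names and the two-coefficient test filter are made up for illustration. The "parameters sent over the line" are the predictor coefficients; the receiver drives an all-pole filter built from them with an excitation signal.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin: coefficients a[1..order] with x[n] ~ sum_k a[k]*x[n-k]."""
    r = autocorr(x, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k                 # remaining prediction error
    return a[1:]


def synthesize(coeffs, excitation):
    """Receiver side: run the excitation through the all-pole filter."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(c * out[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(pred + e)
    return out


# Toy "speech": impulse response of a known, stable two-pole filter.
true_coeffs = [1.3, -0.6]
excitation = [1.0] + [0.0] * 199
signal = synthesize(true_coeffs, excitation)

estimated = lpc(signal, order=2)            # what the sender would transmit
rebuilt = synthesize(estimated, excitation)  # what the receiver reconstructs
```

Here 200 samples are summarized by two coefficients plus a description of the excitation, which is where the bandwidth saving comes from.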

VikingCoder9y ago

What's the difference in bandwidth?

svantana9y ago

> Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse"

Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That's the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).

I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse, and inference is solved.

shoo9y ago

This also makes me think of "inverse problems", in the context of mathematics, physics.

E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.

The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.

Inverse problems are typically much harder to deal with, and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation aka introducing strong structural assumptions about what the expected solution should be like. This can be quite reasonable from a Bayesian perspective.

https://en.wikipedia.org/wiki/Inverse_problem#Mathematical_c...

romaniv9y ago

Maybe I'm reading this paper incorrectly, but it seems that in this system "voice" is part of the model parameters not inputs. What they did was train the same model with multiple reader voices while using one of the inputs to keep track of which voice the model was currently trained on. So the model can switch between different voices, but only between those which it was trained on.

"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."

Am I missing something?
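For reference, a tiny sketch of what "one-hot speaker ID" conditioning means, assuming nothing beyond the quoted passage (the helper name and the index 42 are arbitrary). The vector has one slot per training speaker, so there is simply no way to encode a speaker outside that set.

```python
NUM_SPEAKERS = 109  # number of speakers given in the quoted passage


def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at `index`."""
    if not 0 <= index < num_classes:
        raise ValueError("speaker index out of range")
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


speaker_vec = one_hot(42, NUM_SPEAKERS)  # conditioning input for speaker #42
```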

erichocean9y ago

These are the "inputs" I'm talking about recovering (from the link):

"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."

The raw audio from Step 3 was (in principle) generated by that input on a properly trained WaveNet. We need to recover that so we can transfer it to the target WaveNet.

How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.

swsieber9y ago

Oh, pair this with facial mapping[1] and you pretty much have an "impersonate any famous person" system.

[1] http://www.graphics.stanford.edu/~niessner/thies2016face.htm...

erichocean9y ago

Yup, I work in virtual filmmaking and there are tons of ways to use this stuff.

I give us 10-15 years before it's not possible to trust anything you see or hear that's recorded.

zardo9y ago

Basically the same idea as style transfer with image algorithms. Looking forward to Abraham Lincoln reading audiobooks to me.

infinite8s9y ago

That would require audio recordings of Abraham Lincoln's voice. Not sure recording technology existed back then.

dhammack9y ago

It seems like you're using WaveNet to do speech-to-text when we have better tools for that. To transfer text from Trump to Clinton, first run speech-to-text on Trump speech and then give that to a WaveNet trained on Clinton to generate speech that sounds like her but says the same thing as Trump.

erichocean9y ago

> It seems like you're using WaveNet to do speech-to-text

I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).

In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.

creshal9y ago

Sounds like a very fancy way to do compression with a massive custom dictionary.

posterboy9y ago

Thanks for the tl;dr. However, the fun fact is not true for surjective functions, IIRC, in which case multiple inputs may relate to one output, if this is relevant for WaveNets.

mdup9y ago

Nitpicking: surjective functions do not relate to uniqueness of outputs; you'd rather talk about non-injective functions. I agree with your point, though.

(surjective != non-injective, in the same way that non-increasing != decreasing)
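A small illustration of the distinction on finite sets (the helper names and example functions are just for this sketch). Non-injective means two inputs share an output, which is what blocks a unique "reverse"; surjective only says every element of the codomain is hit.

```python
def is_injective(f, domain):
    """True if no two inputs in `domain` map to the same output."""
    images = [f(x) for x in domain]
    return len(set(images)) == len(images)


def is_surjective(f, domain, codomain):
    """True if every element of `codomain` is the image of some input."""
    return set(codomain) <= {f(x) for x in domain}


def square(x):
    return x * x


def double(x):
    return 2 * x


domain = [-2, -1, 0, 1, 2]

# square: surjective onto {0, 1, 4} but not injective (-2 and 2 both give 4).
# double: injective, but not surjective onto -4..4 (odd values are never hit).
square_props = (is_injective(square, domain), is_surjective(square, domain, [0, 1, 4]))
double_props = (is_injective(double, domain), is_surjective(double, domain, range(-4, 5)))
```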

augustl9y ago· 12 in thread

The music examples are utterly fascinating. It sounds insanely natural.

The only thing I can hear that sounds unnatural, is the way that the reverberation in the room (the "echo") immediately gets lower when the raw piano sound itself gets lower. In a real room, if you produce a loud sound and immediately after a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.

To my ears, this is most prevalent in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.

Regardless, the piano sounds completely natural to me, I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!

There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.

TheOtherHobbes9y ago

I can hear some distortion in the piano notes - which may be an audio compression artefact, or it may be the output of the resynthesis process.

If you train NNs at the phrase level and overfit, then you get something that is indeed more or less the same as cross-fading at random between short sections.

Piano music is very idiomatic, so you'll capture some typical piano gestures that way.

But I'd be surprised if the music stays listenable for long. Classical music has big structures, and there's a difference between recognising letters (notes), recognising phrases (short sentences), recognising paragraphs (phrase structures), and parsing an entire piece (a novel or short story with characters and multiple plot lines.)

Corpus methods don't work very well for non-trivial music, because there's surprisingly little consistency at the more complex levels.

NN synthesis could be an interesting thing though. If you trained an NN on $sounds$ at various pitches and velocity levels, you might be able to squeeze a large and complex collection of samples into a compressed data set.

Even if the output isn't very realistic, you'd still get something unusual and interesting.

Scaevolus9y ago

The samples are uncompressed WAV files, so everything you hear is a direct result of the synthesis process. Some of the distortion is a result of the 16kHz sample rate-- it's not 44.1kHz CD quality.

DarkTree9y ago

It shot me forward to a time where people just click a button to generate music they want to listen to. If you really like the generation, you save it and share it. It wouldn't have all of the other aspects that we derive from human-produced music like soul/emotion (because we know it's coming from a human, not because of how it sounds), but it would be a cool application idea anyway.

JasonStorey9y ago

Have you tried https://www.jukedeck.com ? AI composed music at the touch of a button.

chriswarbo9y ago

Something like https://www.youtube.com/watch?v=Wx3by7ZaaZA ? ;)

mrkgnao9y ago

This reminds me of the Library of Babel short story.

ThePhysicist9y ago

I agree, the samples sound very natural. I ask myself though how similar they are to the data that has been used for training, as it would be trivial to rearrange individual pieces of a large training set in ways that sound good (especially if a human selects the good samples for presentation afterwards).

What I'd really like to see therefore is a systematic comparison of the generated music to the training set, ideally using a measure of similarity.

benanneOP9y ago

A nice property of the model is that it is easy to compute exact log-likelihoods for both training data and unseen data, so one can actually measure the degree of overfitting (which is not true for many other types of generative models). Another nice property of the model is that it seems to be extremely resilient to overfitting, based on these measurements.
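To make the "exact log-likelihood" point concrete, here is a toy sketch. It assumes a first-order model over two symbols with made-up probability tables; WaveNet conditions each sample on a long window of previous samples, but the chain-rule bookkeeping is the same: the log-likelihood of a sequence is the sum of the log conditional probabilities of each element given what came before. Comparing the per-sample average on training data and on held-out data gives a direct overfitting measure.

```python
import math


def log_likelihood(seq, first_prob, cond_prob):
    """log p(x) = log p(x_1) + sum_t log p(x_t | x_{t-1}) for a first-order model."""
    ll = math.log(first_prob[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(cond_prob[prev][cur])
    return ll


def mean_nll(sequences, first_prob, cond_prob):
    """Average negative log-likelihood per sample (lower is better)."""
    total = sum(log_likelihood(s, first_prob, cond_prob) for s in sequences)
    count = sum(len(s) for s in sequences)
    return -total / count


# Hypothetical model tables, for illustration only.
first_prob = {0: 0.5, 1: 0.5}
cond_prob = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

train_nll = mean_nll([[0, 0, 0, 1, 1]], first_prob, cond_prob)
heldout_nll = mean_nll([[0, 1, 0, 1, 0]], first_prob, cond_prob)
overfit_gap = heldout_nll - train_nll  # a large gap would suggest overfitting
```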

augustl9y ago

Good point! Are (some of) the chords completely made up, for example, or is it only using chords it has heard before?

JoeDaDude9y ago

Decades ago, I was testing an LPC-10 vocoder. I discovered many new and strange sounds by playing with the input mike, such as blowing into it, or rubbing it. Like the LPC-10, I wonder about untapped musical possibilities that this allows.

ArkyBeagle9y ago

That seems completely tractable by simply adding a bit of the right reverb to the generated sample, more or less "in post".

augustl9y ago

Good point! Just train it with recordings that have no reverberation, and add it later.

grandalf9y ago· 12 in thread

This is incredible. I'd be worried if I were a professional audiobook reader :)

AndrewUnmuted9y ago

I worked for Audible for five years, and this exact conversation was had often in my division (ACX.com - Audible's "Audiobook Creation Exchange".)

Audible brought ACX together in order to bolster its catalog. The company-wide initiative was called PTTM ('pedal to the metal') and ACX was Audible's secret weapon to gain an enormous competitive foothold over the rest of the audiobook industry. Because we paid amateurs dirt-cheap rates to record horrible, self-published crap (to which Amazon, Audible's parent company had the exclusive rights), Audible was able to bolster its numbers substantially in a short period of time.

The dirty not-so-secret behind this strategy was: nobody bought these particular audiobooks. These audio titles were not really made to be "purchased," but rather to bulk up Audible's bottom line. We knew that the ACX titles were not popular, because the amateur narrators' acting talents and audio production skills were remarkably subpar.

Neural nets may be able to narrow the gap between the pros and the lowest-common-denominator to the point where they can become the next "ACX," but frankly, it won't matter to audiobook listeners, because audiobook listeners don't buy "ACX" audiobooks. Books, even in audio form, are a major intellectual and temporal commitment (not to mention -- they tend to be pricey.) Customers will always want to buy the human-narrated version of a book - the professional production of a book. If that stops being offered, Audible will anger a lot of customers and I think Bezos has better shit to worry about than his puny audiobooks subsidiary.

Despite that, user-generated content is a secret weapon that a lot of websites wield effectively - including HN - but this is beginning to shed its effectiveness. Indeed, the next generation of cost-slashing-while-polluting-the-quality-of-your-catalog will belong to the neural nets. They may be able to get better sales than ACX titles do today with AI-generated audio content, but the actors are going nowhere.

Falcon99y ago

I've listened to some LibriVox recordings of public domain works, notably A Princess of Mars. The price was right at the time, though the quality was, as you say, remarkably subpar. If I could have had a neural net read me the book instead of having to adjust to a new narrator every chapter, that would have been preferable.

That said, I have money now, so give me Todd McLaren narrating Altered Carbon for the cost of an Audible Credit every time.

espadrine9y ago

I wouldn't. The results they offer are excellent, but the missing points they need to achieve human level are related to producing the correct intonation, which requires accurate understanding of the material. That is still at least ten years in the future, I expect.

syllogism9y ago

Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it don't really provide a great prediction of people's behaviours. It's also very expensive to get people to do the prosody annotations accurately, using whatever given theory.

Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.

grandalf9y ago

I don't see why many aspects of intonation couldn't be taught the same way ...

badminton19y ago

There is significant advance in sentiment analysis too. Trading bots use sentiment analysis as some of the input for their time series prediction algorithms. I would not say 10 years.

iandanforth9y ago

What about auto-tuning? I can do a pretty good reading-with-intention but I don't have the melt-your-brain-rich tones of Stephen Fry or Ian McKellen.

swalsh9y ago

That is so exciting for me. I love listening to audiobooks when I'm walking my dog, or driving, or something boring that doesn't need my brain but does need my arms.

The issue is the selection is so much smaller than the selection of books.

grandalf9y ago

Indeed. It also sounds like it could be trained to correctly read math or code, the two things that require enough expertise to properly pronounce that most text to speech engines fail miserably.

Something like:

  a(b+c)

"a times the quantity b plus c"

If read with proper inflection, this would be a vast improvement and could open up all sorts of technical material to people for whom audio learning is preferred.

I think back to the first math teacher I had whose pronunciation of the notation was precise and unambiguous enough that one didn't really have to be watching the board. This is a rare gift. It is possible in many areas of math, yet few teachers master it (or realize how helpful it is).

mdip9y ago

I'm an audiobook junkie and as far as professional narrators go, I think it'd be hard to replace a high-end performance with something computer generated and end up with the level of quality offered by the likes of a great narrator like Scott Brick. I mention him by name because it was him that made me realize how important good quality narration is. I had purchased a book at an airport bookstore on a whim and while waiting for a plane was so disgusted with the poor quality of the writing that I actually threw the book out[0]. Years later, I had grabbed an audio book by an author I hadn't heard of simply because it was read by Scott Brick and recommended to "Read Next". Two hours in and I realized the book I had been enjoying so much was the same terrible book I had thrown out years before[1].

While I don't doubt it'll be possible for a computer to match it with enough input data (both in voice and human adjustment), it'll probably be a while before we'll be there and when we are there it'll likely require a lot of adjustment on the part of a professional. A big part of narration is knowing when and where a part of the story requires additional voice acting (and understanding what is required). A machine generated narration would have to understand the story sufficiently to be able to do that correctly. They might be able to get the audio to sound as good as it would sound if I narrated it, but someone with talent in the area is going to be hard to match.

All of that aside, it's getting pretty close to "good enough". When it reaches that point, my hope is more books will have audio versions available[2] and in all likelihood, some books that would have been narrated by a person today will likely be narrated by technology when it reaches that point, limiting human narration only to the top x% of books.

[0] I always resell books or donate them. This book was so bad that the half-hour it took from my life felt like a tragedy. I threw it out to prevent someone from experiencing its awfulness -- even for free.

[1] I realized it was the same book at the point a story was told that I had only read in the first book (and found mildly humorous). The reason I hated the other book was that it was written in the first person as a New York cop. I couldn't form a mental picture and the character was entirely unbelievable and one dimensional. When narrated properly, that problem was eliminated.

[2] I "speed read" (not gimmicky ... scan/skimming) and consume a ton of text. I've been doing it for 20 years or so and find it difficult to read word-for-word as is required for enjoyment of fiction, so to "force" it, I stick with audio books for fiction and love them.

grandalf9y ago

I too greatly appreciate highly skilled readers. It's another layer of creativity and inspiration in addition to the text, and when done well adds a lot to the book.

visarga9y ago

I only fell in love with the voice of a single audiobook narrator. I checked, and yes, he was Scott Brick. I think he adds about 50% on top of the value of the written book by his interpretation.

noonespecial9y ago· 5 in thread

So when I get the AI from one place, train it with the voices of hundreds of people from dozens of other sources, and then have it read a book from Project Gutenberg to an mp3... who owns the mechanical rights to that recording?

visarga9y ago

> who owns the mechanical rights to that recording?

The monkey who shot the picture. https://en.wikipedia.org/wiki/Monkey_selfie

novalis789y ago

good point ... I am pretty sure there are a thousand audible products waiting to be launched.

kuschku9y ago

Every single person who had rights on the sources for audio you used.

For the same reason, Google training neural networks with userdata is very legally doubtful – they changed the ToS, but also used data collected before the ToS change for that.

feral9y ago

>Every single person who had rights on the sources for audio you used.

What if my 'AI' was a human who learned to speak by being trained with the voices of hundreds of people from dozens of other sources? What's the difference?

Those waters seem muddy. I think that'd be an interesting copyright case, don't think it's self evident.

skoocda9y ago

LibriVox

jay-anderson9y ago· 4 in thread

Any suggestions on where to start learning how to implement this? I understand some of the high level concepts (and took an intro AI class years ago - probably not terribly useful), but some of them are very much over my head (e.g. 2.2 Softmax Distributions and 2.3 Gated Activation Units) and some parts of the paper feel somewhat hand-wavy (2.6 Context Stacks). Any pointers would be useful as I attempt to understand it. (EDIT: section numbers refer to their paper)
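For the two sections mentioned, a minimal sketch of the pieces as the paper describes them, with the network itself left out. Section 2.2 compands each audio sample with a mu-law transform (mu = 255) and quantizes it to 256 classes so the output can be a softmax over 256 values; section 2.3 multiplies a tanh "filter" by a sigmoid "gate". The function names here are made up, and `gated_unit` takes two already-computed pre-activations as plain numbers, where the real model gets them from two separate convolutions.

```python
import math

MU = 255  # 256 quantization levels, as in the paper


def mu_law_encode(x, mu=MU):
    """Compand x in [-1, 1] and quantize it to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    companded = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((companded + 1) / 2 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Approximate inverse of mu_law_encode (quantization loses precision)."""
    companded = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(companded) * math.log1p(mu)) / mu,
                         companded)


def softmax(logits):
    """Turn 256 raw scores into a probability distribution over sample classes."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_unit(filter_pre, gate_pre):
    """tanh(filter) * sigmoid(gate): the gate decides how much signal passes."""
    return math.tanh(filter_pre) * (1.0 / (1.0 + math.exp(-gate_pre)))


silence_class = mu_law_encode(0.0)            # middle of the 256 classes
roundtrip = mu_law_decode(mu_law_encode(0.5))  # close to 0.5, not exact
uniform = softmax([0.0] * (MU + 1))           # untrained net: all classes equal
```

The companding spends more of the 256 levels on quiet samples than on loud ones, which is why 8 bits per sample can sound better than a plain linear 8-bit quantization.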

visarga9y ago

Best advice is to wait for a version to pop up on github. It's hard to implement such a paper as a beginner.

datenwolf9y ago

Well, I think since we now have frameworks for doing this kind of stuff (Tensorflow and similar) the barrier of entry is much, much lower. Also the computing power required to build the models can be found in commodity GPUs.

On a hunch I'd say an absolute beginner may be able to get good results with these tools, just not as quickly as experts in the field who already know how to use the tools properly. That's why I'm going to wait for something to pop up on GitHub, because I have zero practical experience with these things, but I can read these papers comfortably without the need to look up every other term.

There are a number of applications I'd like to throw at deep learning to see how it performs. Most notably I'd like to see how well a deep learning system can extract features from speckle images. At the moment you have to average out the speckles from ultrasound or OCT images before you can feed them to a feature recognition system. Unfortunately this kind of averaging eliminates certain information you might want to process further down the line.

jm547ster9y ago

Agreed, there's a lot of breadth here. I'm coming from the opposite end, with some experience in "manual" concatenative speech synthesis and very little in the ML area; you'd need to be cross-disciplined from the get-go.

jay-anderson9y ago

https://github.com/ibab/tensorflow-wavenet - looks like they're starting to show up.

fastoptimizer9y ago· 4 in thread

Do they say how long generation takes?

Is this insanely slow to train but extremely fast to do generation?

georgehm9y ago

"After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio."

So it looks like generation is a slow process.

kastnerkyle9y ago

Relatively, training is fast (due to parallelism / masking so you don't have to sample during training) but during generation sampling is a sequential process. They talk about it a bit in the previous papers for PixelCNN and PixelRNN.
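A sketch of why generation is inherently sequential, assuming a hypothetical `toy_next_distribution` in place of the trained network (four classes instead of 256, and a made-up rule that favours repeating the last value). Each step needs the previous samples before it can run, so the loop cannot be parallelized across time the way training can.

```python
import random


def toy_next_distribution(history, num_classes=4):
    """Hypothetical stand-in for the network: p(next sample | all previous)."""
    if not history:
        return [1.0 / num_classes] * num_classes
    probs = [0.1] * num_classes
    probs[history[-1]] = 1.0 - 0.1 * (num_classes - 1)
    return probs


def generate(num_steps, rng):
    """Draw one sample at a time, feeding each one back in as input."""
    samples = []
    for _ in range(num_steps):
        probs = toy_next_distribution(samples)
        nxt = rng.choices(range(len(probs)), weights=probs)[0]
        samples.append(nxt)  # becomes part of the input for the next step
    return samples


audio = generate(16, random.Random(0))
```

During training the true previous samples are already known for every position, so all the conditional distributions can be evaluated in one pass; at generation time they only exist once the loop has produced them.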

microtherion9y ago

According to 3rd hand reports I've heard (apply copious amounts of salt), it may take 1 hour of CPU time to generate 1 second of speech.

lucb1e9y ago

I was wondering the same. They don't mention anything about how long it took on what kind of system. Even for a first beta it would give us some ballpark idea of how slow it is -- because it's clearly slow, they just keep back how slow exactly, so it's probably bad.

ronreiter9y ago· 4 in thread

Please please please someone please share an IPython notebook with something working already :)

ThePhysicist9y ago

I have some IPython notebooks for speech analysis using a Chinese corpus. I used those for a tutorial on machine learning with Python and unfortunately they are still a bit incomplete, but maybe you'll find them useful nevertheless (no deep learning involved though). What I do in the tutorial is to start from a WAV file and then go through all the steps required for analyzing the data (using a "traditional" approach), i.e. generate the Mel-Cepstrum coefficients of the segmented audio data and then train a model to distinguish individual words. Word segmentation is another topic that I touch a bit, and where we can also use machine learning to improve the results.

Here's a version with very simple speech training data (basically just different syllables with different tones):

https://github.com/adewes/machine-learning-chinese/blob/mast...

More complex speech training data (from a real-world Chinese speech corpus [not included but downloadable]):

https://github.com/adewes/machine-learning-chinese/blob/mast...

There are other parts of the tutorial that deal with Chinese text and character recognition as well, if you're interested:

https://github.com/adewes/machine-learning-chinese

For part 2 I also train a simple neural network with lasagne (a Python library for deep learning), and I plan to add more deep learning content and do a clean write-up of the whole thing as soon as I have some more time.

ronreiter9y ago

Thanks! will take a look.

visarga9y ago

It takes 90 minutes to synthesize 1 second. Sorry, no laptop version yet.

https://twitter.com/hardmaru/status/773968758519902208
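Back-of-the-envelope arithmetic for that figure, assuming the 16 kHz sample rate mentioned elsewhere in this thread and taking the 90-minute number at face value (it is a second-hand report, so treat the results as rough):

```python
SAMPLE_RATE_HZ = 16_000          # samples per second of audio (assumed)
COMPUTE_SECONDS = 90 * 60        # reported compute time per second of audio

seconds_per_sample = COMPUTE_SECONDS / SAMPLE_RATE_HZ  # compute time per sample
slowdown_vs_real_time = COMPUTE_SECONDS / 1            # times slower than playback
hours_per_audio_minute = 60 * COMPUTE_SECONDS / 3600   # compute hours per minute
```

That works out to roughly a third of a second of compute per audio sample, and about 90 hours for one minute of speech.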

novalis789y ago

I second that!

chestervonwinch9y ago· 3 in thread

Is it possible to use the "deep dream" methods with a network trained for audio such as this? I wonder what that would sound like, e.g., beginning with a speech signal and enhancing with a network trained for music or vice versa.

dontreact9y ago

We tried this but with less success than what wavenet did. https://wp.nyu.edu/ismir2016/wp-content/uploads/sites/2294/2...

dontreact9y ago

There is a link to examples at the end

Applejinx9y ago

The piano stuff already seemed like 'dream music', as did the 'babble' examples. I found myself terribly frustrated by how short all those examples were. I wanted lots more :)

dharma19y ago· 2 in thread

The samples sound amazing. These causal convolutions look like a great idea, will have to re-read a few times. All the previous generative audio from raw audio samples I've heard (using LSTM) has been super noisy. These are crystal clear.

Dilated convolutions are already implemented in TF, look forward to someone implementing this paper and publishing the code.

kastnerkyle9y ago

I did a review for PixelCNN as a part of my summer internship, it covers a bit about how careful masking can be used to create a chain of conditional probabilities [0], which AFAIK is exactly how this "causal convolution" works (can't have dependencies in the 'future'). The PixelCNN and PixelRNN papers also cover this in a fair bit of detail. Ishaan Gulrajani's code is also a great implementation reference for PixelCNN / masking [1].

[0] https://github.com/tensorflow/magenta/blob/master/magenta/re...

[1] https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...
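A plain-Python sketch of the "no dependencies on the future" idea for 1-D audio, with made-up function names and weights. Output at time t only reads inputs at t, t - d, t - 2d and so on, where d is the dilation. The receptive-field helper assumes a stack of such layers with kernel size 2 and dilations doubling from 1 to 512, the example given in the paper.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k*dilation], treating x before t=0 as 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:          # never look at positions after t
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
span = receptive_field(dilations)         # samples of context per output

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv(signal, weights=[1.0, 1.0], dilation=2)

# Causality check: changing the last input must not change earlier outputs.
changed = causal_dilated_conv(signal[:-1] + [99.0], weights=[1.0, 1.0], dilation=2)
```

Doubling the dilation per layer is what lets ten cheap layers cover about a thousand samples of context, where undilated layers with kernel size 2 would cover only eleven.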

dharma19y ago

Heh, just read it! Very useful, will have to go through in detail

novalis789y ago· 2 in thread

What's really intriguing is the part in their article where they explain the "babbling" of wavenet, when they train the network without the text input.

That sounds just like a small kid imitating a foreign (or their own) language. My kids grow up bilingual and I hear them attempt something similar when they are really small. I guess it's like listening in to their neural network modelling the sound of the new language.

sjwright9y ago

To my Australian English ears, the babbling sounded vaguely Scandinavian.

novalis789y ago

Indeed. I was surprised by that as well. Sounded like a Dutch speaker with a muffled voice behind a screen.

JoshTriplett9y ago· 2 in thread

How much data does a model take up? I wonder if this would work for compression? Train a model on a corpus of audio, then store the audio as text that turns back into a close approximation of that audio. (Optionally store deltas for egregious differences.)

kastnerkyle9y ago

It would be a slow (but very efficient information-wise - only have to send text which itself can be compressed!) decompression process with current models / hardware due to sequential relationships in generation.

I am sure people will start trying to speed this up, as it could be a game changer in that space with a fast enough implementation. Google also has a lot of great engineers with direct motivation to get it working on phones, and a history of porting recent research into the Android speech pipeline.

The results speak for themselves - step 1 is almost always "make it work" after all, and this works amazingly well! Step 2 or 3 is "make it fast", depending who you ask.

Houshalter9y ago

We've known for decades that neural networks are really good at image and video compression. But as far as I know, this has never been used in practice, because the compression and decompression times are ridiculous. I imagine this would be even more true for audio.

JonnieCache9y ago· 2 in thread

Wow. I badly want to try this out with music, but I've taken little more than baby steps with neural networks in the past: am I stuck waiting for someone else to reimplement the stuff in the paper?

IIRC someone published an OSS implementation of the deep dreaming image synthesis paper fairly quickly...

kastnerkyle9y ago

Re-implementation will be hard, several people (including me) have been working on related architectures, but they have a few extra tricks in WaveNet that seem to make all the difference, on top of what I assume is "monster scale training, tons of data".

The core ideas from this can be seen in PixelRNN and PixelCNN, and there are discussions and implementations for the basic concepts of those out there [0][1]. Not to mention the fact that conditioning is very interesting / tricky in this model, at least as I read it. I am sure there are many ways to do it wrong, and getting it right is crucial to having high quality results in conditional synthesis.

[0] https://github.com/tensorflow/magenta/blob/master/magenta/re...

[1] https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...

JonnieCache9y ago

Is there any usable example code out there I can play with? I don't care if it sounds noisy and weird, it's all grist for the sampler anyway.

rounce9y ago· 2 in thread

So when does the album drop?

rounce9y ago

In case the above came across as an example of bad sarcasm, I'm very serious. I've a somewhat lazy interest in generative music, and found the snippets in the paper quite appealing.

Though, as was mentioned in a previous comment, due to copyright (attribution based on training data sources, blah blah) I might already have an answer. :(

b0ner_t0ner9y ago

“Is this Hiromi Uehara or WaveNet?”

rdtsc9y ago· 1 in thread

Wonder if there are any implications here for breaking (MitM) ZRTP protocol.

https://en.wikipedia.org/wiki/ZRTP

At some point, to authenticate, both parties verify a short message by reading it to each other.

However, the NSA already tried to MitM that about 10 years ago by using voice synthesis. It was deemed inadequate at the time. I wonder if TTS improvements like these change that game and make it a more plausible scenario.
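
To make the attack surface concrete, here is a simplified sketch of why the spoken check matters. It is not the real ZRTP key derivation: the function name `short_auth_string`, the plain SHA-256 hash, and the standard base32 alphabet are all assumptions chosen for illustration. The idea is only that each side derives a short string from the key material it ended up with, so an attacker sitting in the middle yields two different strings unless they can fake the voice reading them out.

```python
import base64
import hashlib

def short_auth_string(shared_secret: bytes) -> str:
    """Derive a 4-character string from key material (illustrative only)."""
    digest = hashlib.sha256(shared_secret).digest()
    # base32 carries 5 bits per character, so 4 characters cover 20 bits.
    return base64.b32encode(digest).decode("ascii")[:4].lower()

# Without an attacker both ends hold the same secret and read out the same string.
alice = short_auth_string(b"secret negotiated end to end")
bob = short_auth_string(b"secret negotiated end to end")

# With an attacker in the middle, each leg has its own secret, so the
# strings disagree unless the attacker can convincingly speak the other one.
alice_leg = short_auth_string(b"secret between alice and attacker")
bob_leg = short_auth_string(b"secret between attacker and bob")
```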

luckystarr9y ago

This will make private in-person key exchange way more important, especially as the attack vector is so cheap (software).

fpgaminer9y ago· 1 in thread

I'm guessing DeepMind has already done this (or is already doing it), but conditioning on a video is the obvious next step. It would be incredibly interesting to see how accurate it can get generating the audio for a movie. Though I imagine for really great results they'll need to mix in an adversarial network.

visarga9y ago

Oh yes, extract voice and intonation from one language, and then synthesize it in another language -> we get automated dubbing. Could also possibly try to lipsync.

visarga9y ago· 1 in thread

And to think of all those Hollywood SF movies where the robot could reason and act quite well but spoke in a tinny voice. How wrong they got it: we can simulate high-quality voices, but we can't have our reasoning, walking robots.

ilaksh9y ago

Depending on what you mean by 'reasoning, walking robots', not really yet. But every few weeks or months another amazing deep learning/NN result comes out in a different domain, so these types of techniques seem to have very broad application.

Of course, if you mean 'walking' in a literal sense, there are a number of impressive walking robots such as Atlas https://www.youtube.com/watch?v=rVlhMGQgDkY, HRP-2 https://www.youtube.com/watch?v=T6BSSWWV-60 or HRP-4C https://www.youtube.com/watch?v=YvbAqw0sk6M. There are also many types of useful reasoning systems. I am guessing you are thinking of language understanding and generation, but I believe these types of techniques are being applied quite impressively in that area too, for example by DeepMind or Watson https://www.youtube.com/watch?v=i-vMW_Ce51w.

imaginenore9y ago· 1 in thread

Please make it sound like Morgan Freeman.

TeeWEE9y ago

Morgan Freeman +1

imurray9y ago· 1 in thread

Would delete this post if I could. Was a request to fix a broken link. Now fixed.

andrew37269y ago

It seems fixed now.

bbctol9y ago

Wow! I'd been playing around with machine learning and audio, and this blows even my hilariously far-future fantasies of speech generation out of the water. I guess when you're DeepMind, you have both the brainpower and resources to tackle sound right at the waveform level, and rely on how increasingly-magical your NNs seem to rebuild everything else you need. Really amazing stuff.

ericjang9y ago

"At Vanguard, my voice is my password..."

kragen9y ago

This is amazing. And it's not even a GAN. Presumably a GAN version of this would be even more natural — or maybe they tried that and it didn't work so they didn't put it in the paper?

Definitely the death knell for biometric word lists.

banach9y ago

I hope this shows up as a TTS option for VoiceDream (http://www.voicedream.com/) soon! With the best voices they have to offer (currently, the ones from Ivona), I can suffer through a book if the subject is really interesting, but the way the samples sounded here, the WaveNet TTS could be quite pleasant to listen to.

nitrogen9y ago

I wonder how a hybrid model would sound, where the net generates parameters for a parametric synthesis algorithm (or a common speech codec) instead of samples, to reduce CPU costs.
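
A rough sketch of what the cheap half of such a hybrid could look like, assuming an LPC-style source-filter synthesizer: the network (not shown) would predict a few filter coefficients per frame, and a plain all-pole filter would turn a noise excitation into samples. The function name `all_pole_synth` and the single coefficient used here are placeholders, not anything from the paper.

```python
import numpy as np

def all_pole_synth(excitation, a):
    """Run an all-pole filter: y[n] = excitation[n] - sum_k a[k-1] * y[n-k], k >= 1."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                # Feed back previously generated output samples.
                acc -= ak * y[n - k]
        y[n] = acc
    return y

# One frame of 160 samples (10 ms at a 16 kHz sample rate) driven by white noise.
# In the hybrid idea, `a` would come from the net once per frame instead of
# the net emitting all 160 samples itself.
rng = np.random.default_rng(0)
frame = all_pole_synth(rng.standard_normal(160), a=[-0.9])
```

The saving would come from the net running once per frame rather than once per sample; whether the result would still sound as natural is exactly the open question.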

badminton19y ago

The first to do semantic style transfer on audio gets a cookie!

mtgx9y ago

When can we expect this to be used in Google's TTS engine?

tunnuz9y ago

Love the music part! Mmmh ... infinite jazz.

AstralStorm9y ago

Finally a convincing Simlish generator!

billconan9y ago

hope they can release some source code.

wonder how many gpus are required to hold this model.

baccheion9y ago

I suppose it's impressive in a way, but when I looked into "smoothing out" text to speech audio a few years ago, it seemed fairly straightforward. I was left wondering why it hadn't been done already, but alas, most engineers at these companies are either politicking know-nothing idiots, or are constantly being roadblocked, preventing them from making any real advancements.

