Make sure to check out the paper on arxiv as well.
That is some very nice and interesting work! In fact, I have also worked on exactly the same thing, so I'm impressed by your accomplishments.
How much have you played around with different local conditioning features, i.e. the phoneme signal? Was it always at 256 Hz? Have you always used nearest-neighbor upsampling to 16 kHz? Have you always used those 2 + (1 + 2 + 2) * (40 + 5) = 227 dimensions? We tried with just 39-dimensional phonemes, which also worked, but the quality was not as nice and it sounded very robotic, probably due to the missing F0. We also only had 100 Hz, but we tried some variants to upsample it to 16 kHz, like linear interpolation or deconvolution, or combinations of them.
In the local conditioning network, you used QRNNs. Did you also try simpler methods, like just pure convolution? (And then the upsampling like you did, by nearest neighbor.)
You are predicting phone duration + F0. Have you also tried an encoder-decoder approach instead, like in Char2Wav? I.e. instead of the duration prediction, you let the decoder unroll it. Then, also like Char2Wav, you can also combine that directly with your Grapheme-to-Phoneme model. Have you tried that?
Did you also try some global condition, like speaker identity?
We also tried all the sampling methods you list and observed the same behavior, i.e. only direct sampling really works. I tried many more deterministic variants (like taking the mean) but none of them worked, which is a bit strange. Also, the quality can vary depending on the random seed.
Thanks, Albert
We've experimented a bunch with many of these hyperparameters. Our phoneme signal has mostly stayed 256 Hz, but we've done a few experiments with lower-frequency signals that indicate it's probably possible to reduce it.
We have used many types of upsampling, and find that the upsampling and conditioning procedure does not affect the quality of the audio itself, but does affect the frequency of pronunciation mistakes. We used upsampling based on bicubic and bilinear interpolation, as well as transposed convolutions and a variety of other simpler convolutions (for example, per-channel transposed convolutions). These tend to work and converge, but then generate pronunciation mistakes on difficult phonemes. A full transposed convolution upsampling (two transposed convolution layers with stride 8 each) works almost as well as our bidirectional QRNNs, but it's much, much more expensive in terms of compute and parameters, and takes longer to train as well.
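For concreteness, the "upsampling by repetition" alternative to transposed convolutions is just nearest-neighbor repetition of each conditioning frame along the time axis. A minimal numpy sketch, using the 100 Hz → 16 kHz factor from the question above (the feature dimension 227 is taken from the thread; the actual pipeline details are in the paper):

```python
import numpy as np

def upsample_by_repetition(cond, factor):
    """Nearest-neighbor upsampling in time: repeat each conditioning
    frame `factor` times, turning (T, D) features into (T * factor, D)."""
    return np.repeat(cond, factor, axis=0)

# e.g. 0.5 s of 100 Hz features, upsampled to the 16 kHz sample rate
frames = np.random.randn(50, 227)
samples = upsample_by_repetition(frames, 16000 // 100)
print(samples.shape)  # (8000, 227)
```

This has no learned parameters at all, which is why pairing it with an expressive conditioning network (like a bidirectional QRNN stack) matters.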
As noted in the paper, we used many of the original features used for WaveNet before reducing our feature set. F0 is definitely important for proper intonation. We find that including the surrounding phonemes is quite important; with the bidirectional QRNN upsampling, leaving those out still works, but not nearly as well. It seems likely that a different conditioning network would remove the need for those "context" phonemes.
We have not yet used an encoder-decoder approach for duration or F0. Char2Wav has a bunch of interesting ideas, and it may be a direction for our future work. However, we do not plan on including the grapheme-to-phoneme model into our main model, because it's crucial that we easily affect the pronunciation of phonemes with a phoneme dictionary; by having an explicit grapheme-to-phoneme step, we can easily set the pronunciation for unseen words (like "P!nk" or "Worcestershire"; an integrated grapheme-to-phoneme model would not be able to do those, even humans usually cannot!).
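The value of keeping grapheme-to-phoneme as an explicit step can be sketched in a few lines: a pronunciation dictionary takes priority, and the learned model is only a fallback. (The dictionary entry and `g2p_model` below are illustrative stand-ins, not the actual system.)

```python
def phonemize(word, dictionary, g2p_model):
    """Explicit G2P step: a hand-editable pronunciation dictionary
    overrides the learned model, so irregular words ("P!nk",
    "Worcestershire") can be fixed without retraining anything."""
    if word.lower() in dictionary:
        return dictionary[word.lower()]
    return g2p_model(word)  # learned fallback for everything else

# illustrative hand-written entry (not an official dictionary)
overrides = {"worcestershire": ["W", "UH1", "S", "T", "ER0", "SH", "ER0"]}
print(phonemize("Worcestershire", overrides, g2p_model=lambda w: list(w.upper())))
```

An end-to-end model would have to relearn such words from data, which for names like "P!nk" may never appear often enough.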
We have not yet worked with speaker global conditioning, but it is likely that the results from the WaveNet paper apply to our WaveNet implementation as well.
Finally, as for sampling, we have not seen much variation due to random seed for a fully converged model. However, our intuition for why sampling is important is that the speech distribution is (a) multimodal and (b) biased towards silence. If you are interested, you can gain a little bit of intuition about what the distribution actually looks like by just plotting a color map across time, with high-probability values being bright and low-probability values being dark; it generates a pretty plot, and you can see that some areas are clearly stochastic (especially fricatives) and some areas are multimodal (vowel wave peaks).
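To make the comparison of sampling strategies concrete, here is a small sketch of one inference timestep, assuming an 8-bit mu-law output (256 softmax classes). The random logits stand in for real network output; only the stochastic "direct" draw matches what the thread reports as working:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_step(logits, rng):
    """Turn the network's logits over 256 mu-law levels into a sample.
    Direct (stochastic) sampling is the variant that works; the mode
    (argmax) and the mean are deterministic alternatives that don't."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    direct = int(rng.choice(len(probs), p=probs))  # direct sampling
    mode = int(probs.argmax())                      # taking the mode
    mean = float(probs @ np.arange(len(probs)))     # taking the mean
    return direct, mode, mean

logits = rng.normal(size=256)  # stand-in for real network output
print(sample_step(logits, rng))
```

With a multimodal distribution, the mean can land between modes (a value the network never assigns high probability), which is one plausible reason the deterministic variants sound wrong.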
For fully end-to-end models, it's hard to say exactly. The Char2Wav paper demonstrates that there is hypothetically an architecture and a set of weights that can do synthesis end-to-end, but we cannot yet train such a system. On Reddit, one of the Char2Wav authors comments that they tried training it directly and didn't get great results, and at SVAIL we've also had some trouble doing so. I think it is very likely going to happen in the next several months or year, but we don't yet know exactly what needs to happen in order to get it to work.
As for inference, some of the inference optimizations do generalize. In fact, the GPU optimizations (persistent kernels) were originally developed by our systems team, and published in the Persistent RNN [0] paper. (This is a really powerful technique that CUDA makes very hard to implement, and I have a massive amount of respect for the folks who managed to make it work!) Persistent RNNs make training at close-to-peak-FLOPs with very low batch sizes plausible, and make GPU WaveNet inference plausible. At the moment, our CPU kernels are much more promising, but we don't know whether that will stay the case. For mobile, I think it is possible to get the current systems to work on fairly powerful mobile CPUs with a bunch more work into optimization and low-level assembly, but we haven't done it yet so time will tell.
[0] https://svail.github.io/persistent_rnns/ and http://jmlr.org/proceedings/papers/v48/diamos16.pdf
You mean high quality, right? Speech synthesis that is understandable and runs on cheap hardware has been around for decades. Speech recognition has also been around for a long time, but there's a huge difference in usability between "pretty good" recognition and "pretty good" synthesis: one is useful, the other not so much.
Is there an implementation of this to check out? It seems like you needed to write some custom, low-level code to implement this in real-time. Which libraries did you use to generate the ANNs and do the inferences?
We use TensorFlow for writing and training the model, and C++ with a lot of hand optimizations for inference, with assembly kernels written in PeachPy (which is an awesome piece of software!).
The upsampling procedure is quite finicky, so we went through quite a few iterations there, but we didn't have to tune the QRNN's hyperparameters much. Once we implemented the QRNN in CUDA for TensorFlow and got it to train, it worked without too much trouble.
Our collaborators in Beijing mentioned that bidirectional LSTMs also worked in a similar way, though.
That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.
The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good. The fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of the voice synthesis possible with their system given perfect inflection as input.
To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.
Our work is meant to make TTS easier for deep learning researchers to work with, by describing a complete system that can be trained entirely from data, and to demonstrate that neural vocoder substitutes can actually be deployed to streaming production servers. Future work (both by us and hopefully other groups) will make further progress on inflection synthesis!
Gotcha, now I understand.
Yes, but imagine being able to take the sound from one person and the inflection from another. If you want to fake someone saying something, you don't need to do pure TTS; a human can be used to fake another person's inflection.
Adobe has already developed that technology:
https://arstechnica.co.uk/information-technology/2016/11/ado...
Now imagine combining it with this:
Face2Face: Real-time Face Capture and Reenactment of RGB Videos https://www.youtube.com/watch?v=ohmajJTcpNk
Perhaps using the intonation from the face-actor's voice to guide the speech synthesis.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=2m34s
"Wife" sounds exactly the same in both places. All they did was copy the exact waveform from one point to another. Nothing is being synthesized.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=3m54s
The word "Jordan" is not being synthesized. The speaker was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though it was synthesized on the fly. This is a scripted performance and Jordan is feigning surprise.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=4m40s
The phrase "three times" here was prerecorded.
This was a phony demonstration of a nonexistent product. Reporters parroted the claims and none questioned what they witnessed. Adobe falsely took credit and received endless free publicity for a breakthrough they had no hand in by staging this fake demo right on the heels of the genuine interest generated by Google WaveNet. I suppose they're hoping they'll have a real product ready by whatever deadline they've set for themselves.
To be clear, I like Adobe and I think it's a cunning move on their part.
The work is done by Mozilla
https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...
Who is working on this problem, and how?
It's amazing that it all boils down to 1s and 0s and some boolean logic.
The recordings at the bottom are just recordings of an old lady and a young woman.
So many innovations happening in voice-related technology.
There are many interesting advances that the Deep Voice paper and implementation make, but the part I'm excited by (and which might be transferable to other tasks that use RNNs) is showing that QRNNs generalize to speech too - in this case in place of WaveNet.
"WaveNet uses transposed convolutions for upsampling and conditioning. We find that our models perform better, train faster, and require fewer parameters if we instead first encode the inputs with a stack of bidirectional quasi-RNN (QRNN) layers (Bradbury et al., 2016) and then perform upsampling by repetition to the desired frequency."
QRNNs are a variant of recurrent neural networks. They're up to 16 times faster than even Nvidia's highly optimized cuDNN LSTM implementation and give comparable or better accuracy in many tasks. This is the first time they have been tried in speech - seeing the authors note that the advantages hold across the board (better, faster, smaller) is brilliant!
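The speed comes from the structure of the QRNN: the gates are computed for all timesteps in parallel by convolutions over the input, and the only sequential work left is a cheap element-wise recurrence ("fo-pooling"). A minimal numpy sketch of that recurrence, with random gate pre-activations standing in for the convolution outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_fo_pool(z, f, o):
    """fo-pooling from the QRNN paper (Bradbury et al., 2016).
    z, f, o are (T, D) gate pre-activations produced in parallel by
    convolutions over the input; only this element-wise recurrence
    runs sequentially, with no matrix multiplies inside the loop."""
    T, D = z.shape
    c = np.zeros(D)
    h = np.empty((T, D))
    for t in range(T):
        ft = sigmoid(f[t])
        c = ft * c + (1.0 - ft) * np.tanh(z[t])  # forget-gated cell update
        h[t] = sigmoid(o[t]) * c                  # output gate
    return h

rng = np.random.default_rng(0)
h = qrnn_fo_pool(rng.normal(size=(20, 8)), rng.normal(size=(20, 8)),
                 rng.normal(size=(20, 8)))
print(h.shape)  # (20, 8)
```

Because the per-timestep work is element-wise rather than a dense matrix multiply (as in an LSTM), the sequential bottleneck is tiny, which is where the large speedups come from.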
If you're interested in technical details, our blog post[1] provides a broader overview and our paper is available for deeper detail[2].
[1]: https://metamind.io/research/new-neural-network-building-blo...