Make sure to check out the paper on arxiv as well.
That is some very nice and interesting work! In fact, I have also worked on exactly the same thing, so I'm impressed by your accomplishments.
How much have you played around with different local conditioning features, i.e. the phoneme signal? Was it always at 256 Hz? Have you always used nearest-neighbor upsampling to 16 kHz? Have you always used those 2 + (1 + 2 + 2) * (40 + 5) = 227 dimensions? We tried with just 39-dimensional phonemes, which also worked, but the quality was not as nice and it sounded very robotic, probably due to the missing F0. We also only had 100 Hz, but we tried some variants to upsample it to 16 kHz, like linear interpolation or deconvolution, or combinations of them.
In the local conditioning network, you used QRNNs. Did you also try simpler methods, like just pure convolution? (And then the upsampling like you did, by nearest neighbor.)
You are predicting phone duration + F0. Have you also tried an encoder-decoder approach instead, like in Char2Wav? I.e. instead of the duration prediction, you let the decoder unroll it. Then, also like Char2Wav, you can also combine that directly with your Grapheme-to-Phoneme model. Have you tried that?
Did you also try some global condition, like speaker identity?
We also tried all the sampling methods you list and observed the same behavior, i.e. only direct sampling really works. I tried many more deterministic variants (like taking the mean) but none of them worked, which is a bit strange. Also, the quality can vary depending on the random seed.
Thanks, Albert
We've experimented a bunch with many of these hyperparameters. Our phoneme signal has mostly stayed 256 Hz, but we've done a few experiments with lower-frequency signals that indicate it's probably possible to reduce it.
We have used many types of upsampling, and find that the upsampling and conditioning procedure does not affect the quality of the audio itself, but does affect the frequency of pronunciation mistakes. We used upsampling based on bicubic and bilinear interpolation, as well as transposed convolutions and a variety of other simpler convolutions (for example, per-channel transposed convolutions). These tend to work and converge, but then generate pronunciation mistakes on difficult phonemes. A full transposed convolution upsampling (two transposed convolution layers with stride 8 each) works almost as well as our bidirectional QRNNs, but it's much, much more expensive in terms of compute and parameters, and takes longer to train as well.
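For concreteness, the "upsampling by repetition" alternative to transposed convolutions is just nearest-neighbor repetition of each conditioning frame along the time axis. A minimal numpy sketch, using the 100 Hz → 16 kHz factor from the question above (the feature dimension 227 is taken from the thread; the actual pipeline details are in the paper):

```python
import numpy as np

def upsample_by_repetition(cond, factor):
    """Nearest-neighbor upsampling in time: repeat each conditioning
    frame `factor` times, turning (T, D) features into (T * factor, D)."""
    return np.repeat(cond, factor, axis=0)

# e.g. 0.5 s of 100 Hz features, upsampled to the 16 kHz sample rate
frames = np.random.randn(50, 227)
samples = upsample_by_repetition(frames, 16000 // 100)
print(samples.shape)  # (8000, 227)
```

This has no learned parameters at all, which is why pairing it with an expressive conditioning network (like a bidirectional QRNN stack) matters.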
As noted in the paper, we used many of the original features used for WaveNet before reducing our feature set. F0 is definitely important for proper intonation. We find that including the surrounding phonemes is quite important; with the bidirectional QRNN upsampling, leaving those out still works, but not nearly as well. It seems likely that a different conditioning network would remove the need for those "context" phonemes.
We have not yet used an encoder-decoder approach for duration or F0. Char2Wav has a bunch of interesting ideas, and it may be a direction for our future work. However, we do not plan on including the grapheme-to-phoneme model into our main model, because it's crucial that we easily affect the pronunciation of phonemes with a phoneme dictionary; by having an explicit grapheme-to-phoneme step, we can easily set the pronunciation for unseen words (like "P!nk" or "Worcestershire"; an integrated grapheme-to-phoneme model would not be able to do those, even humans usually cannot!).
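The value of keeping grapheme-to-phoneme as an explicit step can be sketched in a few lines: a pronunciation dictionary takes priority, and the learned model is only a fallback. (The dictionary entry and `g2p_model` below are illustrative stand-ins, not the actual system.)

```python
def phonemize(word, dictionary, g2p_model):
    """Explicit G2P step: a hand-editable pronunciation dictionary
    overrides the learned model, so irregular words ("P!nk",
    "Worcestershire") can be fixed without retraining anything."""
    if word.lower() in dictionary:
        return dictionary[word.lower()]
    return g2p_model(word)  # learned fallback for everything else

# illustrative hand-written entry (not an official dictionary)
overrides = {"worcestershire": ["W", "UH1", "S", "T", "ER0", "SH", "ER0"]}
print(phonemize("Worcestershire", overrides, g2p_model=lambda w: list(w.upper())))
```

An end-to-end model would have to relearn such words from data, which for names like "P!nk" may never appear often enough.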
We have not yet worked with speaker global conditioning, but it is likely that the results from the WaveNet paper apply to our WaveNet implementation as well.
Finally, as for sampling, we have not seen much variation due to random seed for a fully converged model. However, our intuition for why sampling is important is that the speech distribution is (a) multimodal and (b) biased towards silence. If you are interested, you can gain a little bit of intuition about what the distribution actually looks like by just plotting a color map across time, with high-probability values being bright and low-probability values being dark; it generates a pretty plot, and you can see that some areas are clearly stochastic (especially fricatives) and some areas are multimodal (vowel wave peaks).
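To make the comparison of sampling strategies concrete, here is a small sketch of one inference timestep, assuming an 8-bit mu-law output (256 softmax classes). The random logits stand in for real network output; only the stochastic "direct" draw matches what the thread reports as working:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_step(logits, rng):
    """Turn the network's logits over 256 mu-law levels into a sample.
    Direct (stochastic) sampling is the variant that works; the mode
    (argmax) and the mean are deterministic alternatives that don't."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    direct = int(rng.choice(len(probs), p=probs))  # direct sampling
    mode = int(probs.argmax())                      # taking the mode
    mean = float(probs @ np.arange(len(probs)))     # taking the mean
    return direct, mode, mean

logits = rng.normal(size=256)  # stand-in for real network output
print(sample_step(logits, rng))
```

With a multimodal distribution, the mean can land between modes (a value the network never assigns high probability), which is one plausible reason the deterministic variants sound wrong.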
For fully end-to-end models, it's hard to say exactly. The Char2Wav paper demonstrates that there is hypothetically an architecture and a set of weights that can do synthesis end-to-end, but we cannot yet train such a system. On Reddit, one of the Char2Wav authors comments that they tried training it directly and didn't get great results, and at SVAIL we've also had some trouble doing so. I think it is very likely going to happen in the next several months or year, but we don't yet know exactly what needs to happen in order to get it to work.
As for inference, some of the inference optimizations do generalize. In fact, the GPU optimizations (persistent kernels) were originally developed by our systems team, and published in the Persistent RNN [0] paper. (This is a really powerful technique that CUDA makes very hard to implement, and I have a massive amount of respect for the folks who managed to make it work!) Persistent RNNs make training at close-to-peak-FLOPs with very low batch sizes plausible, and make GPU WaveNet inference plausible. At the moment, our CPU kernels are much more promising, but we don't know whether that will stay the case. For mobile, I think it is possible to get the current systems to work on fairly powerful mobile CPUs with a bunch more work into optimization and low-level assembly, but we haven't done it yet so time will tell.
[0] https://svail.github.io/persistent_rnns/ and http://jmlr.org/proceedings/papers/v48/diamos16.pdf
You mean high quality, right? Speech synthesis that is understandable and runs on cheap hardware has been around for decades. Speech recognition has also been around for a long time, but there's a huge difference in usability between "pretty good" recognition and "pretty good" synthesis: one is useful, the other not so much.
Is there an implementation of this to check out? It seems like you needed to write some custom, low-level code to implement this in real-time. Which libraries did you use to generate the ANNs and do the inferences?
We use TensorFlow for writing and training the model, and C++ with a lot of hand optimizations for inference, with assembly kernels written in PeachPy (which is an awesome piece of software!).
The upsampling procedure is quite finicky, so we went through quite a few iterations there, but we didn't have to tune the QRNN's hyperparameters much. Once we implemented the QRNN in CUDA for TensorFlow and got it to train, it worked without too much trouble.
Our collaborators in Beijing mentioned that bidirectional LSTMs also worked in a similar way, though.
That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.
The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good. The fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of the voice synthesis possible with their system given perfect inflection as input.
To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.
Our work is meant to make TTS easier for deep learning researchers to work with, by describing a complete system that can be trained entirely from data, and to demonstrate that neural vocoder substitutes can actually be deployed to streaming production servers. Future work (both by us and hopefully other groups) will make further progress on inflection synthesis!
Gotcha, now I understand.
Yes, but imagine being able to take the sound from one person and the inflection from another. If you want to fake someone saying something, you don't need to do pure TTS; a human can be used to fake another person's inflection.
Adobe has already developed that technology:
https://arstechnica.co.uk/information-technology/2016/11/ado...
Now imagine combining it with this:
Face2Face: Real-time Face Capture and Reenactment of RGB Videos https://www.youtube.com/watch?v=ohmajJTcpNk
Perhaps using the intonation from the face-actor's voice to guide the speech synthesis.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=2m34s
"Wife" sounds exactly the same in both places. All they did was copy the exact waveform from one point to another. Nothing is being synthesized.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=3m54s
The word "Jordan" is not being synthesized. The speaker was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though it was synthesized on the fly. This is a scripted performance and Jordan is feigning surprise.
https://www.youtube.com/watch?v=I3l4XLZ59iw&t=4m40s
The phrase "three times" here was prerecorded.
This was a phony demonstration of a nonexistent product. Reporters parroted the claims and none questioned what they witnessed. Adobe falsely took credit and received endless free publicity for a breakthrough they had no hand in by staging this fake demo right on the heels of the genuine interest generated by Google WaveNet. I suppose they're hoping they'll have a real product ready by whatever deadline they've set for themselves.
To be clear, I like Adobe and I think it's a cunning move on their part.
The work is done by Mozilla
https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...
Who is working on this problem, and how?
It's amazing that it all boils down to 1s and 0s and some boolean logic.
The recordings at the bottom are just recordings of an old lady and a young woman.
So many innovations happening in voice-related technology.
There are many interesting advances that the Deep Voice paper and implementation make, but the part I'm excited by (and which might be transferable to other tasks that use RNNs) is showing that QRNNs generalize to speech too - in this case in place of WaveNet.
"WaveNet uses transposed convolutions for upsampling and conditioning. We find that our models perform better, train faster, and require fewer parameters if we instead first encode the inputs with a stack of bidirectional quasi-RNN (QRNN) layers (Bradbury et al., 2016) and then perform upsampling by repetition to the desired frequency."
QRNNs are a variant of recurrent neural networks. They're up to 16 times faster than even Nvidia's highly optimized cuDNN LSTM implementation and give comparable or better accuracy in many tasks. This is the first time they have been tried in speech - seeing the authors note that the advantages hold across the board (better, faster, smaller) is brilliant!
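The speed comes from the structure of the QRNN: the gates are computed for all timesteps in parallel by convolutions over the input, and the only sequential work left is a cheap element-wise recurrence ("fo-pooling"). A minimal numpy sketch of that recurrence, with random gate pre-activations standing in for the convolution outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_fo_pool(z, f, o):
    """fo-pooling from the QRNN paper (Bradbury et al., 2016).
    z, f, o are (T, D) gate pre-activations produced in parallel by
    convolutions over the input; only this element-wise recurrence
    runs sequentially, with no matrix multiplies inside the loop."""
    T, D = z.shape
    c = np.zeros(D)
    h = np.empty((T, D))
    for t in range(T):
        ft = sigmoid(f[t])
        c = ft * c + (1.0 - ft) * np.tanh(z[t])  # forget-gated cell update
        h[t] = sigmoid(o[t]) * c                  # output gate
    return h

rng = np.random.default_rng(0)
h = qrnn_fo_pool(rng.normal(size=(20, 8)), rng.normal(size=(20, 8)),
                 rng.normal(size=(20, 8)))
print(h.shape)  # (20, 8)
```

Because the per-timestep work is element-wise rather than a dense matrix multiply (as in an LSTM), the sequential bottleneck is tiny, which is where the large speedups come from.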
If you're interested in technical details, our blog post[1] provides a broader overview and our paper is available for deeper detail[2].
[1]: https://metamind.io/research/new-neural-network-building-blo...