So, I'm not the only one seeing this issue. It seems like many recent AI papers want to look as impressive as possible while giving you as little implementation info as possible. This bothers me, because it defeats the very purpose of research publication.
[1] http://niclane.org/pubs/deepx_ipsn.pdf
[2] https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf
Props to the author, and especially to the DeepMind researchers who published their work! I look forward to living in a world where this type of technology is ubiquitous and mostly commoditized.
[1] http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...
A little bit off-topic, but do you know of any recent work or papers on speech recognition in the language-teaching area? (I mean analysing and rating a speaker's accuracy, detecting incorrect pronunciation of phones, and so on.)
All of the results come back as gibberish. The results on the training data seem just fine. I'm curious whether you've tested the above to make sure it didn't overfit.
"Second, the Paper added a mean-pooling layer after the dilated convolution layer for down-sampling. We extracted MFCC from wav files and removed the final mean-pooling layer because the original setting was impossible to run on our TitanX GPU." [1]
[1] https://github.com/buriburisuri/speech-to-text-wavenet#speec...
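To make the quoted change concrete, here's a minimal numpy sketch of the two pieces being discussed: a dilated 1-D convolution followed by a mean-pooling layer that down-samples its output. This is just an illustration of the operations, not the repo's actual code, and the filter weights and sizes are made up.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Dilated 1-D convolution, valid padding: taps are `dilation` apart."""
    k = len(w)
    span = (k - 1) * dilation
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        out[t] = sum(w[i] * x[t + i * dilation] for i in range(k))
    return out

def mean_pool1d(x, size):
    """Down-sample by averaging non-overlapping windows of `size` samples."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).mean(axis=1)

x = np.arange(16, dtype=float)                          # toy "waveform"
h = dilated_conv1d(x, np.array([0.5, 0.5]), dilation=2) # 14 samples out
y = mean_pool1d(h, 2)                                   # halved to 7 samples
```

Removing the pooling layer (as the repo did) means the network keeps the full temporal resolution of `h`, which costs proportionally more activation memory on the GPU.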
Perhaps future communication applications could have a WaveNet on either end, which learns the voice of the person you're communicating with and then, past a certain point in the conversation, sends only text?
I'm coming at this from a point of ignorance though, so correct me if I've made erroneous assumptions.
This could have interesting implications for Foley-artists of the 21st century.
How likely is it that such tech would help lower-budget companies who want to implement voice in their software, say for video games or similar?
Hmm, now this has me wondering what implications this has for voice acting as well.
EDIT: We can call the ambient sound symbols sent over the wire "Soundmojis" or "amojis" or "audiomojis"
It doesn't seem like the mainstream engines (Alexa, Google Voice, Siri) are context aware. Why not?
This is what I'm solving at Optik: helping you manage the things you care about in the place that you are, and NOT exposing your personal details to cloud computation.
I had a go at implementing wave->phoneme recognition using a simple neural net and it seemed to work pretty well.
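For anyone curious what that looks like, here's a toy sketch of frame-level wave→phoneme classification: two synthetic tones stand in for phoneme classes, log-magnitude spectra stand in for MFCCs, and a one-layer softmax stands in for the "simple neural net". Everything here (frame sizes, features, data) is my own assumption, not the parent's code.

```python
import numpy as np

def frame_features(wave, frame_len=160, n_bins=8):
    """Split a waveform into frames and take low log-magnitude spectrum
    bins as crude per-frame features (a stand-in for MFCCs)."""
    n = len(wave) // frame_len
    frames = wave[:n * frame_len].reshape(n, frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, :n_bins]
    return np.log(spec + 1e-6)

# Toy data: two pure tones at 16 kHz standing in for two "phonemes".
t = np.arange(1600) / 16000.0
wave_a = np.sin(2 * np.pi * 200 * t)    # low tone  -> class 0
wave_b = np.sin(2 * np.pi * 2000 * t)   # high tone -> class 1

X = np.vstack([frame_features(wave_a), frame_features(wave_b)])
y = np.array([0] * 10 + [1] * 10)

# One softmax layer trained by gradient descent on cross-entropy.
W = np.zeros((X.shape[1], 2))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * (X.T @ (p - np.eye(2)[y]) / len(y))

pred = (X @ W).argmax(axis=1)
accuracy = (pred == y).mean()
```

A real system would of course use many phoneme classes, real MFCCs, and a deeper net with temporal context, but the frame-then-classify structure is the same.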
Does anyone on HN do active research in this field? Could I pick your brain for a survey of the best papers (especially review papers) on the subject?
Paper, yes. [1] Source code, no.
Anyone want to volunteer a few weeks of GPU time to train this better?