So, I'm not the only one seeing this issue. It seems like many recent AI papers want to look as impressive as possible while giving you as little implementation info as possible. This bothers me, because it defeats the very purpose of research publication.
[1] http://niclane.org/pubs/deepx_ipsn.pdf
[2] https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf
Props to the author, and especially to the DeepMind researchers who published their work! I look forward to living in a world where this type of technology is ubiquitous and mostly commoditized.
[1] http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...
A little bit off-topic, but do you know of any recent work or papers on speech recognition in the language-teaching area? (I mean analysing and rating a speaker's accuracy, detecting incorrect pronunciation of phones, and so on.)
All of the results come back as gibberish. The results on the training data seem just fine. I'm curious whether you've tested the above to make sure it didn't overfit.
"Second, the Paper added a mean-pooling layer after the dilated convolution layer for down-sampling. We extracted MFCC from wav files and removed the final mean-pooling layer because the original setting was impossible to run on our TitanX GPU." [1]
[1] https://github.com/buriburisuri/speech-to-text-wavenet#speec...
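To make the quoted change concrete, here's a minimal numpy sketch of the two pieces being discussed: a dilated 1-D convolution followed by a mean-pooling layer that down-samples its output. This is just an illustration of the operations, not the repo's actual code, and the filter weights and sizes are made up.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Dilated 1-D convolution, valid padding: taps are `dilation` apart."""
    k = len(w)
    span = (k - 1) * dilation
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        out[t] = sum(w[i] * x[t + i * dilation] for i in range(k))
    return out

def mean_pool1d(x, size):
    """Down-sample by averaging non-overlapping windows of `size` samples."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).mean(axis=1)

x = np.arange(16, dtype=float)                          # toy "waveform"
h = dilated_conv1d(x, np.array([0.5, 0.5]), dilation=2) # 14 samples out
y = mean_pool1d(h, 2)                                   # halved to 7 samples
```

Removing the pooling layer (as the repo did) means the network keeps the full temporal resolution of `h`, which costs proportionally more activation memory on the GPU.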
Perhaps future communication applications could have a WaveNet on either end, which learns the voice of the person you're communicating with and then, past a certain point in the conversation, sends only text?
I'm coming at this from a point of ignorance though, so correct me if I've made erroneous assumptions.
This could have interesting implications for Foley-artists of the 21st century.
How likely is it that such tech would help lower-budget companies who want to implement voice in their software, say for video games or similar?
Hmm, now this has me wondering what implications this has for voice acting as well.
EDIT: We can call the ambient sound symbols sent over the wire "Soundmojis" or "amojis" or "audiomojis"
It doesn't seem like the mainstream engines (Alexa, Google Voice, Siri) are context aware. Why not?
This is what I'm solving at Optik: helping you manage the things you care about in the place that you are, and NOT exposing your personal details to cloud computation.
I had a go at implementing wave->phoneme recognition using a simple neural net and it seemed to work pretty well.
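For anyone curious what that looks like, here's a toy sketch of frame-level wave→phoneme classification: two synthetic tones stand in for phoneme classes, log-magnitude spectra stand in for MFCCs, and a one-layer softmax stands in for the "simple neural net". Everything here (frame sizes, features, data) is my own assumption, not the parent's code.

```python
import numpy as np

def frame_features(wave, frame_len=160, n_bins=8):
    """Split a waveform into frames and take low log-magnitude spectrum
    bins as crude per-frame features (a stand-in for MFCCs)."""
    n = len(wave) // frame_len
    frames = wave[:n * frame_len].reshape(n, frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, :n_bins]
    return np.log(spec + 1e-6)

# Toy data: two pure tones at 16 kHz standing in for two "phonemes".
t = np.arange(1600) / 16000.0
wave_a = np.sin(2 * np.pi * 200 * t)    # low tone  -> class 0
wave_b = np.sin(2 * np.pi * 2000 * t)   # high tone -> class 1

X = np.vstack([frame_features(wave_a), frame_features(wave_b)])
y = np.array([0] * 10 + [1] * 10)

# One softmax layer trained by gradient descent on cross-entropy.
W = np.zeros((X.shape[1], 2))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * (X.T @ (p - np.eye(2)[y]) / len(y))

pred = (X @ W).argmax(axis=1)
accuracy = (pred == y).mean()
```

A real system would of course use many phoneme classes, real MFCCs, and a deeper net with temporal context, but the frame-then-classify structure is the same.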
Does anyone on HN do active research in this field? Could I pick your brain for a survey of the best papers (especially review papers) on the subject?
Paper, yes. [1] Source code, no.
Anyone want to volunteer a few weeks of GPU time to train this better?