http://blogs.technet.com/b/inside_microsoft_research/archive...
"An intern at Microsoft Research Redmond, George Dahl, now at the University of Toronto,
http://www.cs.toronto.edu/~gdahl/
contributed insights into the working of DNNs and experience in training them. His work helped Yu and teammates produce a paper called Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition.
http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASL...
In October 2010, Yu presented the paper during a visit to Microsoft Research Asia. Seide was intrigued by the research results, and the two joined forces in a collaboration that has scaled up the new, DNN-based algorithms to thousands of hours of training data."
Overwhelmingly, it is my experience that researchers in computational disciplines publish papers with half-finished code "available on request" -- and requests are often ignored. It's refreshing to hear someone say, "Yes, the code needs work, but it should be available."
If a company invests in multiple markets, it should be prepared to do well in some markets and badly in others. Bing isn't as good as Google. Android isn't as well designed as Metro. Yes, Android stole Apple's market, and, yes, Apple stole someone else's market. The large technology companies are deadlocked on multiple fronts. That fuels fierce competition and inspires excellence and choice. However, companies should accept that they just aren't the best at everything. Let us make our own choices based on what's best for us.
Awesome.
But impressive and very useful.
Censorship, maybe. And even then, you can't filter conversations in real time -- at best you can 'flag' people who use forbidden words.
The article says that they are a fragment of a phoneme, but how small a fragment are we talking? 2-3 per phoneme, or many more?
Also - I'd be curious how much the phoneme in a word can vary based on accent.
"Speech is a continuous audio stream where rather stable states mix with dynamically changed states. In this sequence of states, one can define more or less similar classes of sounds, or phones.
Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors - phone context, speaker, style of speech and so on. The so called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones - parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units - different substates of a phone. Often three or more regions of a different nature can easily be found.
The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the next part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.
Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That's why we prefer to call this object senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way."
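To make the triphone idea concrete, here's a toy sketch (my own illustration, not from that tutorial) that labels each phone with its left/right context, the way triphones are conventionally named. As the quote says, only the label changes -- each triphone still covers the same stretch of waveform as its center phone:

```python
def triphones(phones):
    """Annotate each phone with its left/right context, triphone-style.

    Each triphone spans the same audio region as its center phone;
    only the label changes (e.g. 'k-ae+t' for the 'ae' in "cat").
    Word/utterance boundaries are padded with silence ('sil').
    """
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        out.append(f"{left}-{p}+{right}")
    return out

print(triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

In a real recognizer the number of distinct triphones explodes, which is why they get clustered into senones via a decision tree, as the quote mentions.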
EDIT: by keeping score I mean keeping track of which techniques are being used where.
(Sorry, I know I'm being cranky)
The key idea is that if you train each layer in an unsupervised manner and then feed its outputs as features to the next layer, the network performs better when you go on to train it in a supervised way. That is, back-propagation on the pre-trained neural net learns a far more robust set of weights than it would without pretraining. Stochastic gradient descent is a very simple optimization technique that is useful when you are working with massive data.
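For anyone unfamiliar, plain SGD really is just a few lines. A toy sketch (my own, nothing to do with MSR's code) fitting y = w*x by least squares, one example at a time:

```python
import random

def sgd(grad, w, data, lr=0.1, epochs=100):
    """Plain stochastic gradient descent: update on one example at a time."""
    for _ in range(epochs):
        random.shuffle(data)  # visit examples in random order each epoch
        for x, y in data:
            w = w - lr * grad(w, x, y)
    return w

# Toy example: fit y = w*x; gradient of the squared error (w*x - y)^2
# with respect to w is 2*(w*x - y)*x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w = sgd(lambda w, x, y: 2 * (w * x - y) * x, 0.0, data)
print(round(w, 3))  # converges toward 3.0
```

The appeal for massive data is that each update touches only one example, so you never need the full dataset in memory at once.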
The architecture Dahl used stacks RBMs (very similar to autoencoders) to seed a regular ol' but many-layered feedforward network. SGD is then used to do back-propagation. The RBMs themselves are trained with a generative technique -- see contrastive divergence for more.
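In case it helps, here's a rough sketch of the CD-1 update for a binary RBM -- this is the textbook rule, not Dahl's actual code, and the toy data is mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, bv, bh, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: batch of visible vectors, shape (batch, n_visible).
    """
    # Up-pass: hidden activations given the data
    ph0 = sigmoid(v0 @ W + bh)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down-pass: one step of reconstruction, then back up
    pv1 = sigmoid(h0 @ W.T + bv)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + bh)
    # Contrast positive (data) and negative (reconstruction) statistics
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    bv += lr * (v0 - v1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
    return W, bv, bh

# Toy training run: learn to reconstruct two 4-bit patterns.
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W = 0.01 * rng.standard_normal((4, 2))
bv, bh = np.zeros(4), np.zeros(2)
for _ in range(2000):
    W, bv, bh = cd1_step(W, bv, bh, data)
```

To build a deep net you'd train one RBM, freeze it, use its hidden activations as data for the next RBM, and so on -- then unroll the stack into a feedforward net and fine-tune with back-prop.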
The Google architecture is more complex and based on biological models. It is not trying to learn an explicit classifier; instead they train a many-layered autoencoder network to learn features. I only skimmed the paper, but they have multiple layers specialized for particular types of processing (think Photoshop, not Intel), and using SGD they optimize an objective that essentially learns an effective decomposition of the data.
The main takeaway is that if you can find an effective way to build layered abstractions, then you will learn robustly.
BTW, I took his Coursera course on Machine Learning and it was great! I also recommend it A LOT for picking up basic ML knowledge.
I'm mostly curious because I used HTK for my thesis and would like to know how they compare (beyond one just being 'newer').
The main difference here is hooking the DNN output to an HMM decoder, replacing the GMMs -- and, possibly even more important, the training process they use to build the DNN fairly efficiently. That's the biggest thing: GMMs, at least the last time I looked, can be trained and adapted much more quickly than a DNN.
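For the curious, the usual "hybrid" trick is: take the DNN's per-state posteriors, divide by the state priors, and hand the resulting scaled likelihoods to the HMM decoder in place of the GMM emission densities. A toy sketch (my numbers, not theirs):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Convert DNN state posteriors p(s|x) into scaled likelihoods.

    By Bayes' rule, p(x|s) is proportional to p(s|x) / p(s), so dividing
    the softmax outputs by the state priors gives quantities that can
    drop into an HMM decoder where GMM emission densities used to go.
    """
    return posteriors / priors

# Toy numbers: 3 HMM states, one audio frame.
post = np.array([0.7, 0.2, 0.1])   # DNN output (softmax over states)
prior = np.array([0.5, 0.3, 0.2])  # state priors from training alignments
print(scaled_likelihoods(post, prior))
```

The rest of the decoder (transition probabilities, lexicon, language model) stays exactly as it was with GMMs, which is why this slots into existing recognizers so cleanly.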
I think the HTK doesn't use neural networks at all. What it does is simply compute MFCCs of the sound signal and use them as input to a chain of HMM models. Well, "simply" that, plus the dozens of refinements and tweaks needed to make that work well.
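For reference, the classic MFCC front end is roughly: frame the signal, window, power spectrum, mel filterbank, log, DCT. A rough numpy sketch (my own simplification -- real front ends like HTK's add pre-emphasis, liftering, deltas, and so on):

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Very rough MFCC sketch: frame, window, power spectrum,
    mel filterbank, log, DCT. For illustration only."""
    # 25 ms frames with a 10 ms hop, Hamming-windowed
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = mfcc(sig)
print(feats.shape)  # one 13-dim feature vector per 10 ms frame
```

Each frame's 13-coefficient vector is then what the HMM chain (or here, the DNN) actually sees instead of raw audio.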
Here, I guess they do some sort of preprocessing on the sound features with their deep neural networks before feeding the whole thing to the HMMs.
The true "breakthrough" here would be if Microsoft made a voice recognition system that could run entirely on a device (no internet connection needed) and accurately understand speech without terabytes of training data or a local user training session. I can't tell from the article if this is what Microsoft is claiming.
Also, it appears that "Deep Neural Network" isn't the most common term of art here. DNN appears to be a synonym for "Deep Belief Network".[1] Can anyone confirm?
[1] http://www.scholarpedia.org/article/Deep_belief_networks
They are basically using a new (in the context of speech recognition) technique that seems to improve accuracy by 16% relative on their test data (and using their code :-)). It's a really great result, but it doesn't change the basic nature of a state-of-the-art speech recognizer at all -- you still need to train and adapt it, and it still needs lots and lots of data.