http://blogs.technet.com/b/inside_microsoft_research/archive...
"An intern at Microsoft Research Redmond, George Dahl, now at the University of Toronto,
http://www.cs.toronto.edu/~gdahl/
contributed insights into the working of DNNs and experience in training them. His work helped Yu and teammates produce a paper called Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition.
http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASL...
In October 2010, Yu presented the paper during a visit to Microsoft Research Asia. Seide was intrigued by the research results, and the two joined forces in a collaboration that has scaled up the new, DNN-based algorithms to thousands of hours of training data."
Overwhelmingly, it is my experience that researchers in computational disciplines publish papers with half-finished code "available on request" -- and requests are often ignored. It's refreshing to hear someone say, "Yes, the code needs work, but it should be available."
If a company invests in multiple markets, it should be prepared to do well in some markets and badly in others. Bing isn't as good as Google. Android isn't as well designed as Metro. Yes, Android stole Apple's market, and, yes, Apple stole someone else's market. The large technology companies are deadlocked on multiple fronts. That fuels fierce competition and inspires excellence and choice. However, companies should accept that they just aren't the best at everything. Let us make our own choices based on what's best for us.
Awesome.
But impressive and very useful.
Censorship, maybe. And even then, you can't filter conversations in real time -- at best you can 'flag' people who use forbidden words.
The article says that they are a fragment of a phoneme, but how small a fragment are we talking? 2-3 per phoneme, or many more?
Also - I'd be curious how much the phoneme in a word can vary based on accent.
"Speech is a continuous audio stream where rather stable states mix with dynamically changed states. In this sequence of states, one can define more or less similar classes of sounds, or phones.
Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors - phone context, speaker, style of speech and so on. The so called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones - parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units - different substates of a phone. Often three or more regions of a different nature can easily be found.
The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the next part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.
Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That's why we prefer to call this object senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way."
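To make the triphone idea concrete, here's a toy sketch (my own illustration, not from that tutorial) that labels each phone with its left/right context, the way triphones are conventionally named. As the quote says, only the label changes -- each triphone still covers the same stretch of waveform as its center phone:

```python
def triphones(phones):
    """Annotate each phone with its left/right context, triphone-style.

    Each triphone spans the same audio region as its center phone;
    only the label changes (e.g. 'k-ae+t' for the 'ae' in "cat").
    Word/utterance boundaries are padded with silence ('sil').
    """
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        out.append(f"{left}-{p}+{right}")
    return out

print(triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

In a real recognizer the number of distinct triphones explodes, which is why they get clustered into senones via a decision tree, as the quote mentions.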
EDIT: by keeping score I mean keeping track of which techniques are being used where.
(Sorry, I know I'm being cranky)
The key idea is that if you train each layer in an unsupervised manner and then feed its outputs as features to the next layer, the network performs better when you go on to train it in a supervised way. That is, back-propagation on the pre-trained neural net learns a far more robust set of weights than it would without pretraining. Stochastic gradient descent is a very simple optimization technique that is useful when you are working with massive data.
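For anyone unfamiliar, plain SGD really is just a few lines. A toy sketch (my own, nothing to do with MSR's code) fitting y = w*x by least squares, one example at a time:

```python
import random

def sgd(grad, w, data, lr=0.1, epochs=100):
    """Plain stochastic gradient descent: update on one example at a time."""
    for _ in range(epochs):
        random.shuffle(data)  # visit examples in random order each epoch
        for x, y in data:
            w = w - lr * grad(w, x, y)
    return w

# Toy example: fit y = w*x; gradient of the squared error (w*x - y)^2
# with respect to w is 2*(w*x - y)*x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w = sgd(lambda w, x, y: 2 * (w * x - y) * x, 0.0, data)
print(round(w, 3))  # converges toward 3.0
```

The appeal for massive data is that each update touches only one example, so you never need the full dataset in memory at once.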
The architecture Dahl used stacks RBMs (very similar to autoencoders) to seed a regular ol' but many-layered feedforward network. SGD is then used to do back-propagation. The RBMs themselves are trained with a generative technique -- see contrastive divergence for more.
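In case it helps, here's a rough sketch of the CD-1 update for a binary RBM -- this is the textbook rule, not Dahl's actual code, and the toy data is mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, bv, bh, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: batch of visible vectors, shape (batch, n_visible).
    """
    # Up-pass: hidden activations given the data
    ph0 = sigmoid(v0 @ W + bh)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down-pass: one step of reconstruction, then back up
    pv1 = sigmoid(h0 @ W.T + bv)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + bh)
    # Contrast positive (data) and negative (reconstruction) statistics
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    bv += lr * (v0 - v1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
    return W, bv, bh

# Toy training run: learn to reconstruct two 4-bit patterns.
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W = 0.01 * rng.standard_normal((4, 2))
bv, bh = np.zeros(4), np.zeros(2)
for _ in range(2000):
    W, bv, bh = cd1_step(W, bv, bh, data)
```

To build a deep net you'd train one RBM, freeze it, use its hidden activations as data for the next RBM, and so on -- then unroll the stack into a feedforward net and fine-tune with back-prop.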
The Google architecture is more complex and based on biological models. It is not trying to learn an explicit classifier; instead they train a many-layered autoencoder network to learn features. I only skimmed the paper, but they have multiple layers specialized for particular types of processing (think Photoshop, not Intel), and using SGD they optimize an objective that essentially learns an effective decomposition of the data.
The main takeaway is that if you can find an effective way to build layered abstractions, then you will learn robustly.
BTW, I took his Coursera course on Machine Learning and it was great! I also recommend it A LOT for picking up basic ML knowledge.
I'm mostly curious because I used HTK for my thesis and would like to know how they compare (beyond one just being 'newer').
The main difference here is hooking the DNN output to an HMM decoder, replacing the GMMs -- and, possibly even more important, the training process they use to build the DNN fairly efficiently. That's the biggest thing: GMMs, at least the last time I looked, can be trained and adapted much more quickly than a DNN.
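For the curious, the usual "hybrid" trick is: take the DNN's per-state posteriors, divide by the state priors, and hand the resulting scaled likelihoods to the HMM decoder in place of the GMM emission densities. A toy sketch (my numbers, not theirs):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Convert DNN state posteriors p(s|x) into scaled likelihoods.

    By Bayes' rule, p(x|s) is proportional to p(s|x) / p(s), so dividing
    the softmax outputs by the state priors gives quantities that can
    drop into an HMM decoder where GMM emission densities used to go.
    """
    return posteriors / priors

# Toy numbers: 3 HMM states, one audio frame.
post = np.array([0.7, 0.2, 0.1])   # DNN output (softmax over states)
prior = np.array([0.5, 0.3, 0.2])  # state priors from training alignments
print(scaled_likelihoods(post, prior))
```

The rest of the decoder (transition probabilities, lexicon, language model) stays exactly as it was with GMMs, which is why this slots into existing recognizers so cleanly.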
I think the HTK doesn't use neural networks at all. What it does is simply compute MFCCs of the sound signal and use them as input to a chain of HMM models. Well, "simply" that, plus the dozens of refinements and tweaks needed to make that work well.
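For reference, the classic MFCC front end is roughly: frame the signal, window, power spectrum, mel filterbank, log, DCT. A rough numpy sketch (my own simplification -- real front ends like HTK's add pre-emphasis, liftering, deltas, and so on):

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Very rough MFCC sketch: frame, window, power spectrum,
    mel filterbank, log, DCT. For illustration only."""
    # 25 ms frames with a 10 ms hop, Hamming-windowed
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = mfcc(sig)
print(feats.shape)  # one 13-dim feature vector per 10 ms frame
```

Each frame's 13-coefficient vector is then what the HMM chain (or here, the DNN) actually sees instead of raw audio.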
Here, I guess they do some sort of preprocessing on the sound features with their deep neural networks before feeding the whole thing to the HMMs.
The true "breakthrough" here would be if Microsoft made a voice recognition system that could run entirely on a device (no internet connection needed) and accurately understand speech without terabytes of training data or a local user training session. I can't tell from the article if this is what Microsoft is claiming.
Also, it appears that "Deep Neural Network" isn't the most common term of art here. DNN appears to be a synonym for "Deep Belief Network".[1] Can anyone confirm?
[1] http://www.scholarpedia.org/article/Deep_belief_networks
They are basically using a new (in the context of speech recognition) technique that seems to improve accuracy by 16% relative on their test data (and using their code :-)). It's a really great result, but it doesn't change the basic nature of a state-of-the-art speech recognizer at all -- you still need to train and adapt it, and it still needs lots and lots of data.