Deep Generative and Discriminative Models for Speech Recognition

Dr. Li Deng
Microsoft Research, USA
Artificial neural networks have been around for over half a century, and their applications to speech processing almost as long, yet it was not until 2010 that a deep form of such networks made its real impact. I will reflect on the path to this transformative success, after providing a brief review of the earlier work on (shallow) neural nets and (deep) generative models relevant to the introduction of deep neural networks (DNNs) to speech recognition several years ago. The role of well-timed academic-industrial collaboration is highlighted, as are the advances in big data and big compute, and the seamless integration of application-domain knowledge of speech with general principles of deep learning. I will then give an overview of the sweeping achievements of deep learning in speech recognition since its initial success in 2010, including modern, state-of-the-art results. These achievements have led to across-the-board, industry-wide deployment of deep learning. Finally, I will review the technical progress of deep learning for speech recognition in recent years, as well as its future directions, in the following major areas:
  1. scaling up/out and speeding up DNN training and decoding;
  2. sequence discriminative training of DNNs;
  3. feature processing by deep models;
  4. adaptation of DNNs and of related deep models;
  5. multi-task and transfer learning by DNNs and related deep models;
  6. convolutional neural networks;
  7. recurrent neural networks and their rich LSTM variants;
  8. other types of deep models, including tensor-based models and integrated deep generative and discriminative models.
In particular, I will elaborate on the strengths and weaknesses of deep discriminative models (e.g., DNNs) and deep generative models, and discuss ways of integrating the two styles of deep models to get the best of both worlds.