Faster Recurrent Neural Network Language Modeling Toolkit with Logarithmic and Sub-logarithmic Cost Function Approximation Algorithms

Anton Bakhtin
Russia, Yandex
Language models help distinguish natural sentences from artificial ones. This property makes them an essential part of machine translation and automatic speech recognition systems. Recently, continuous-space language models based on recurrent neural networks (RNNs) were shown to outperform other models on various benchmark measures. At the same time, training such models is challenging. First, recurrent neural networks suffer from exploding and vanishing gradients, so special training techniques must be used to learn longer temporal patterns. Second, an RNN outputs the probability of the next word at each time step. These probabilities are usually computed with a softmax function, in time linear in the number of words in the vocabulary. However, since real-world vocabularies contain several hundred thousand words or more, efficient algorithms are required to approximate the next-word probability in logarithmic time or faster. Finally, modern datasets contain billions of words. An efficient implementation of the baseline algorithms is therefore crucial for further experiments and research.
We introduce Faster RNNLM, a toolkit for training language models based on recurrent neural networks. It differs from similar toolkits in the following main aspects. First, it uses efficient methods to approximate the next-word probability, namely Hierarchical Softmax [5] and Noise Contrastive Estimation (NCE) [3]. As a result, the toolkit can be used to train language models with huge vocabularies. Second, it supports modern hidden layer setups such as Gated Recurrent Units [2] and Rectified Linear Units with diagonal initialization [1]. Finally, a Hogwild-like [4] technique is implemented to utilize several threads without a significant negative impact on performance. As a result, one can train a standard model for the One Billion Word Benchmark [6] with a sigmoid hidden layer of size 256 and NCE in a few hours and achieve a perplexity of 127.7.
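To illustrate why Hierarchical Softmax reduces the per-word cost from linear to logarithmic in the vocabulary size, the following minimal Python sketch compares an exact softmax with a hierarchical softmax over a complete binary tree. It is not taken from the toolkit: the balanced tree layout, the heap-style node numbering, the power-of-two vocabulary size, and all function names (`full_softmax_prob`, `hs_prob`) are assumptions made for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_softmax_prob(hidden, output_vectors, word):
    """Exact softmax: one dot product per vocabulary word, O(|V|)."""
    scores = [dot(hidden, v) for v in output_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[word] / sum(exps)


def hs_prob(hidden, node_vectors, word, vocab_size):
    """Hierarchical softmax over a complete binary tree (assumed layout).

    Internal nodes are numbered 1..vocab_size-1 in heap order and the leaf
    of `word` is vocab_size + word; vocab_size is assumed to be a power of two.
    Returns the word probability and the number of internal nodes evaluated.
    """
    node = vocab_size + word
    prob = 1.0
    visited = 0
    while node > 1:
        parent = node // 2
        p_right = sigmoid(dot(hidden, node_vectors[parent]))
        # Odd-numbered nodes are right children, even-numbered are left.
        prob *= p_right if node % 2 == 1 else 1.0 - p_right
        node = parent
        visited += 1
    return prob, visited


random.seed(0)
vocab_size, hidden_size = 1024, 8
hidden = [random.gauss(0, 1) for _ in range(hidden_size)]
# Index 0 is unused; indices 1..vocab_size-1 hold the internal node vectors.
node_vectors = [[random.gauss(0, 1) for _ in range(hidden_size)]
                for _ in range(vocab_size)]

total = sum(hs_prob(hidden, node_vectors, w, vocab_size)[0]
            for w in range(vocab_size))
_, visited = hs_prob(hidden, node_vectors, 0, vocab_size)
softmax_total = sum(full_softmax_prob(hidden, node_vectors, w)
                    for w in range(vocab_size))
```

Under these assumptions, each word probability requires only log2(1024) = 10 dot products instead of 1024, while the probabilities over all words still sum to one. Practical implementations may use a different tree, for example one built from word frequencies.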
  1. Le, Q. V., Jaitly, N., & Hinton, G. E. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv preprint arXiv:1504.00941.
  2. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  3. Chen, X., Liu, X., Gales, M. J. F., & Woodland, P. C. (2015). Recurrent neural network language model training with noise contrastive estimation for speech recognition.
  4. Recht, B., Re, C., Wright, S., & Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (pp. 693-701).
  5. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  6. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.