Additive Regularization of Topic Models: Towards Exploratory Search and Other Multi-Criteria Applications

Prof. Konstantin Vorontsov
Yandex, Russia
Topic modeling is a powerful tool for revealing the underlying thematic structure of large text collections. Much work has been done on enriching probabilistic generative models of text in order to obtain more precise and linguistically motivated topic models. However, abandoning the bag-of-words assumption usually leads to complicated and memory-inefficient models that cannot be freely combined with other topic model extensions.
To address this issue we are developing additive regularization of topic models (ARTM), a non-Bayesian approach based on introducing an additional penalty term for each requirement and then solving a multi-criteria optimization problem. We propose a set of regularization criteria that separate topics into two subsets: domain-specific topics containing the lexical kernels of particular domain areas, and background topics containing common-lexis words.
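For concreteness, the ARTM criterion can be written in the standard ARTM notation (which the abstract itself does not spell out) as the collection log-likelihood plus a weighted sum of regularizers, where Phi and Theta are the word-topic and topic-document matrices, n_dw is the count of word w in document d, and the tau_i are regularization coefficients:

    \sum_{d \in D} \sum_{w \in d} n_{dw}
        \ln \sum_{t \in T} \varphi_{wt}\, \theta_{td}
      \;+\; \sum_{i} \tau_i\, R_i(\Phi, \Theta)
      \;\to\; \max_{\Phi,\, \Theta}

Each requirement, for example sparsing domain-specific topics or smoothing background topics, contributes one term R_i, so criteria can be mixed freely by choosing the coefficients tau_i.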
Until now, the ARTM approach has also been restricted by the bag-of-words assumption. Our work focuses on overcoming this limitation and exploiting the sequential order of tokens to further improve additively regularized topic models. The key idea is to model the gradual topic transitions of natural language discourse. To this end, we treat the topic probabilities of successive tokens of a text as a time series and smooth them based on the terminology occurring in a sliding window (a sketch of this step follows below). Preferring similar topic distributions for adjacent words corresponds to the notion of topic coherence, a common interpretability measure in topic modeling that has been shown to correlate well with human judgment.
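As an illustration only, the sketch below treats the per-token topic distributions of one document as a sequence and averages each with its neighbours in a fixed window; the helper name smooth_token_topics and the window parameter are hypothetical and not part of any released API.

    import numpy as np

    def smooth_token_topics(token_topics, window=5):
        """Smooth a sequence of per-token topic distributions.

        token_topics: array of shape (n_tokens, n_topics); row i is the
        topic distribution p(t | token i). Hypothetical helper that only
        illustrates the sliding-window idea, not the BigARTM API.
        """
        n_tokens = token_topics.shape[0]
        half = window // 2
        smoothed = np.empty_like(token_topics)
        for i in range(n_tokens):
            # average over the window clipped to the document boundaries
            lo, hi = max(0, i - half), min(n_tokens, i + half + 1)
            smoothed[i] = token_topics[lo:hi].mean(axis=0)
        # renormalize rows to guard against numerical drift
        smoothed /= smoothed.sum(axis=1, keepdims=True)
        return smoothed

Averaging adjacent distributions directly encodes the preference for gradual topic transitions: sharp topic switches between neighbouring tokens are damped, while sustained topics within a window are reinforced.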
We contribute our linguistically motivated regularizers to BigARTM.org, an open-source project for online parallel topic modeling of large text collections. Results are obtained on real collections of scientific papers and social network data.
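For readers who want to experiment, a minimal session with BigARTM's Python interface might look as follows; the data path, collection name, topic count, and regularization coefficients are placeholders chosen only for illustration.

    import artm  # BigARTM Python interface, see bigartm.org

    # Vectorize a collection stored in UCI bag-of-words format
    # (data_path and collection_name are placeholders).
    batch_vectorizer = artm.BatchVectorizer(
        data_path='data/', data_format='bow_uci', collection_name='papers')

    model = artm.ARTM(num_topics=50,
                      dictionary=batch_vectorizer.dictionary)

    # Additive regularization: each requirement adds one penalty term.
    # A negative tau on SmoothSparsePhi sparsifies domain-specific topics;
    # the decorrelator pushes topics apart (coefficients are illustrative).
    model.regularizers.add(
        artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1))
    model.regularizers.add(
        artm.DecorrelatorPhiRegularizer(name='decorr_phi', tau=1.5e5))

    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=20)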