Denis Elshin (speaker), Daria Gubar
Every day millions of users turn to Yandex web search to follow recent news, watch their favorite TV shows, find information for school projects, and for thousands of other reasons. Being able to retrieve the most popular subjects of users’ interest is extremely tempting: such information helps the Yandex PR team organize local campaigns based on the interest trends of a certain region; it provides content for the mass media about the popularity of celebrities and events at a certain time; on top of that, trending topics may help identify web search areas where performance enhancement will result in growth of user numbers.
However, the only information such a tool can be based on is a set of user search queries. Single queries, even popular ones, are not representative enough for statistical analysis: one subject may be spread over a large number of reformulations (like the Ice Hockey World Championship, with queries about match standings for different countries), while another is captured by a single essential keyword (like Eurovision). Hence, to achieve the desired result, one should build a framework that groups search queries into topics by their meaning. This is a popular area in machine learning, known as text clustering.
Text clustering is a challenging problem due to its high dimensionality (texts are represented as vectors the size of the vocabulary) and to features of the language (synonymy) or of the corpus (stop words). In this paper, we demonstrate how to cope with these difficulties using an alternative view of a standard topic modeling approach, Latent Dirichlet Allocation (LDA). We demonstrate the relation of LDA inference to entropy minimization and provide a general framework for model extension using this connection. With the help of one specific case of the described LDA extension, we amend the standard short text clustering procedure, inferring a set of topics that are far enough from each other (in terms of distribution distance) but contain words that frequently appear together; clusters are then built based on the extracted topics.
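To make the dimensionality issue concrete, the following minimal Python sketch represents a few short queries as bag-of-words vectors and computes the Shannon entropy of a word distribution. The queries are hypothetical toy data, and the helper names (to_vector, normalize, entropy) are illustrative assumptions; the sketch does not reproduce the LDA extension or the entropy-minimization derivation presented in the talk.

```python
import math
from collections import Counter

# Hypothetical toy queries (not taken from real search logs).
queries = [
    "ice hockey world championship standings",
    "hockey world championship russia finland score",
    "eurovision",
    "eurovision winner",
]

# The vocabulary is every distinct word in the corpus.
vocabulary = sorted({word for q in queries for word in q.split()})
index = {word: i for i, word in enumerate(vocabulary)}


def to_vector(query):
    """Bag-of-words count vector with one slot per vocabulary word."""
    vec = [0] * len(vocabulary)
    for word, count in Counter(query.split()).items():
        vec[index[word]] = count
    return vec


def normalize(vec):
    """Turn a non-zero count vector into a probability distribution."""
    total = sum(vec)
    return [v / total for v in vec]


def entropy(dist):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


vectors = [to_vector(q) for q in queries]

for query, vec in zip(queries, vectors):
    nonzero = sum(1 for v in vec if v)
    print(f"{query!r}: {nonzero} of {len(vec)} dimensions are non-zero, "
          f"entropy = {entropy(normalize(vec)):.2f} bits")
```

Even on this tiny corpus every query vector is mostly zeros, and the effect grows with a realistic vocabulary. A distribution concentrated on one word has zero entropy, while one spread evenly over many words has high entropy, which is the kind of quantity an entropy-based view of topic inference works with.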