Summarizing Massive Data Sets via Large-Scale Submodular Maximization

Prof. Andreas Krause
ETH Zürich, Switzerland
How can one summarize a massive data set without being able to fit it
in the main memory of a single computer? I will present recent work on extracting representative elements from a large set of data. In particular,
I will discuss selecting, from data arriving as a stream or spread across a cluster, a subset of, say, k points that is most representative according to some objective function. Many natural notions of “representativeness” satisfy submodularity, an intuitive formalization of diminishing returns. Such problems can thus be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches
to submodular maximization require centralized random access to the data. I will present new algorithms with strong guarantees for both the streaming and the parallel optimization settings. I will further present experimental results on the effectiveness of our algorithms on several applications, including training large-scale kernel methods and exemplar-based clustering.
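To make the reduction concrete, here is a minimal sketch (not code from the talk) of the classical greedy algorithm for cardinality-constrained monotone submodular maximization, using set coverage as the submodular objective; the function names and the toy data are illustrative assumptions.

```python
def coverage(subset, sets):
    """Submodular coverage objective: number of distinct items
    covered by the chosen sets (indices into `sets`)."""
    covered = set()
    for i in subset:
        covered |= sets[i]
    return len(covered)

def greedy_max(sets, k):
    """Classical greedy: repeatedly add the element with the largest
    marginal gain. For a monotone submodular objective this achieves
    a (1 - 1/e) approximation to the optimal size-k subset."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1
        for i in range(len(sets)):
            if i in chosen:
                continue
            gain = coverage(chosen + [i], sets) - coverage(chosen, sets)
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
    return chosen

# Toy example: pick k = 2 of 4 sets to cover the most items.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
print(greedy_max(sets, 2))  # → [0, 2], covering 6 items
```

Note that this greedy procedure needs random access to the full data in every round, which is exactly the limitation the abstract points to.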