## Summarizing Massive Data Sets via Large-Scale Submodular Maximization

**Prof. Andreas Krause**

*Switzerland, ETH Zürich*

How can one summarize a massive data set without being able to fit it

in the main memory of a single computer? I will present recent work on extracting representative elements from a large set of data. In particular,

I will discuss selecting a subset of, say, k points from data arriving as

a stream or spread across a cluster, which are most representative according to some objective function. Many natural notions of “representativeness” satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches

to submodular maximization require centralized random access to the data. I will present new algorithms with strong guarantees both for the streaming and the parallel optimization setting. I will further present experimental results on the effectiveness of our algorithms on several applications, including training large-scale kernel methods and exemplar-based clustering.

in the main memory of a single computer? I will present recent work on extracting representative elements from a large set of data. In particular,

I will discuss selecting a subset of, say, k points from data arriving as

a stream or spread across a cluster, which are most representative according to some objective function. Many natural notions of “representativeness” satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches

to submodular maximization require centralized random access to the data. I will present new algorithms with strong guarantees both for the streaming and the parallel optimization setting. I will further present experimental results on the effectiveness of our algorithms on several applications, including training large-scale kernel methods and exemplar-based clustering.