Mergeable Summaries, tutorial
In dealing with massive distributed data, exact computations may be possible but can still be very expensive to perform. Instead, it is often desirable to look to more lightweight computations that give (guaranteed) approximations. A central approach is the idea of the mergeable summary: a compact data structure that summarizes a data set, which can be merged with others to summarize the union of the data without significant growth in size. This framework enables parallel computation. Samples and sketches are two examples of well-known, effective mergeable summaries. This talk presents recent results on new, compact mergeable summaries for various tasks, such as approximating frequent items, order statistics, and range counts.
Sketch Data Structures and Concentration Bounds
One natural way to deal with the challenge of Big Data is to make the data smaller. That is, to seek a compact (sublinear) representation of the data so that certain properties are (approximately) preserved. An important class of summarization algorithms is sketches: carefully designed random projections of the input data that can be computed efficiently under the constraints of the streaming model. These have the attractive property that they can be easily computed in parallel over partitions of the input. They aim at optimizing a variety of properties: the size of the summary; the time required to compute the summary; the number of 'true' random bits required; and the accuracy guarantees that result. This tutorial explains sketches from first principles, and introduces the concentration bounds (Markov inequality, Chebyshev inequality and Chernoff-Hoeffding bounds) used to prove their properties.