Approximating String Compressibility and the Distribution Support
Size
Sofya Raskhodnikova, Penn State
January 25, 2008, 3:30PM, Wean 7220
Abstract:
Imagine having to choose between a few compression schemes to compress
a very long file. Before deciding on the scheme, you might want to obtain
a rough estimate of how well each scheme performs on your file. Another
motivating scenario comes from areas, such as data-mining, natural
language processing and genomics, where compression schemes are used as
tools
for measuring properties of strings such as similarity and entropy. In
these applications, one typically needs only the length of the compressed
version of a file, not the output itself.
In this talk, we consider the question of approximating
compressibility of a string with respect to a fixed compression
scheme, in sublinear time. We will concentrate on the run-length
encoding and a version of Lempel-Ziv as our example compression
schemes. We present algorithms and lower bounds for approximating
compressibility with respect to both schemes.
The second problem we consider is approximating the support size of a
distribution from a small number of samples, when each element in the
distribution's
support appears with probability at least 1/n. Our motivation is tight
relationship between this problem and estimating compressibility with
respect to Lempel-Ziv. However, estimating distribution support size
is also considered in other areas of computer science along with an
important variant, estimating the number of distinct elements in a
sequence of length n.
For all three problems (distribution support, distinct elements and
Lempel-Ziv compressibility), we prove a nearly linear (in n) lower
bound on the query complexity, applicable even for approximation with
additive error. At the heart of the lower bound is a construction of
two positive integer random variables, X and Y, with very different
expectations and the following condition on the first k moments:
E[X]/E[Y] = E[X2]/E{Y2] = ... = E[X^k]/E[Y^k].
Our lower bound method is also applicable to other problems. In
particular, it gives a new lower bound for the sample complexity of
approximating
the entropy of a distribution.
Based on joint work with Dana Ron, Ronitt Rubinfeld and Adam Smith,
published in RANDOM 2007 and joint work with Dana Ron, Amir Shpilka
and Adam Smith, published in FOCS 2007