Computer-Human
Authentication with Applications to AI (see
CAPTCHA.net, see HIPs)

Faculty: Lenore Blum, Manuel Blum, Avrim Blum;
students: Nick Hopper, Luis von Ahn, John Langford, Scott Crosby,
Brighten Godfrey; Bartosz Przydatek, Chuck Rosenberg, Ke Yang;
partners: Udi Manber (Yahoo); Doug Tygar, Richard Fateman (UC
Berkeley); Henry Baird (Xerox Parc)

This PROBE concerns the following general problem: Design a
protocol for convincing a computer that one is a member of a specific
class of humans --- the singleton "Joan Smith" class,
or the class of all humans, etc. --- in such a way that someone
outside the class, even with access to powerful computers and
numerous examples of successful authentication, cannot impersonate
a member of the class .

The "Joan Smith" class: Current authentication methods
for monetary transactions are notoriously weak: one gives one's
social security number, address, phone number or mother's maiden
name. This information is easily captured at public phone booths,
and in any case is available to the bank clerk or broker with
whom one deals. We seek a solution to the problem, "How can
a human in a glass house login to a computer when she has nothing
but a monitor and keyboard or a telephone and keypad only?"
Glass house means that anyone can observe her login any number
of times .

The class of all humans: Consider an online poll: Which is the
best computer science graduate school? Graduate students can (and
did!) write programs to vote thousands of times for their own
school. To prevent this, we need a method to tell whether it is
a script or a human that is voting. We need what we call a Completely
Automatic Public Turing Test to tell Computers and Humans Apart
(CAPTCHA). This is an algorithm for administering and grading
a test that distinguishes humans from computers. Such an algorithm
has many applications. Indeed, Udi Manber of Yahoo has arranged
to use our rudimentary prototypes to solve several such Internet
problems: 1) The chat room problem: how to keep (ro)bots out of
chatrooms. (2) The accounts problem: how to keep bots from signing
up for (numerous) accounts. (3) The spam mail problem: how to
keep bots from flooding the net with spam mail.

Solutions to the above problems require the collaboration of
people from widely different areas .

- Cognitive Psychology

- Computer Vision and Speech

- Artificial Intelligence. A protocol for the "class of
all humans" problem is a public challenge to the AI community:
"Write a program that can authenticate itself as human
to this protocol." We intend to challenge AI researchers
with welldefined problems that the AI community find interesting
and important .
- Computational Learning Theory. Learning Theory tells which
functions are learnable from examples and, crucially, which
are possibly not learnable. Our authentication protocols must
have the property that an eavesdropper with a powerful computer,
even after observation of many successful authentications, cannot
learn to impersonate "Joan Smith." -- Cryptography
- Theory of Computation. What are the right definitions and
models for this problem? Computer Science theory is ideally
suited to asking and answering this question

Web Algorithms for Information
Retrieval

John Lafferty, Danny Sleator; Cornell, Berkeley,
Yale; AltaVista, Yahoo, Verity

With the widespread use of the World Wide Web as a communication
medium, the last several years have seen a tremendous increase
in the volume of online data that is available .

This has increased the importance of algorithms for largescale
information retrieval (IR) and has also introduced new algorithmic
problem areas in IR. Researchers on theoretical aspects of algorithms
have recently obtained significant advances in Web algorithms,
but there is a large divide between this theory and what is effective
in practical information retrieval systems. In this PROBE we will
bring together researchers from both the theoretical and applied
communities in an attempt to bridge this divide .

Many recent theoretical advances have focused on two fundamental
tools: spectral analysis and dimensionality reduction using randomization.
For example, one of the betterknown algorithmic tools using spectral
analysis is the ``hubs and authorities'' method of Kleinberg,
which is based on the computation of the principal eigenvector
of a matrix encoding the hyperlink relations in a subset of the
Web that is assumed to be relevant to a query. While the method
is very intuitive and powerful for certain types of Web queries,
its usefulness for more complicated queries has not been clearly
demonstrated. The same is true for spectral methods based on latent
semantic analysis, using a rank approximation defined in terms
of the singular value decomposition of a word document matrix
.

Even when they are effective in improving the precision of search
algorithms, spectral methods often run into a fundamental limitation---they
may not scale to extremely large data collections. To address
this difficulty, dimensionality reduction using randomization
can often be used. Theoretical work using this technique is based
on a very powerful and non-intuitive fact: a set of n
points in Euclidean space can be mapped to O((epsilon to the power
-2)logn) more than a factor of (1 + or - epsilon) . This fact
has been used recently to obtain a wide range of new algorithms
for highdimensional data, from graph embedding to nearestneighbor
search. At the heart of the Johnson-Lindenstrauss lemma is the
use of projections onto random subspaces. Related sampling techniques
can also be used to make latent semantic analysis more practical
.

Such methods are extremely powerful, yet they have been infrequently
used in practice, if they have been used at all, as practitioners
are either unfamiliar or uncomfortable with the use of randomization
.

The PROBE will explore the true usefulness of these methods by
opening up communication between disparate communities of researchers,
bringing together theoreticians and researchers who have developed
practical IR systems, to share the intuitions and experience that
they have in their respective areas. An additional focus of the
probe will be on emerging applications of IR technology, and on
the characteristics of the underlying data. Much of the focus
of recent algorithmic work has been on static and "generic''
Web data in the form of hypertext. However, databases often change
rapidly, and even the type of information can be dynamic. It may
not be possible to preindex the data to extract the appropriate
fields. Much of the incipient information and data in postgenomic
biology will have these characteristics, as will the data in other
sciences and applications outside the commercial sector .

The PROBE will also address the metrics by which information
retrieval technology is evaluated. Evaluation methods have been
greatly advanced by NIST and DARPA's TREC (The Text Retrieval
Conference) While this has enabled scientific evaluation of competing
IR techniques, many practitioners question the usefulness of binary
relevance rankings in actual systems. Alternative methods will
be explored that might better reflect the use of information retrieval
in practice .

High Dimensional
Geometry in Astrophysics

Guy Blelloch, Bob Nichol (Astrophysics); Andrew
Connolly (Pitt), Piortr Indyk (MIT)

Astrophysics is an observational science driven by the discovery
of new objects. Therefore, a key tool for astrophysicists are
surveys of our Cosmos as a function of wavelength, resolution
and time. The next decade will witness the completion of several
new and massive surveys of the whole sky thus promoting a new
era of discovery in astrophysics. The technological challenge
posed by these new surveys is no longer building the hardware
(i.e. telescopes and instruments) as it has been in the past,
but the volume and complexity of the data they will collect. As
an example, the planned survey by the Panchromatic Optical Imager
(POI) on Mauna Kea in Hawaii will use an array of four 1.3 meter
telescopes to image the whole sky in just a few nights. The final
product of POI will be a catalog of billions of sources each with
thousands of measured attributes, e.g. the brightness and of objects
as a function of time .

Because of the size and complexity of the data, these massive
data sources present a new challenge to the scientists analyzing
them. Even the simple queries that are useful to the scientists,
such as nearestneighbor queries, cluster analysis, and anomaly
searches, are beyond the capabilities of the algorithms currently
used. To maximize the impact of these sky surveys, we will therefore
need to develop new algorithmic technology that handles thousands
of dimensions. Furthermore, because of the structure of the data,
these algorithms will often need to analyze the data with nonEuclidean
geometries .

The PROBE will therefore study geometric algorithms in highdimensional
spaces with the purpose of analyzing astrophysical data. Even
though in a very different domain, the techniques required for
this problem appear to have significant overlap with the techniques
discussed in the previous PROBE on Web Algorithms and Information
Retrieval. Both problems involve similar analyses on highdimensional
spaces, e.g. nearestneighbor queries, cluster analysis, and dimensionality
reduction. One area of exploration will therefore be to identify
the similarities and differences between the data and techniques
needed for these two domains .

Applying Metric
Embeddings to Biological Databases

Dannie Durand, Alan Frieze, R. Ravi, Martin Farach-Colton
(Rutgers); Celera Genomics

Understanding how the genes and gene products in an organism
interact to produce a functioning organism, is a major scientific
challenge introduced by the recent completion of the human genome
sequence. Computation will have a crucial role in this challenge,
elucidating the networks of interacting genes that effect cellular
processes. A major part of this role is to infer how genes and
proteins interact in cellular processes from comprehensive, functional
data sets resulting from highthroughput biological assays. Thus
the underlying computational problems involve pattern discovery
and clustering in multidimensional biological databases .

Currently, there are two major approaches to solving multidimensional
clustering problems. One approach postulates a statistical model
of clustering and the resulting methods are motivated by finding
the most likely clusters among the set of hypothesised models.
Examples include kmeans, neuralnet, Bayesian network, hidden Markov
model, singlelinkage, and hierarchical clustering methods. A second
approach defines a distance function (formally, a metric) between
all pairs of database points based on their corresponding tuple
values, and uses this metric function to guide clustering and
similarity searching. Methods resulting from this approach employ
dimensionalityreduction techniques, in addition to exploiting
structural properties of the underlying metric . While the first
approach has proved effective in analyzing expression data from
DNA arrays, the second has been used in applications with very
high dimensionality (e.g., in Information Retrieval and Astrophysics
discussed above) .

The incomparability of the distances along the different axes
in protein databases poses a major hurdle in applying these two
approaches. Indeed, each axis can be thought of defining a different
kind of distance function. Furthermore, the number of dimensions
in such protein databases is mediumsized (ten to hundred) rather
than very large. The new questions that must be addressed are
how to effectively reconcile the distance information represented
along the different axes and how to judge the relative importance
of each axis in the clustering process. Neither of these is addressed
by the two current approaches .

This PROBE will investigate a promising and largely unexplored
alternative approach, the application of the mature theory of
interreducibilities among finite metric spaces to this problem.
The main idea is to classify the coordinates of the database according
to the most natural type of metric applicable to them, and find
a simple (say, linear) combination of the metrics corresponding
to the coordinates that best explains a given clustering. This
combination can then be used to cluster new points or discover
further patterns. In this way, the problems resulting from our
approach translate naturally to problems of approximation, interreductions
and embeddings in finite metric spaces. We hope to draw on the
rich body of recent algorithmic work on the quality and efficiency
achievable in embedding the points of one finite metric space
into another, and the applications thereof .

This PROBE will bring together leaders in the areas of functional
genomics and pro teomics with algorithmists that have studied
metric embedding problems. The PROBE will focus on metricbased
formulations of important clustering problems for biological datasets
.

The PROBE will also define benchmarks for evaluating different
clustering algorithms for this data and use this to guide the
formulation of new objective functions .

Graph Cuts for
Vision and Data Analysis

Avrim Blum, Jianbo Shi, Takeo Kanade; Jon Kleinberg
(Cornell), Eva Tardos (Cornell), Ramin Zabih (Cornell)

Graph cuts and graph separators of various kinds form one of
the primary areas of Algorithms research. From the classic maxflow
mincut theorem, to modern fast algorithms for network flow, to
approximation algorithms for balanced separators and minratio
cuts, to spectral methods and geometric algorithms for nested
dissection, graph cuts are an ongoing major topic of algorithm
design.

Recently, notions of graph cuts have proven to have important
application to a number of key problems in computer vision and
related data analysis tasks. There is even an upcoming special
issue of one of the top Vision journals (IEEE Transactions on
PAMI) on "Graph Algorithms in Computer Vision", and
this topic has been the subject of several workshops .

However, it is still the case that many of the theoretical objective
functions studied and guarantees proven do not truly match what
is needed by the applications. This PROBE into graph cuts for
vision and data analysis will bring theoreticians and practitioners
together in a focused effort to attack key problems in vision,
in a way that leverages off a wealth of algorithmic knowledge
and intuition surrounding graph cuts and separators. It will address
questions such as: do the needs of vision applications demand
substantially new kinds of objective functions? Does the special
structure of vision applications allow for substantially faster
algorithms? Can spectral methods provide additional benefits in
the form of confidences or probabilities? Can hierarchical graph
structures successfully incorporate needs of multiplescale and
multiresolution vision applications? Can higher level goals of
image understanding and intelligent indexing be attacked by these
same techniques?

More Sample PROBEs:

Applying new analyses of heavytailed workloads
to the design of highperformance computing systems M.
Harchol-Balter and B. Maggs

New models of data that illustrate graph
separator structure with applications to scientific computing
G. Miller

Applying graph separator methods to solving
SAT problems arising in model checking
E. Clarke and G. Blelloch

Applying graph separators to graph compressions
G. Blelloch, J. Lafferty

Using programcheckers to check the results
of programs L. Blum, M. Blum, R. Harper