CENTER Carnegie Mellon UniversityCarnegie Mellon Computer Science DepartmentSchool of Computer Science
Sample PROBEs
Related Activities
Outreach Roadshow

Computer-Human Authentication with Applications to AI (see, see HIPs)

Faculty: Lenore Blum, Manuel Blum, Avrim Blum; students: Nick Hopper, Luis von Ahn, John Langford, Scott Crosby, Brighten Godfrey; Bartosz Przydatek, Chuck Rosenberg, Ke Yang; partners: Udi Manber (Yahoo); Doug Tygar, Richard Fateman (UC Berkeley); Henry Baird (Xerox Parc)

This PROBE concerns the following general problem: Design a protocol for convincing a computer that one is a member of a specific class of humans --- the singleton "Joan Smith" class, or the class of all humans, etc. --- in such a way that someone outside the class, even with access to powerful computers and numerous examples of successful authentication, cannot impersonate a member of the class .

The "Joan Smith" class: Current authentication methods for monetary transactions are notoriously weak: one gives one's social security number, address, phone number or mother's maiden name. This information is easily captured at public phone booths, and in any case is available to the bank clerk or broker with whom one deals. We seek a solution to the problem, "How can a human in a glass house login to a computer when she has nothing but a monitor and keyboard or a telephone and keypad only?" Glass house means that anyone can observe her login any number of times .

The class of all humans: Consider an online poll: Which is the best computer science graduate school? Graduate students can (and did!) write programs to vote thousands of times for their own school. To prevent this, we need a method to tell whether it is a script or a human that is voting. We need what we call a Completely Automatic Public Turing Test to tell Computers and Humans Apart (CAPTCHA). This is an algorithm for administering and grading a test that distinguishes humans from computers. Such an algorithm has many applications. Indeed, Udi Manber of Yahoo has arranged to use our rudimentary prototypes to solve several such Internet problems: 1) The chat room problem: how to keep (ro)bots out of chatrooms. (2) The accounts problem: how to keep bots from signing up for (numerous) accounts. (3) The spam mail problem: how to keep bots from flooding the net with spam mail.

Solutions to the above problems require the collaboration of people from widely different areas .

  • Cognitive Psychology
  • Computer Vision and Speech
  • Artificial Intelligence. A protocol for the "class of all humans" problem is a public challenge to the AI community: "Write a program that can authenticate itself as human to this protocol." We intend to challenge AI researchers with welldefined problems that the AI community find interesting and important .
  • Computational Learning Theory. Learning Theory tells which functions are learnable from examples and, crucially, which are possibly not learnable. Our authentication protocols must have the property that an eavesdropper with a powerful computer, even after observation of many successful authentications, cannot learn to impersonate "Joan Smith." -- Cryptography
  • Theory of Computation. What are the right definitions and models for this problem? Computer Science theory is ideally suited to asking and answering this question

Web Algorithms for Information Retrieval

John Lafferty, Danny Sleator; Cornell, Berkeley, Yale; AltaVista, Yahoo, Verity

With the widespread use of the World Wide Web as a communication medium, the last several years have seen a tremendous increase in the volume of online data that is available .

This has increased the importance of algorithms for largescale information retrieval (IR) and has also introduced new algorithmic problem areas in IR. Researchers on theoretical aspects of algorithms have recently obtained significant advances in Web algorithms, but there is a large divide between this theory and what is effective in practical information retrieval systems. In this PROBE we will bring together researchers from both the theoretical and applied communities in an attempt to bridge this divide .

Many recent theoretical advances have focused on two fundamental tools: spectral analysis and dimensionality reduction using randomization. For example, one of the betterknown algorithmic tools using spectral analysis is the ``hubs and authorities'' method of Kleinberg, which is based on the computation of the principal eigenvector of a matrix encoding the hyperlink relations in a subset of the Web that is assumed to be relevant to a query. While the method is very intuitive and powerful for certain types of Web queries, its usefulness for more complicated queries has not been clearly demonstrated. The same is true for spectral methods based on latent semantic analysis, using a rank approximation defined in terms of the singular value decomposition of a word document matrix .

Even when they are effective in improving the precision of search algorithms, spectral methods often run into a fundamental limitation---they may not scale to extremely large data collections. To address this difficulty, dimensionality reduction using randomization can often be used. Theoretical work using this technique is based on a very powerful and non-intuitive fact: a set of n points in Euclidean space can be mapped to O((epsilon to the power -2)logn) more than a factor of (1 + or - epsilon) . This fact has been used recently to obtain a wide range of new algorithms for highdimensional data, from graph embedding to nearestneighbor search. At the heart of the Johnson-Lindenstrauss lemma is the use of projections onto random subspaces. Related sampling techniques can also be used to make latent semantic analysis more practical .
Such methods are extremely powerful, yet they have been infrequently used in practice, if they have been used at all, as practitioners are either unfamiliar or uncomfortable with the use of randomization .

The PROBE will explore the true usefulness of these methods by opening up communication between disparate communities of researchers, bringing together theoreticians and researchers who have developed practical IR systems, to share the intuitions and experience that they have in their respective areas. An additional focus of the probe will be on emerging applications of IR technology, and on the characteristics of the underlying data. Much of the focus of recent algorithmic work has been on static and "generic'' Web data in the form of hypertext. However, databases often change rapidly, and even the type of information can be dynamic. It may not be possible to preindex the data to extract the appropriate fields. Much of the incipient information and data in postgenomic biology will have these characteristics, as will the data in other sciences and applications outside the commercial sector .

The PROBE will also address the metrics by which information retrieval technology is evaluated. Evaluation methods have been greatly advanced by NIST and DARPA's TREC (The Text Retrieval Conference) While this has enabled scientific evaluation of competing IR techniques, many practitioners question the usefulness of binary relevance rankings in actual systems. Alternative methods will be explored that might better reflect the use of information retrieval in practice .

High Dimensional Geometry in Astrophysics

Guy Blelloch, Bob Nichol (Astrophysics); Andrew Connolly (Pitt), Piortr Indyk (MIT)

Astrophysics is an observational science driven by the discovery of new objects. Therefore, a key tool for astrophysicists are surveys of our Cosmos as a function of wavelength, resolution and time. The next decade will witness the completion of several new and massive surveys of the whole sky thus promoting a new era of discovery in astrophysics. The technological challenge posed by these new surveys is no longer building the hardware (i.e. telescopes and instruments) as it has been in the past, but the volume and complexity of the data they will collect. As an example, the planned survey by the Panchromatic Optical Imager (POI) on Mauna Kea in Hawaii will use an array of four 1.3 meter telescopes to image the whole sky in just a few nights. The final product of POI will be a catalog of billions of sources each with thousands of measured attributes, e.g. the brightness and of objects as a function of time .

Because of the size and complexity of the data, these massive data sources present a new challenge to the scientists analyzing them. Even the simple queries that are useful to the scientists, such as nearestneighbor queries, cluster analysis, and anomaly searches, are beyond the capabilities of the algorithms currently used. To maximize the impact of these sky surveys, we will therefore need to develop new algorithmic technology that handles thousands of dimensions. Furthermore, because of the structure of the data, these algorithms will often need to analyze the data with nonEuclidean geometries .

The PROBE will therefore study geometric algorithms in highdimensional spaces with the purpose of analyzing astrophysical data. Even though in a very different domain, the techniques required for this problem appear to have significant overlap with the techniques discussed in the previous PROBE on Web Algorithms and Information Retrieval. Both problems involve similar analyses on highdimensional spaces, e.g. nearestneighbor queries, cluster analysis, and dimensionality reduction. One area of exploration will therefore be to identify the similarities and differences between the data and techniques needed for these two domains .

Applying Metric Embeddings to Biological Databases

Dannie Durand, Alan Frieze, R. Ravi, Martin Farach-Colton (Rutgers); Celera Genomics

Understanding how the genes and gene products in an organism interact to produce a functioning organism, is a major scientific challenge introduced by the recent completion of the human genome sequence. Computation will have a crucial role in this challenge, elucidating the networks of interacting genes that effect cellular processes. A major part of this role is to infer how genes and proteins interact in cellular processes from comprehensive, functional data sets resulting from highthroughput biological assays. Thus the underlying computational problems involve pattern discovery and clustering in multidimensional biological databases .

Currently, there are two major approaches to solving multidimensional clustering problems. One approach postulates a statistical model of clustering and the resulting methods are motivated by finding the most likely clusters among the set of hypothesised models. Examples include kmeans, neuralnet, Bayesian network, hidden Markov model, singlelinkage, and hierarchical clustering methods. A second approach defines a distance function (formally, a metric) between all pairs of database points based on their corresponding tuple values, and uses this metric function to guide clustering and similarity searching. Methods resulting from this approach employ dimensionalityreduction techniques, in addition to exploiting structural properties of the underlying metric . While the first approach has proved effective in analyzing expression data from DNA arrays, the second has been used in applications with very high dimensionality (e.g., in Information Retrieval and Astrophysics discussed above) .

The incomparability of the distances along the different axes in protein databases poses a major hurdle in applying these two approaches. Indeed, each axis can be thought of defining a different kind of distance function. Furthermore, the number of dimensions in such protein databases is mediumsized (ten to hundred) rather than very large. The new questions that must be addressed are how to effectively reconcile the distance information represented along the different axes and how to judge the relative importance of each axis in the clustering process. Neither of these is addressed by the two current approaches .

This PROBE will investigate a promising and largely unexplored alternative approach, the application of the mature theory of interreducibilities among finite metric spaces to this problem. The main idea is to classify the coordinates of the database according to the most natural type of metric applicable to them, and find a simple (say, linear) combination of the metrics corresponding to the coordinates that best explains a given clustering. This combination can then be used to cluster new points or discover further patterns. In this way, the problems resulting from our approach translate naturally to problems of approximation, interreductions and embeddings in finite metric spaces. We hope to draw on the rich body of recent algorithmic work on the quality and efficiency achievable in embedding the points of one finite metric space into another, and the applications thereof .

This PROBE will bring together leaders in the areas of functional genomics and pro teomics with algorithmists that have studied metric embedding problems. The PROBE will focus on metricbased formulations of important clustering problems for biological datasets .
The PROBE will also define benchmarks for evaluating different clustering algorithms for this data and use this to guide the formulation of new objective functions .

Graph Cuts for Vision and Data Analysis

Avrim Blum, Jianbo Shi, Takeo Kanade; Jon Kleinberg (Cornell), Eva Tardos (Cornell), Ramin Zabih (Cornell)

Graph cuts and graph separators of various kinds form one of the primary areas of Algorithms research. From the classic maxflow mincut theorem, to modern fast algorithms for network flow, to approximation algorithms for balanced separators and minratio cuts, to spectral methods and geometric algorithms for nested dissection, graph cuts are an ongoing major topic of algorithm design.

Recently, notions of graph cuts have proven to have important application to a number of key problems in computer vision and related data analysis tasks. There is even an upcoming special issue of one of the top Vision journals (IEEE Transactions on PAMI) on "Graph Algorithms in Computer Vision", and this topic has been the subject of several workshops .
However, it is still the case that many of the theoretical objective functions studied and guarantees proven do not truly match what is needed by the applications. This PROBE into graph cuts for vision and data analysis will bring theoreticians and practitioners together in a focused effort to attack key problems in vision, in a way that leverages off a wealth of algorithmic knowledge and intuition surrounding graph cuts and separators. It will address questions such as: do the needs of vision applications demand substantially new kinds of objective functions? Does the special structure of vision applications allow for substantially faster algorithms? Can spectral methods provide additional benefits in the form of confidences or probabilities? Can hierarchical graph structures successfully incorporate needs of multiplescale and multiresolution vision applications? Can higher level goals of image understanding and intelligent indexing be attacked by these same techniques?

More Sample PROBEs:

Applying new analyses of heavytailed workloads to the design of highperformance computing systems M. Harchol-Balter and B. Maggs

New models of data that illustrate graph separator structure with applications to scientific computing G. Miller

Applying graph separator methods to solving SAT problems arising in model checking E. Clarke and G. Blelloch

Applying graph separators to graph compressions G. Blelloch, J. Lafferty

Using programcheckers to check the results of programs L. Blum, M. Blum, R. Harper


This material is based upon work supported by National Science Foundation under Grant No. 0122581.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the
National Science Foundation