Theory/O.R. Schedules-Aladdin Center

Frontiers in Computational Molecular Biology: The Shape of Life Seminar

Combinatorial Approaches to Haplotyping
Dan Gusfield, Professor and Chairman of the Computer Science Department, University of California, Davis

Abstract

The next high-priority phase of human genomics will involve the development of a full Haplotype Map of the human genome. It will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. However, most studies will collect genotype data rather than haplotype data, requiring the deduction of haplotype information from genotype data. Input to the haplotyping problem is a set of N genotypes, and output is N haplotype pairs that "explain" the genotypes. Several computational approaches to this problem have been developed and used. Most of these use a statistical framework, such as MLE, or statistical computing methods such as EM, Gibbs sampling etc.

We have developed several distinct combinatorial approaches to the haplotyping problem. I will talk mainly about the most recent of these approaches, based on viewing the haplotyping problem in the context of perfect phylogeny. The biological motivation for this is the surprising fact that some genomic DNA can be partitioned into long blocks where recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected. This, along with assumption of infinite sites implies that a "permitted" solution to the haplotyping problem should fit (or almost fit) a perfect phylogeny. This is a severe combinatorial constraint on the permitted solutions to the haplotyping problem, and it leads to efficient deterministic algorithms (I will talk about two of them) to deduce all features of the permitted haplotype solution(s) that can be known with certainty. We obtain: a) an efficient algorithm to find (if one exists) one permitted solution to the haplotype problem;
b) a simple test to determine if it is the unique permitted solution;
c) an efficient way to count the number of permitted solutions;
d) and an efficient way to implicitly represent the set of all permitted solutions so that each can be efficiently created.

As a by-product, we prove that the number of permitted solutions is bounded by 2^k, where k is less than or equal to the number of sites. This is a huge reduction from the number of solutions if we do not require that a solution fit a perfect phylogeny. This implicit representation also allows simple solutions to secondary questions about the genotypes. We also observe a strong phase-transition in the probability that a given number of genotypes will be sufficient to uniquely determine the tree.

Finally, I will discuss recent results obtained trying to incorporate constrained recombination into the underlying perfect phylogeny model.

Papers and software can be obtained at www.cs.ucdavis.edu/~gusfield

Parts of this work are joint with different collaborators, including R.H. Chung, V. Bafna, G. Lancia, C. Langley, and S. Yooseph

Host: Russell Schwartz - Biological Sciences
Co-Sponsored by: Merck Computational Biology and Chemistry Program, the ALADDIN Center and the SCS Dean's Office
For further seminar information: plp @ cs.cmu.edu

Short Biography

My background is in Combinatorial Optimization, and various applications of combinatorial optimization. But in the last 15 years I have mostly addressed problems in computational biology. I first addressed questions about building evolutionary trees, and then problems in sequence analysis. I presently focus mostly on problems of population-scale genomics, of which haplotype inference is one example. My book, "Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology" has helped to define the intersection of computer science and bioinformatics. I serve on the editorial board of the Journal of Computational Biology. At UCD I am chair of the Computer Science Department, and wrote the bioinformatics part of the genomics initiative proposal that resulted in the creation of the UCD Genomics Center, and the current construction of the largest building on campus. I continue to serve on the steering committee of the Genomics Center. I also developed and continue to teach an undergraduate course on the Theory and Practice of Bioinformatics taught mostly to biology students, and a graduate course on Sequence Analysis and Computational Biology, taught to Computer Science students