Frontiers in Computational Molecular Biology:
The Shape of Life Seminar
Combinatorial Approaches to Haplotyping
Dan Gusfield, Professor and Chairman
of the Computer Science Department, University of California,
Davis
Abstract
The next high-priority phase of human genomics will involve the
development of a full Haplotype Map of the human genome. It will
be used in large-scale screens of populations to associate specific
haplotypes with specific complex genetic-influenced diseases.
However, most studies will collect genotype data rather than haplotype
data, requiring the deduction of haplotype information from genotype
data. Input to the haplotyping problem is a set of N genotypes,
and output is N haplotype pairs that "explain" the genotypes.
Several computational approaches to this problem have been developed
and used. Most of these use a statistical framework, such as MLE,
or statistical computing methods such as EM, Gibbs sampling etc.
We have developed several distinct combinatorial approaches
to the haplotyping problem. I will talk mainly about the most
recent of these approaches, based on viewing the haplotyping problem
in the context of perfect phylogeny. The biological motivation
for this is the surprising fact that some genomic DNA can be partitioned
into long blocks where recombination has been rare, leading to
strikingly fewer distinct haplotypes in the population than previously
expected. This, along with assumption of infinite sites implies
that a "permitted" solution to the haplotyping problem
should fit (or almost fit) a perfect phylogeny. This is a severe
combinatorial constraint on the permitted solutions to the haplotyping
problem, and it leads to efficient deterministic algorithms (I
will talk about two of them) to deduce all features of the permitted
haplotype solution(s) that can be known with certainty. We obtain:
a) an efficient algorithm to find (if one exists) one permitted
solution to the haplotype problem;
b) a simple test to determine if it is the unique permitted solution;
c) an efficient way to count the number of permitted solutions;
d) and an efficient way to implicitly represent the set of all
permitted solutions so that each can be efficiently created.
As a by-product, we prove that the number of permitted solutions
is bounded by 2^k, where k is less than or equal to the number
of sites. This is a huge reduction from the number of solutions
if we do not require that a solution fit a perfect phylogeny.
This implicit representation also allows simple solutions to secondary
questions about the genotypes. We also observe a strong phase-transition
in the probability that a given number of genotypes will be sufficient
to uniquely determine the tree.
Finally, I will discuss recent results obtained trying to incorporate
constrained recombination into the underlying perfect phylogeny
model.
Papers and software can be obtained at
www.cs.ucdavis.edu/~gusfield
Parts of this work are joint with different collaborators, including
R.H. Chung, V. Bafna, G. Lancia, C. Langley, and S. Yooseph
Host: Russell Schwartz - Biological Sciences
Co-Sponsored by: Merck Computational Biology and Chemistry Program,
the ALADDIN Center and the SCS Dean's Office
For further seminar information: plp @ cs.cmu.edu
Short Biography
My background is in Combinatorial Optimization, and
various applications of combinatorial optimization. But in the
last 15 years I have mostly addressed problems in computational
biology. I first addressed questions about building evolutionary
trees, and then problems in sequence analysis. I presently focus
mostly on problems of population-scale genomics, of which haplotype
inference is one example. My book, "Algorithms on Strings,
Trees and Sequences: Computer Science and Computational Biology"
has helped to define the intersection of computer science and
bioinformatics. I serve on the editorial board of the Journal
of Computational Biology. At UCD I am chair of the Computer Science
Department, and wrote the bioinformatics part of the genomics
initiative proposal that resulted in the creation of the UCD Genomics
Center, and the current construction of the largest building on
campus. I continue to serve on the steering committee of the Genomics
Center. I also developed and continue to teach an undergraduate
course on the Theory and Practice of Bioinformatics taught mostly
to biology students, and a graduate course on Sequence Analysis
and Computational Biology, taught to Computer Science students