Carnegie Mellon University, Computer Science Department, School of Computer Science
Algorithms for Haplotype-Based Association Studies




Natalie Castellana
The ultimate goal of most work on human genome variation is to establish connections between genotype and phenotype. Given a set of sites that vary in the population, we wish to learn which particular sites appear to be associated with a given condition, typically a disease. These sites can then be used to identify genes involved in the disease, which may help us understand its causes or develop treatments. Finding these associations is most difficult, but also most important, for the "complex diseases," those caused by a combination of genetic and environmental factors. These diseases, which include heart disease, diabetes, cancers, and Alzheimer's disease, are the major causes of death in the developed world. We therefore need ways to identify the relatively faint signals connecting individual variant sites to these complex diseases amid the noise of environmental factors and other sites. Haplotypes make this problem easier by condensing the information content of the genome, which reduces the search space.

This project will explore the value of applying haplotype models to genetic association studies. The REU student's primary task will be conducting a comparison of single-SNP, haplotype block, and haplotype motif models in terms of their value to statistical association studies. This project will begin with generating datasets of human genetic variations by applying pre-existing simulation tools and by retrieving real data from public repositories, requiring the student to develop an understanding of models of haplotype structure and the current state of genome variation sequencing. It will then require implementing algorithms for haplotype structure inference by variations of the block and motif methods. This is expected to require some algorithmic innovations in improving on current methods for inferring haplotype structures by the motif model. The project will further require implementing several statistical tests for the single-SNP, block, and motif models to identify genetic variations correlated with phenotype as well as implementing benchmarks for evaluating the effectiveness of a given test. Finally, it will require extensive empirical studies of the conditions under which various models and metrics prove superior at detecting variants correlated with phenotype. Prerequisites for the project include knowledge of algorithm design and computer programming, some statistics, and familiarity with basic concepts from molecular and population genetics.
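As a rough illustration of the simplest of the three models, the following is a minimal sketch of a single-SNP association test, assuming each individual is represented by one 0/1 haplotype and a case/control label. It is not the project's method: the function names (chi_square_2x2, single_snp_scan, simulate), the toy simulator, and all of its parameters are hypothetical and chosen only for this example. Each site gets a Pearson chi-square test on a 2x2 table of allele counts by case status.

```python
import math
import random
from collections import Counter


def chi_square_2x2(table):
    """Pearson chi-square statistic and 1-df p-value for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    if 0 in row or 0 in col:
        # Monomorphic site or empty group: the table carries no evidence.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / (row[0] * row[1] * col[0] * col[1])
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def single_snp_scan(haplotypes, is_case):
    """Test each SNP column separately; haplotypes are equal-length 0/1 lists."""
    results = []
    for site in range(len(haplotypes[0])):
        counts = Counter(
            (case, hap[site]) for hap, case in zip(haplotypes, is_case)
        )
        table = [
            [counts[(True, 1)], counts[(True, 0)]],
            [counts[(False, 1)], counts[(False, 0)]],
        ]
        results.append(chi_square_2x2(table))
    return results


def simulate(n=400, n_sites=5, causal=2, seed=0):
    """Toy data (assumed model): independent SNPs, one of which raises risk."""
    rng = random.Random(seed)
    haplotypes, is_case = [], []
    for _ in range(n):
        hap = [int(rng.random() < 0.3) for _ in range(n_sites)]
        risk = 0.8 if hap[causal] else 0.2
        haplotypes.append(hap)
        is_case.append(rng.random() < risk)
    return haplotypes, is_case


if __name__ == "__main__":
    haps, labels = simulate()
    for site, (stat, p) in enumerate(single_snp_scan(haps, labels)):
        print(f"site {site}: chi2 = {stat:.2f}, p = {p:.3g}")
```

A block-based variant could, under the same assumptions, replace the per-site allele with the haplotype observed across a block of adjacent sites, which turns the 2x2 table into a k x 2 table with one row per distinct block haplotype. The toy simulator here draws sites independently, so it would not exercise that case realistically.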

Preliminary Presentation (ppt)
Final Presentation (ppt)


This material is based upon work supported by the National Science Foundation under Grant No. 0122581.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.