On a Divide and Conquer Approach for Haplotype Inference with Pure Parsimony

Konstantinos Kalpakis and Parag Namjoshi



The genotype of an individual consists of two DNA strands, where each strand, called a haplotype, is inherited from each parent. The haplotypes of an individual may differ from those of his/her parents due to mutations or crossovers. Those haplotypes and their differences are the genetic basis for hereditary diseases such as Alzheimer's, diabetes, and heart diseases. Determining the genotype of an individual in the ''wet-lab'' is relatively inexpensive. However, determining an individual's haplotypes in the ''wet-lab'' is rather expensive and time-consuming. Consequently, developing computational methods for inferring haplotypes from genotypes is an important problem. Since this haplotype inference problem can have multiple solutions, solutions with as few haplotypes as possible are often sought, leading to the Haplotype Inference problem with Pure Parsimony (HIPP). We present a divide-and-conquer approach for HIPP problems with large number of individuals and long genotypes. We explore various design parameters of the divide-and-conquer approach, and experimentally compare its performance, with respect to the number of haplotypes (k) found and the running times, with that of Clark's rule on both synthetic and real datasets. We find that our approach uses up to 43% less haplotypes than Clark's rule for synthetic datasets with 100 genotypes of length 100, and that our approach and Clark's rule use similar number of haplotypes for the real datasets. Our divide-and-conquer approach is an effective and efficient method for large HIPP problems.

Index Terms Bioinformatics, haplotype inference from genotypes, pure parsimony.