On a Divide and Conquer Approach for Haplotype Inference with Pure Parsimony
The genotype of an individual consists of two DNA
strands, where each strand, called a haplotype, is inherited
from each parent. The haplotypes of an individual may differ
from those of his/her parents due to mutations or crossovers.
Those haplotypes and their differences are the genetic basis
for hereditary diseases such as Alzheimer's, diabetes, and heart
diseases. Determining the genotype of an individual in the
''wet-lab'' is relatively inexpensive. However, determining an
individual's haplotypes in the ''wet-lab'' is rather expensive
and time-consuming. Consequently, developing computational
methods for inferring haplotypes from genotypes is an important
problem. Since this haplotype inference problem can have
multiple solutions, solutions with as few haplotypes as possible
are often sought, leading to the Haplotype Inference problem
with Pure Parsimony (HIPP).
We present a divide-and-conquer approach for HIPP problems
with large number of individuals and long genotypes. We
explore various design parameters of the divide-and-conquer
approach, and experimentally compare its performance, with
respect to the number of haplotypes (k) found and the running
times, with that of Clark's rule on both synthetic and real
datasets. We find that our approach uses up to 43% less
haplotypes than Clark's rule for synthetic datasets with 100
genotypes of length 100, and that our approach and Clark's
rule use similar number of haplotypes for the real datasets. Our
divide-and-conquer approach is an effective and efficient method
for large HIPP problems.
Index Terms
Bioinformatics, haplotype inference from genotypes, pure parsimony.