A Multi-pronged Parallel Approach to Enhance Speed and Accuracy of Sequence Assembly

Munib Ahmed, Mohammad Saad Ahmad and Ishfaq Ahmad



Genome assembly is one of the most computationally complex processes in the field of bioinformatics. This complexity stems from the requirement that a very large number of sequenced segments, each 700-1000 bases long, be put together to reconstruct the full genome, which, depending upon the species, can exceed several billion bases in length. Moreover, due to the limitations of the laboratory procedures used during sequencing, a full reconstruction with 100% accuracy is practically infeasible. One of the early and critical phases of assembly is the detection of overlaps among segments. In this work, we propose a holistic approach that simultaneously enhances execution speed and improves accuracy by applying error-correction logic in a high-performance computing environment. By leveraging the extra processing power available in a parallel computing environment, we attempt to correct errors in the weaker end regions of the segments rather than trimming them off, thereby enhancing the accuracy of the solution. Speed is improved by dynamically balancing the load among multiple processors and by using innovative data structures along with a hashing technique that requires less memory than some other programs.

Index Terms: Accuracy, genome assembly, parallel processing.
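
As a rough illustration of the overlap-detection phase mentioned above, the following minimal sketch uses a hash table of k-mers (substrings of length k) to find candidate overlapping segments, then checks each candidate for a suffix-prefix match. This is a generic, textbook-style technique chosen for illustration and is not necessarily the hashing scheme or data structure used in this work; the function names, the parameter `k`, and the parameter `min_len` are all assumptions.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the (read_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((read_id, offset))
    return index


def candidate_overlaps(reads, k):
    """Return the set of read-id pairs (a, b), a < b, sharing at least one k-mer."""
    index = build_kmer_index(reads, k)
    pairs = set()
    for hits in index.values():
        ids = sorted({read_id for read_id, _ in hits})
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs.add((a, b))
    return pairs


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Only exact matches of at least min_len bases count; returns 0 otherwise.
    (Hypothetical helper: no error tolerance is modeled here.)
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


if __name__ == "__main__":
    # Toy reads, far shorter than the 700-1000 base segments discussed above.
    reads = ["ACGTACGGA", "ACGGATTCA", "TTTTTTTTT"]
    for a, b in candidate_overlaps(reads, k=4):
        print(a, b, suffix_prefix_overlap(reads[a], reads[b], min_len=4))
```

Hashing k-mers restricts the expensive pairwise comparison to segments that already share a substring, which is the usual motivation for such an index. This sketch matches exactly and does not attempt the error correction of end regions or the load balancing described in the abstract.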