Mining Cytochrome b561 from Plant Genomes

Stephen O. Opiyo and Etsuko N. Moriyama



Cytochrome b561 (Cyt-b561) proteins perform important functions in plants, including anti-toxin defense reactions, growth and development, and protection against damage from excess light under drought conditions. Because of their high sequence divergence, thorough mining of Cyt-b561 and related proteins from diverse plant genomes is not easy. For example, only one Cyt-b561 gene is currently known in the maize genome and none has been found in the soybean genome, whereas twenty-two are known in the Arabidopsis thaliana genome. Alignment-free methods for protein classification, e.g., multivariate statistical analysis methods that use various amino acid properties as sequence descriptors, can be more sensitive than commonly used alignment-based methods at identifying remotely similar proteins. To identify Cyt-b561 proteins thoroughly from available plant genomes, we examined alignment-free protein classifiers based on partial least squares (PLS) and support vector machines. These classifiers performed better than profile hidden Markov models and PSI-BLAST in identifying Cyt-b561 related proteins. Furthermore, PLS with a reduced number of descriptors performed the best among all the alignment-based and alignment-free classifiers we tested. This classifier had the highest accuracy (96.2%) and the lowest false negative rate (3.0%), and should be useful for mining Cyt-b561 related proteins from diverse plant genomes.
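
As a rough illustration of the alignment-free idea described above, the following Python sketch turns a protein sequence of any length into a fixed-length descriptor vector, which is the kind of input a PLS or support vector machine classifier could take without any alignment step. The descriptor set used here (20 amino acid frequencies plus mean Kyte-Doolittle hydropathy) is an assumption chosen for illustration and is not the descriptor set used in this study. The function names are hypothetical, and the accuracy and false negative rate follow common definitions that may differ from those used in the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index; used only as one example amino acid
# property (an assumption, not the paper's descriptor set).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def describe(seq):
    """Return a 21-element descriptor vector for a protein sequence.

    Elements 0-19 are amino acid frequencies (in AMINO_ACIDS order);
    element 20 is the mean hydropathy. Non-standard residues are ignored.
    """
    residues = [aa for aa in seq.upper() if aa in KD_HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    composition = [counts[aa] / total for aa in AMINO_ACIDS]
    mean_hydropathy = sum(KD_HYDROPATHY[aa] for aa in residues) / total
    return composition + [mean_hydropathy]


def accuracy_and_fnr(true_labels, predicted_labels):
    """Return (accuracy, false negative rate) for binary labels (1 = positive).

    accuracy = (TP + TN) / total; false negative rate = FN / (TP + FN).
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if not pairs or tp + fn == 0:
        raise ValueError("need at least one example and one positive label")
    return (tp + tn) / len(pairs), fn / (tp + fn)


if __name__ == "__main__":
    # Toy sequences of different lengths map to vectors of equal length.
    for toy in ("MAVLLGAW", "MKTAYIAKQRQISFVKSHFSRQ"):
        print(len(describe(toy)))
```

Because every sequence maps to a vector of the same length, sequences too divergent to align reliably can still be compared in descriptor space, which is the property the alignment-free classifiers rely on.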

Index Terms: Cytochrome b561, partial least squares, support vector machines, profile hidden Markov model.
