A rapid heuristic algorithm to solve the single individual haplotype assembly problem

Document Type : Research Article

Authors

Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran

Abstract

The Haplotype Assembly is the computational process in which two distinct nucleotide sequences of chromosomes are reconstructed using the sequencing reads of an individual. The ability to identify haplotypes provides many benefits for future genomic-based studies to be conducted in many areas, such as drug design, population study, and disease diagnosis. Even though several approaches have been put out to achieve highly accurate haplotypes, the problem of quick and precise haplotype assembly remains a challenging task. Due to the enormous bulk of the high-throughput sequencing data, algorithm speed plays a crucial role in the possibility of haplotype assembly in the human genome dimension. This study introduces a heuristic technique that enables rapid haplotype reconstruction while maintaining respectable accuracy. Our approach is divided into two parts. In the first, a partial haplotype is created and enlarged over a number of iterations. We have employed a novel metric to assess the reconstructed haplotype's quality in each iteration to arrive at the optimal answer. The second stage of the algorithm involves refining the reconstructed haplotypes to increase their accuracy. The outcome reveals that the suggested approach is capable of reconstructing the haplotypes with an acceptable level of accuracy. In terms of speed, the performance of the algorithm surpasses the competing approaches, especially in the case of high-coverage sequencing data.

Keywords

Main Subjects


[1] P. Gupta, J. Roy, M. Prasad, Single nucleotide polymorphisms: a new paradigm for molecular marker technology and DNA polymorphism detection with emphasis on their use in plants, Current science,  (2001) 524-535.
[2] R.A. Gibbs, J.W. Belmont, P. Hardenbol, T.D. Willis, F. Yu, H. Yang, L.-Y. Ch'ang, W. Huang, B. Liu, Y. Shen, The international HapMap project,  (2003).
[3] X. Zhu, S. Zhang, D. Kan, R. Cooper, Haplotype block definition and its application, in:  Biocomputing 2004, World Scientific, 2003, pp. 152-163.
[4] M.-H. Moeinzadeh, J. Yang, E. Muzychenko, G. Gallone, D. Heller, K. Reinert, S. Haas, M. Vingron, Ranbow: a fast and accurate method for polyploid haplotype reconstruction, PLOS Computational Biology, 16(5) (2020) e1007843.
[5] S. Majidian, M.H. Kahaei, D. De Ridder, Hap10: Reconstructing accurate and long polyploid haplotypes using linked reads, BMC bioinformatics, 21(1) (2020) 1-18.
[6] A. Najafi, D. Nashta-ali, S.A. Motahari, M. Khani, B.H. Khalaj, H.R. Rabiee, Fundamental limits of pooled-DNA sequencing, arXiv preprint arXiv:1604.04735,  (2016).
[7] A. Rhoads, K.F. Au, PacBio sequencing and its applications, Genomics, proteomics & bioinformatics, 13(5) (2015) 278-289.
[8] R.J. Roberts, M.O. Carneiro, M.C. Schatz, The advantages of SMRT sequencing, Genome biology, 14(6) (2013) 1-4.
[9] H. Lu, F. Giordano, Z. Ning, Oxford Nanopore MinION sequencing and genome assembly, Genomics, proteomics & bioinformatics, 14(5) (2016) 265-279.
[10] M.A. Quail, I. Kozarewa, F. Smith, A. Scally, P.J. Stephens, R. Durbin, H. Swerdlow, D.J. Turner, A large genome center's improvements to the Illumina sequencing system, Nature methods, 5(12) (2008) 1005-1010.
[11] G. Lancia, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, SNPs problems, complexity, and algorithms, in:  ESA, Springer, 2001, pp. 182-193.
[12] R. Lippert, R. Schwartz, G. Lancia, S. Istrail, Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem, Briefings in bioinformatics, 3(1) (2002) 23-31.
[13] V. Bansal, V. Bafna, HapCUT: an efficient and accurate algorithm for the haplotype assembly problem, Bioinformatics, 24(16) (2008) i153-i159.
[14] W. Qian, Y. Yang, N. Yang, C. Li, Particle swarm optimization for SNP haplotype reconstruction problem, Applied mathematics and Computation, 196(1) (2008) 266-272.
[15] T.-C. Wang, J. Taheri, A.Y. Zomaya, Using genetic algorithm in reconstructing single individual haplotype with minimum error correction, Journal of biomedical informatics, 45(5) (2012) 922-930.
[16] M.-H. Olyaee, A. Khanteymoori, AROHap: An effective algorithm for single individual haplotype reconstruction based on asexual reproduction optimization, Computational biology and chemistry, 72 (2018) 1-10.
[17] S. Majidian, M.H. Kahaei, NGS based haplotype assembly using matrix completion, PLoS One, 14(3) (2019) e0214455.
[18] S. Majidian, M.M. Mohades, M.H. Kahaei, Matrix completion with weighted constraint for haplotype estimation, Digital Signal Processing, 108 (2021) 102880.
[19] M.M. Mohades, S. Majidian, M.H. Kahaei, Haplotype assembly using manifold optimization and error correction mechanism, IEEE Signal Processing Letters, 26(6) (2019) 868-872.
[20] A. Panconesi, M. Sozio, Fast hare: A fast heuristic for single individual SNP haplotype reconstruction, in:  Algorithms in Bioinformatics: 4th International Workshop, WABI 2004, Bergen, Norway, September 17-21, 2004. Proceedings 4, Springer, 2004, pp. 266-277.
[21] X.-S. Xu, Y.-X. Li, Semi-supervised clustering algorithm for haplotype assembly problem based on MEC model, International journal of data mining and bioinformatics, 6(4) (2012) 429-446.
[22] S. Mazrouee, W. Wang, FastHap: fast and accurate single individual haplotype reconstruction using fuzzy conflict graphs, Bioinformatics, 30(17) (2014) i371-i378.
[23] M.H. Olyaee, A. Khanteymoori, E. Fazli, A fuzzy c-means clustering approach for haplotype reconstruction based on minimum error correction, Informatics in Medicine Unlocked, 25 (2021) 100646.
[24] F. Zamani, M.H. Olyaee, A. Khanteymoori, NCMHap: a novel method for haplotype reconstruction based on Neutrosophic c-means clustering, Bmc Bioinformatics, 21 (2020) 1-15.
[25] Y. Ono, K. Asai, M. Hamada, PBSIM: PacBio reads simulator—toward accurate genome assembly, Bioinformatics, 29(1) (2013) 119-121.
[26] Y. Ono, K. Asai, M. Hamada, PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores, Bioinformatics, 37(5) (2021) 589-595.
[27] APP amyloid beta precursor protein [Homo sapiens (human)] - Gene - NCBI, in, National Center for Biotechnology Information.
[28] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, G.P.D.P. Subgroup, The sequence alignment/map format and SAMtools, bioinformatics, 25(16) (2009) 2078-2079.
[29] P. Edge, V. Bafna, V. Bansal, HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies, Genome research, 27(5) (2017) 801-812.
[30] S. Majidian, M.H. Kahaei, D. de Ridder, Minimum error correction-based haplotype assembly: Considerations for long read data, Plos one, 15(6) (2020) e0234470.