Penalized Logistic Regression Models for Phenotype Prediction Based on Single Nucleotide Polymorphisms

Document Type: Research Article


1 M.Sc. student, Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran

2 Amirkabir University of Technology, Biomedical Engineering Department

3 Amirkabir University of Technology, Biomedical Engineering Department


Most studies of phenotype differences, including some diseases, focus on specific positions in the genome called single nucleotide polymorphisms (SNPs). Some SNPs act alone, and others interact with one another, to play an important role in a phenotype or a specific disease. Various models, including regression models, have been designed and implemented to predict these diseases. In this paper, three penalized logistic regression models, Ridge, Lasso, and Elastic Net (EN), are used to predict the risk of a specific disease while overcoming the limitations of classic logistic regression on high-dimensional SNP datasets. The models are applied to 10,000 samples from the SNP dataset of the OWKIN-Inserm Institute, which contains 18,124 SNPs. Among the three, the Lasso model with the minimizer lambda achieves the highest accuracy (73.73%) and AUC (83.54%). The model is also less complex, since it eliminates as many weakly related features as possible and keeps only the most informative ones. Additionally, the better results obtained with Lasso suggest that multicollinearity among the variables is either absent or low enough to be neglected.
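To make the difference between the three penalties concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common parameterization in which the mean log-loss is penalized by lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2), so that alpha = 0 gives Ridge, alpha = 1 gives Lasso, and intermediate values give Elastic Net. The solver (plain proximal gradient descent), the simulated genotypes, and every name and constant in it (`fit_penalized_logistic`, `simulate_genotypes`, `LAMBDA`, the effect size, the allele frequency) are hypothetical choices made for illustration only.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalized_logistic(X, y, lam, alpha, step=1.0, n_iter=200):
    """Proximal gradient descent for penalized logistic regression.

    Objective (assumed form): mean log-loss
        + lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2).
    The intercept is left unpenalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            err = sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - label
            grad0 += err
            for j, x in enumerate(row):
                grad[j] += err * x
        intercept -= step * grad0 / n
        for j in range(p):
            # Gradient step on the smooth part (log-loss plus ridge term) ...
            z = beta[j] - step * (grad[j] / n + lam * (1.0 - alpha) * beta[j])
            # ... followed by the L1 proximal step (a no-op when alpha == 0).
            beta[j] = soft_threshold(z, step * lam * alpha)
    return intercept, beta


def simulate_genotypes(n_samples, n_snps, n_causal, effect, maf, rng):
    """Simulated 0/1/2 genotypes; only the first n_causal SNPs affect risk."""
    X, y = [], []
    for _ in range(n_samples):
        row = [float((rng.random() < maf) + (rng.random() < maf))
               for _ in range(n_snps)]
        logit = effect * (sum(row[:n_causal]) - 2.0 * maf * n_causal)
        X.append(row)
        y.append(1.0 if rng.random() < sigmoid(logit) else 0.0)
    # Center each SNP column so the fixed step size is well within a safe range.
    means = [sum(col) / n_samples for col in zip(*X)]
    X = [[x - m for x, m in zip(row, means)] for row in X]
    return X, y


rng = random.Random(0)
X, y = simulate_genotypes(n_samples=200, n_snps=20, n_causal=3,
                          effect=1.2, maf=0.3, rng=rng)

LAMBDA = 0.05  # hypothetical penalty strength, not tuned
n_zero = {}
for name, alpha in {"ridge": 0.0, "elastic_net": 0.5, "lasso": 1.0}.items():
    _, beta = fit_penalized_logistic(X, y, LAMBDA, alpha)
    n_zero[name] = sum(1 for b in beta if b == 0.0)
    print(name, "coefficients set exactly to zero:", n_zero[name])
```

On simulated data of this kind, the L1 part of the penalty is what sets coefficients exactly to zero, which is the mechanism behind the feature elimination described above; the Ridge penalty only shrinks coefficients without removing them. In practice the penalty strength would be selected on held-out data rather than fixed by hand.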

