Identification of important SNPs using penalized models and Bayesian Deep Learning on whole-genome Arabidopsis thaliana data

Nikita Kohli

Summary

Content type	Digital Document
Collection(s)	Master of Science in Data Science
Resource Type	Text
Genre	thesis

Origin Information

Date Created/Date Issued	2023
Publisher	Thompson Rivers University
Issuance	monographic

Persons and Affiliations

Persons	Author (aut): Kohli, Nikita; Thesis advisor (ths): Jabed, Tomal; Thesis advisor (ths): Yan, Yan
Organizations	Degree granting institution (dgg): Thompson Rivers University. Faculty of Science

Abstract/Description

Abstract

Feature selection is the process of identifying the most important and informative features in a data set for a particular task. It is a critical problem in statistical modeling and machine learning because using all features can lead to over-fitting. Feature selection is even more difficult in high-dimensional data, where the number of features can be significantly greater than the number of samples. Genome-Wide Association Studies (GWAS) are one application where this high-dimensional challenge is common. GWAS aims to identify relationships between genetic variations, such as Single Nucleotide Polymorphisms (SNPs), and physical traits. Feature selection algorithms based on statistical and machine learning methods are often used to tackle the high-dimensionality problem. This research addresses the challenge by proposing two workflows to identify potentially important SNPs. The first workflow, PentaPen, combines five penalized models: Ridge, LASSO, and Elastic Net, fitted on all SNPs, and Group LASSO and Sparse Group LASSO (SGL), fitted on filtered SNPs (the union of SNPs selected by Ridge, LASSO, and Elastic Net). The second workflow, BayesDL, combines Bayesian methods with deep learning, taking as input SNPs preliminarily filtered by Chi-square and ANOVA tests. PentaPen, a machine learning model, aims to provide a reduced number of SNPs by leveraging the beneficial properties of the five penalized models. The union of the SNPs selected by Group LASSO and SGL forms the output of PentaPen. BayesDL, a cascaded deep learning model, aims to provide high prediction performance in addition to identifying important SNPs. BayesDL also mitigates over-fitting on data sets with small sample sizes, a limitation of many traditional neural networks.
The performance of the proposed workflows is compared with existing methodologies using the quality metrics Precision, Recall, F1 score, AUC, R-squared, RMSE, and Accuracy. The systematic comparison of single penalized models provides a guideline that helps researchers make an informed choice of penalized model. BayesDL’s performance is compared with that of a Convolutional Neural Network (CNN). In addition, the important SNPs from both workflows are validated by locating the corresponding genes; the resulting SNPs and genes are compared between the two workflows and with the output of GWAS software (GAPIT and TASSEL). Findings on the continuous and categorical phenotypes of Arabidopsis thaliana plant data indicate that PentaPen performs similarly to LASSO and Elastic Net and better than Ridge, Group LASSO, and SGL by reducing over-fitting. Reduced over-fitting was evident from a 10% decrease in the testing metrics relative to the training metrics. PentaPen performs similarly to Ridge, LASSO, and Elastic Net for the binary phenotype. BayesDL performs better than CNN for all the phenotypes. The findings from the proposed workflows complement GWAS, as they use different models (generalized linear models in GAPIT and TASSEL versus penalized models and probabilistic models in the two proposed workflows, respectively). My study provides a classifier and regressor, PentaPen, for researchers seeking a reduced number of important SNPs for further analysis; it also provides a rigorous comparison of penalized models that gives insight into the strengths and predictive performance of each model. Furthermore, the study gives the bioinformatics community a cascaded classifier and regressor, BayesDL or Bayesian Neural Network (BNN), useful for the prediction and identification of important SNPs in whole-genome SNP data.
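
The first stage of PentaPen described in the abstract (fit Ridge, LASSO, and Elastic Net on all SNPs, then take the union of the SNPs they select) might be sketched roughly as follows. This is an illustration under stated assumptions and not the thesis's implementation: the function names, the penalty values, and the synthetic genotype data are hypothetical, and because Ridge does not shrink coefficients exactly to zero, the sketch assumes a top-k rule on absolute coefficient size for Ridge, which the abstract does not specify. The Group LASSO and SGL stage is omitted.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold` (the LASSO proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def elastic_net_cd(X, y, alpha, l1_ratio, n_iter=100):
    """Coordinate descent for the elastic-net objective
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    l1_ratio=1 reduces to LASSO and l1_ratio=0 to Ridge."""
    n, p = X.shape
    w = np.zeros(p)
    residual = y.astype(float).copy()      # equals y - X @ w while w is zero
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:           # skip all-zero columns
                continue
            residual += X[:, j] * w[j]     # take SNP j out of the current fit
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio))
            residual -= X[:, j] * w[j]     # put the updated SNP j back
    return w


def stage_one_union(X, y, ridge_top_k=10):
    """Union of SNP indices kept by Ridge, LASSO, and Elastic Net.
    Penalty values and the Ridge top-k rule are illustrative assumptions."""
    lasso_w = elastic_net_cd(X, y, alpha=0.3, l1_ratio=1.0)
    enet_w = elastic_net_cd(X, y, alpha=0.5, l1_ratio=0.5)
    ridge_w = elastic_net_cd(X, y, alpha=1.0, l1_ratio=0.0)
    lasso_set = set(np.flatnonzero(lasso_w).tolist())
    enet_set = set(np.flatnonzero(enet_w).tolist())
    ridge_set = set(np.argsort(-np.abs(ridge_w))[:ridge_top_k].tolist())
    return lasso_set | enet_set | ridge_set


def make_toy_data(n_samples=60, n_snps=200, seed=0):
    """Synthetic 0/1/2 genotypes with two causal SNPs (columns 3 and 50)."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    X = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (2.0 * genotypes[:, 3] - 1.5 * genotypes[:, 50]
         + rng.normal(0.0, 0.5, n_samples))
    return X, y - y.mean()


X, y = make_toy_data()
selected = stage_one_union(X, y)
print(sorted(selected))
```

In this sketch the union is deliberately permissive: a SNP survives if any one of the three models keeps it, which matches the abstract's description of the filtered set passed on to Group LASSO and SGL. In practice the penalties would be tuned, for example by cross-validation, rather than fixed by hand.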

Language

English

Institutional Affiliation

Degree Name	Master of Science in Data Science
Degree Level	Masters
Department	Faculty of Science
Institution	Thompson Rivers University

Access and Rights

Use and Reproduction	author
Use License	Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Subjects and Classifications

Keywords	Genome-Wide Association Study; Single Nucleotide Polymorphism; SNP Identification; Machine Learning; Deep Learning; High Dimensional Data
Subject Topic	Single nucleotide polymorphisms; Deep learning (Machine learning)
