THOMPSON RIVERS UNIVERSITY Identification of Important SNPs using Penalized Models and Bayesian Deep Learning on Whole-Genome Arabidopsis Thaliana Data By Nikita Kohli A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in Data Science Kamloops, British Columbia July 2023 SUPERVISORS Dr. Jabed Tomal Dr. Yan Yan © Nikita Kohli, 2023 ABSTRACT The process of identifying the most important and informative features from a data set for a particular task is known as feature selection. Feature selection is a critical problem for statistical modeling and machine learning since employing all features might result in over-fitting. In high-dimensional data, where the number of features can be significantly greater than the number of samples, feature selection is even more difficult. One such application where the high-dimensional challenge is common is in Genome-Wide Association Studies (GWAS). GWAS aims to identify the relationship between genetic variations, such as Single Nucleotide Polymorphisms (SNPs), and physical traits. Feature selection algorithms based on statistical and machine learning methods are often used to tackle the high-dimensionality problem. This research aims to tackle this challenge by proposing two workflows to identify several potentially important SNPs. The first workflow, PentaPen, combines five penalized models - Ridge, LASSO, and Elastic net using all SNPs and Group LASSO and Sparse Group Lasso (SGL) using filtered SNPs (union of SNPs selected by Ridge, LASSO, and Elastic net). The second workflow, BayesDL, combines Bayesian methods with deep learning using preliminary filtered SNPs found by Chi-square and ANOVA as input. PentaPen, a machine learning model, aims to provide reduced numbers of SNPs by leveraging the beneficial properties of five penalized models. The union of SNPs selected by Group LASSO and SGL are the output SNPs of PentaPen. BayesDL, a cascaded deep learning model, along with identifying important SNPs, aims to provide high prediction performance. BayesDL also mitigates the issue of over-fitting while handling data sets with fewer sample sizes, a limitation in various traditional neural networks. ii iii The performances of the proposed workflows are compared with the existing methodologies based on the quality metrics, Precision, Recall, F1 score, AUC, R-squared, RMSE, and Accuracy. The systematic comparison of single penalized models provides a guideline for researchers to make informed decisions to choose a penalized model. BayesDL’s performance is compared with the Convolutional neural network (CNN). In addition, the important SNPs from both workflows are validated to locate genes; these are compared with the output SNPs or genes between each other and from GWAS software (GAPIT and TASSEL). Findings of the continuous and categorical phenotype of Arabidopsis thaliana plant data indicate that PentaPen performs similarly to LASSO and Elastic Net while better than Ridge, Group LASSO, and SGL by reducing over-fitting. Reduced over-fitting was evident with a 10% decrease in the testing metrics compared to the training metrics. PentaPen performs similarly to Ridge, LASSO, and Elastic Net for the binary phenotype. BayesDL performs better than CNN for all the phenotypes. The findings from the proposed workflows complement with GWAS, using different models (generalized linear models in GAPIT and TASSEL versus penalized models and probabilistic models in two proposed workflows respectively). My study provides a classifier and regressor - PentaPen - for researchers finding reduced numbers of important SNPs for further analysis; the study also provides a rigorous comparison of penalized models to gain insights into the strengths and predictive performance of each model. Furthermore, the study also gives the bioinformatics community a cascaded classifier and regressor, BayesDL or Bayesian Neural Network (BNN), useful for the prediction and identification of important SNPs in the whole-genome SNP data. iv Key Words: Genomic Wide Association Study; Single Nucleotide Polymorphism; SNP Identification; Machine Learning; Deep Learning; High Dimensional Data. ACKNOWLEDGEMENTS I first express my gratitude to my supervisors, Dr. Jabed Tomal and Dr. Yan Yan, who guided me through this thesis. They trusted me to build on their collective knowledge, which serves as the basis of this research. I thank Dr. Qinglin (Roger) Yu, the MSc Data Science program coordinator, for his assistance in keeping me on schedule. I would like to extend my profound gratitude to the faculty at Thompson Rivers University, particularly Dr. Emad Mohammed, Dr. Yue Zhang, Dr. Mateen Shaikh, Dr. Mila Kwiatkowska, Dr. Piper Jackson, Dr. Erfanul Hoque, and Dr. Trent Tucker for their patience and support. I acknowledge the Program of Data Science at Thompson Rivers University and Compute Canada for hosting the 22GB and 32GB Linux servers respectively which are used for computation in this research. I would also like to acknowledge the funding for this research from: 1. Natural Sciences and Engineering Research Council of Canada (NSERC) awarded to Dr. Jabed Tomal, Department of Mathematics and Statistics, and Dr. Yan Yan, Department of Computing Science respectively, Thompson Rivers University. Finally, I express my gratitude to the God, my parents, my sister, and my brother for their everlasting love and support. v Contents 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Research Models: An Overview . . . . . . . . . . . . . . . . . 4 1.3.1 Penalized Models . . . . . . . . . . . . . . . . . . . . . 4 1.3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.2.1 Hypothesis test for preliminary screening . . . 6 1.3.2.2 Models . . . . . . . . . . . . . . . . . . . . . 6 1.4 GWAS Software: A brief overview . . . . . . . . . . . . . . . . 8 1.5 Research Objectives, Questions, and Contributions . . . . . . 9 1.6 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.7 Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . 11 vi CONTENTS vii 1.8 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 13 1.9 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Literature Review 2.1 From DNA to SNP . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 2.2 14 GWAS Analysis . . . . . . . . . . . . . . . . . . . . . . 15 GWAS Program . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 Advantages and Limitations of GWAS . . . . . . . . . 19 2.2.2 GWAS Software . . . . . . . . . . . . . . . . . . . . . . 20 2.2.2.1 2.2.3 Bonferroni Correction . . . . . . . . . . . . . 22 Genotypes . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3 Common Statistical and Machine Learning Algorithms . . . . 25 2.4 Penalized Methodologies in the existing literature: A Brief Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5 2.4.1 Comparative Studies of penalized methodologies . . . . 28 2.4.2 Algorithms developed based on penalized methodologies 30 Neural Networks in the Existing Literature: A Brief Overview 31 2.5.1 Comparison studies involving neural networks . . . . . 32 2.5.2 Algorithms developed based on neural network . . . . . 33 CONTENTS viii 3 Research Methodology 35 3.1 Machine Learning for GWAS . . . . . . . . . . . . . . . . . . . 35 3.1.1 Experimental Research Design . . . . . . . . . . . . . . 36 3.1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.1.2.1 Ridge . . . . . . . . . . . . . . . . . . . . . . 39 3.1.2.2 LASSO . . . . . . . . . . . . . . . . . . . . . 41 3.1.2.3 Elastic Net . . . . . . . . . . . . . . . . . . . 41 3.1.2.4 Group LASSO . . . . . . . . . . . . . . . . . 42 3.1.2.5 Sparse Group Lasso (SGL) 3.1.2.6 K-Fold Cross-Validation . . . . . . . . . . . . 43 3.1.2.6.1 3.1.2.7 3.1.3 3.2 . . . . . . . . . . 43 Best λ . . . . . . . . . . . . . . . . . 44 Performance Metrics . . . . . . . . . . . . . . 44 3.1.2.7.1 Classification . . . . . . . . . . . . . 44 3.1.2.7.2 Regression . . . . . . . . . . . . . . . 46 3.1.2.7.3 Multi-class classification . . . . . . . 47 PentaPen: A Comprehensive Workflow . . . . . . . . . 47 Deep Learning for GWAS . . . . . . . . . . . . . . . . . . . . 49 3.2.1 Experimental Research Design . . . . . . . . . . . . . . 49 CONTENTS ix 3.2.1.1 3.2.2 Preliminary feature selection . . . . . . . . . 51 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.2.2.1 Leveraging Hypothesis Testing for Preliminary Screening . . . . . . . . . . . . . . . . . 56 3.2.2.2 3.2.2.3 3.2.2.1.1 Chi-square Test . . . . . . . . . . . . 56 3.2.2.1.2 ANOVA . . . . . . . . . . . . . . . . 57 Neural Networks . . . . . . . . . . . . . . . . 58 3.2.2.2.1 Convolutional neural network . . . . 58 3.2.2.2.2 Bayesian neural network . . . . . . . 60 3.2.2.2.3 Activation functions . . . . . . . . . 63 3.2.2.2.4 Priors . . . . . . . . . . . . . . . . . 63 Posterior Sampling . . . . . . . . . . . . . . . 64 3.2.2.3.1 3.2.3 Software Used . . . . . . . . . . . . . 64 3.2.2.4 SNP Identification using Post-Feature Selection 65 3.2.2.5 Performance Metrics . . . . . . . . . . . . . . 66 BayesDL: A Comprehensive Workflow . . . . . . . . . 67 3.2.3.1 Stan: Model Specification . . . . . . . . . . . 69 3.2.3.2 R: From Sampling to Inference . . . . . . . . 70 CONTENTS x 3.2.3.3 SNP Identification . . . . . . . . . . . . . . . 71 4 Results 4.1 4.2 73 Machine Learning for GWAS . . . . . . . . . . . . . . . . . . . 73 4.1.1 Binary Phenotype . . . . . . . . . . . . . . . . . . . . . 80 4.1.2 Continuous Phenotype . . . . . . . . . . . . . . . . . . 82 4.1.3 Categorical Phenotype . . . . . . . . . . . . . . . . . . 85 4.1.4 Evaluation of Group LASSO and SGL . . . . . . . . . 86 Deep Learning for GWAS . . . . . . . . . . . . . . . . . . . . 87 4.2.1 MCMC Diagnostics . . . . . . . . . . . . . . . . . . . . 91 4.2.2 Binary Phenotype . . . . . . . . . . . . . . . . . . . . . 95 4.2.3 Continuous Phenotype . . . . . . . . . . . . . . . . . . 96 4.2.4 Categorical Phenotype . . . . . . . . . . . . . . . . . . 97 5 Discussion and Conclusion 5.1 Identified SNPs . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.1.1 5.2 106 Linkage Disequilibrium . . . . . . . . . . . . . . . . . . 109 Models’ Performance . . . . . . . . . . . . . . . . . . . . . . . 109 5.2.1 Controlling R-squared Errors: Impact on Performance . 112 CONTENTS xi 5.3 Data Dimensionality and Model Complexity . . . . . . . . . . 113 5.4 Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . 116 5.5 Limitations and Guidelines for BayesDL . . . . . . . . . . . . 116 5.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.7 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Appendices 143 Appendix A PentaPen Results 144 Appendix B Deep Learning Results 148 List of Figures 1.1 Mapping research objectives, questions, and contributions. This diagram illustrates the formulation of research objectives, and research questions, and identifying contributions to a study. . 10 2.1 A graphical explanation of Single Nucleotide Polymorphism (SNP). The diagram illustrates the concept of SNP, a common genetic variation characterized by a single base pair change in the DNA sequence. . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Steps for GWAS Analysis. This diagram outlines the sequential steps involved in conducting a GWAS analysis, starting from data collection (a) to obtaining final SNP results (g) for further statistical analysis. . . . . . . . . . . . . . . . . . . . . 16 3.1 Study Design for penalized models flowchart. This flowchart illustrates the study design for penalized models starting from Pre-processing to important SNP identification and validation. The dashed box indicates the parallel computing across 5-fold. 40 xii LIST OF FIGURES 3.2 xiii Study Design for Deep Learning flowchart. This flowchart depicts the experiment design for deep learning, outlining the key steps involved in building and training neural networks. The flowchart follows the pre-processing till important SNP identification and validation. . . . . . . . . . . . . . . . . . . . 55 3.3 Fully Connected Feed-forward Neural Network Architecture. This diagram illustrates a neural network with an input layer, m hidden layers, and one output layer. The input, hidden layer, and output weights can be represented as θI , θHi , and θo respectively. bI , bHi , and bo can represent the biases of input, hidden, and output layers respectively. . . . . . . . . . . 59 3.4 Bayesian neural network with an input layer, 2 hidden layers with 50 units each layer, and one output layer. The neural network’s neurons are interconnected by lines, and each line has a weighted distribution in each layer. Additionally, the biases associated with each layer also have distributions. . . . 62 4.1 Comparison of five penalized models among themselves and with PentaPen using all SNPs as predictors for the binary phenotype, Anthocyanin. Comparison of Group LASSO and SGL among themselves using pooled SNPs. The performance metrics are recorded for both training and testing sets. . . . . 74 LIST OF FIGURES 4.2 xiv Comparison of five penalized models among themselves and with PentaPen using all SNPs as predictors for the continuous phenotypes, Width and DTF. Comparison of Group LASSO and SGL among themselves using pooled SNPs. The performance metric, R-squared, displayed in this figure is recorded for both training and testing sets. The evaluation using RMSE can be done using Tables A.1, A.2, and A.3 in Appendix. . . . 75 4.3 Comparison of five penalized models among themselves and with PentaPen using all SNPs as predictors for the categorical phenotype, Germination Days. Comparison of Group LASSO and SGL among themselves using pooled SNPs. Accuracy is recorded for both training and testing sets. . . . . . . . . . . . 76 4.4 Comparison of BayesDL with a deep learning model, CNN, for all the phenotypes. The two-split test performance metrics are recorded for comparison. . . . . . . . . . . . . . . . . . . . . . 88 4.5 Plots for prior and posterior samples. The green line indicates the prior distribution while the purple line refers to the distribution of posterior samples. . . . . . . . . . . . . . . . . . . . 90 4.6 Auto-correlation functions using RStan against MCMC iterations. Functions of weights corresponding to the top 2 predictors (or SNPs) for all the phenotypes. The x-axis and y-axis represent Lags and Auto-correlation values respectively. . . . . 92 LIST OF FIGURES 4.7 xv Trace plots using RStan against post-warmup MCMC iterations. Here are the functions for the top 2 predictors (or SNPs) across all phenotypes and the y-axis represents the value of weights corresponding to each SNP. The trace plots assess the convergence, stability, and distribution of the weights throughout the MCMC sampling process. . . . . . . . . . . . . . . . . 93 4.8 Posterior uni-variate distributions along the diagonal and bivariate distributions along the off-diagonal using RStan against MCMC iterations. These functions represent the top 2 predictors (or SNPs) for each phenotype. The x-axis for each posterior distribution is the value of weights corresponding to the input. Whereas the scatter plot helps to check the correlation between the SNPs. . . . . . . . . . . . . . . . . . . . . . 94 List of Tables 1.1 Phenotype data used in this study. Overview of Phenotype Data from AtPolyDB and F1-hybrids data sets, categorized by Variable Type. . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1 Data structure for ped file. Overview of PED File Structure: Family ID, Individual ID, Parental IDs, Gender, Phenotype, and SNP Information . . . . . . . . . . . . . . . . . . . . . . . 23 2.2 Data structure for the map file. Overview of MAP File Structure: Chromosomes, SNP Identifiers, Genetic Distance, and Base-Pair Positions . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Overview of the attributes of HapMap- a genotype file. . . . . 24 2.4 Columns of VCF (Variant Call Format)- a genotype file . . . . 26 3.1 The number of SNPs selected for the corresponding significance level threshold. The chi-square test and ANOVA are used to record these thresholds. The marked number of SNPs are finally used as input of Neural networks. . . . . . . . . . . 52 xvi LIST OF TABLES 4.1 xvii Number of important SNPs selected by penalized models, SNP Pooling, and PentaPen (the proposed workflow). The SNP Pool was used as the input for Group LASSO and SGL in the proposed workflow using penalized models. . . . . . . . . . . . 77 4.2 Computation time (in seconds) of penalized methods and the workflow using penalized methodologies . . . . . . . . . . . . . 78 4.3 Top 10 SNPs for all the phenotypes using GAPIT. The highlighted SNPs are shared SNPs with TASSEL except DTF and Germ having no shared SNPs. . . . . . . . . . . . . . . . . . . 79 4.4 Top 10 SNPs for all the phenotypes using TASSEL. The highlighted SNPs are shared SNPs with GAPIT with the exception of DTF and Germ having no shared SNPs. . . . . . . . . . . . 79 4.9 Computation time (in seconds) of proposed workflow based on BNN across all phenotypes . . . . . . . . . . . . . . . . . . . . 89 4.5 SNP Validation for Anthocyanin. The table displays the top 10 SNPs reported by PentaPen. Bold values of genes and SNPs have the true function of the phenotype. . . . . . . . . . 98 4.6 SNP Validation for Width. The table displays the top 10 SNPs reported by PentaPen. The SNPs and genes which are bold, show the true function of Width. . . . . . . . . . . . . . . . . 99 4.7 SNP Validation for DTF. The table displays the top 10 SNPs reported by PentaPen. The highlighted genes and SNPs show the true function of DTF. . . . . . . . . . . . . . . . . . . . . 100 LIST OF TABLES 4.8 xviii SNP Validation for Germination Days. The table displays the potentially important SNPs reported by PentaPen. The gene and SNPs showing the true function of the phenotype are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.10 SNP Validation for Anthocyanin. The table displays the top 10 SNPs reported by BayesDL. The highlighted genes and SNPs are found to show characteristics of Anthocyanin. . . . . 102 4.11 SNP Validation for Width. The table displays the top 10 SNPs reported by BayesDL. The bold values of SNPs and genes are responsible for the observed trait. . . . . . . . . . . . . . . . . 103 4.12 SNP Validation for DTF. The table displays the top 10 SNPs reported by BayesDL. There is only one gene or SNP that is associated with DTF, which is highlighted in the table. . . . . 104 4.13 SNP Validation for Germination Days. The table displays the top 10 SNPs reported by BayesDL. The highlighted genes and SNPs are found to be associated with the phenotype. . . . . . 105 LIST OF TABLES 5.1 xix Comparison of two developed workflows based on the data dimensionality and model complexity. Here, n, p, and o denote the number of samples, predictors, and output dimensions or classes of the data set respectively. The number of folds, iterations, chains, hidden layers, and nodes in each layer is represented by k, i, c, m, and N respectively. ppool are the number of SNPs from SNP Pooling. The number of SNPs preliminary selected from the hypothesis test are given by pf ilter . Lastly, P is the total number of parameters of the workflow. . . . . . 113 A.1 Comparison of Ridge, LASSO, and Elastic Net without SNP Pooling for all the phenotypes. The performance metrics are recorded for both training and testing sets. . . . . . . . . . . . 145 A.2 Comparison of Group LASSO, SGL using all SNPs, and PentaPez for all the phenotypes. The performance metrics are recorded for both training and testing sets. . . . . . . . . . . . 146 A.3 Comparison of penalized methodologies using filtered SNPs as predictors for all the phenotypes. The performance metrics are recorded for both training and testing sets. . . . . . . . . . 147 B.1 Comparison of BayesDL with a deep learning model, CNN. The two-split test performance metrics are recorded for comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Nomenclature ABNN Approximate Bayesian Neural Network ANN Artificial Neural Network ANOVA Analysis of Variance BLINK Bayesian information and Linkage disequilibrium iteratively nested keyway BLUP Best Linear Unbiased Prediction BNN Bayesian Neural Network BRR Bayesian Ridge Regression CAR Correlation-Adjusted marginal correlation CNN Convolutional Neural Network CoV Coefficient of Variation CV Cross-validation DL Deep Learning DNA Deoxyribonucleic acid xx NOMENCLATURE xxi DNN Deep Neural Network FarmCPU Fixed and Random Model Circulating Probability Unification GAPIT Genome Association and Prediction Integrated Tool GBM Gradient Boosting Machine GBT Gradient Boosting Tree GLM Generalized Linear Model GS Genomic Selection GWAS Genome Wide Association Studies HMC Hamiltonian Monte-Carlo ICA Independent Component Analysis LASSO Least Absolute Shrinkage and Selection Operator LDA Linear Discriminant Analysis LD Linkage Disequilibrium LMM Linear Mixed Model MA Metropolis Algorithm MCMC Markov Chain Monte Carlo MLMM Multiple Loci Mixed Model ML Machine Learning NEG Normal Exponential Gamma NOMENCLATURE NUTS No-U Turn PCA Principal Component Analysis PMR Penalized Multiple Regression PRS Polygenic Risk Scores PUMA Penalized Unified Multiple Analysis RF Random Forest RLS Regularized Least Squares SGL Sparse Group LASSO SNP Single Nucleotide Polymorphisms SSVS Stochastic Search Variable Selection SVM Support Vector Machine t-SNE t-distributed Stochastic Neighbor Embedding TASSEL Trait Analysis by aSSociation, Evolution and Linkage UMAP uniform manifold approximation and projection (UMAP) VCF Variant Call Format VI Variational Inference WES Whole-Exome Sequencing WGS Whole-Genome Sequencing xxii Chapter 1 Introduction Feature selection is a process of identifying and selecting the most informative and relevant features in a data set. It is an important issue in statistical modeling and machine learning since using all features might result in over-fitting and poor model performance. In high-dimensional data sets, this problem becomes particularly challenging, where the number of features or variables is much larger than the number of samples. Due to the high dimensionality, the data matrix is sparse and has a lot of noise or irrelevant features, which might lead to the models’ poor performance. Genome-wide association studies (GWAS) is a popular bioinformatics application of a high-dimensional data set where feature selection is challenging. GWAS discovers associations between specific DNA variants (Single Nucleotide Polymorphism (SNP)) and phenotype. SNP is the key to understanding the genetic causes of an organism’s physical traits (Allen et al. [2014]). The substitution of a C for a G in the nucleotide sequence AACGAT to produce the sequence AACCAT is an example of an SNP (Britannica [2019]). 1 There are a few limitations while performing association tests using whole-genome SNP data. On extensive complex data, GWAS methods frequently have constraints on the SNP scale and may fail to detect associated SNPs (Korte and Farlow [2013]). In addition, the sparsity of data in highdimensional domains leads to the “curse of dimensionality” (Bellman and Kalaba [1959]). In other words, it occurs for whole-genome SNP data when the number of SNPs (p) is much larger than the number of samples (n) which could be one of the possible challenges for researchers to perform the analysis. 1.1 Motivation Although several studies have been conducted in the past using machine learning and deep learning algorithms on human whole-genome SNP data, there is limited research focused on plant data. This limitation is noteworthy because plant genomes exhibit distinct genetic characteristics when compared to human genomes. Plant genomes are more complex and large in dimension; they have a higher degree of homozygosity. Moreover, the extended growth period of plants, such as wheat requiring 120 days, necessitates longer research timelines. The constrained research budget also plays a role, with researchers often prioritizing the study of animal species over plants (Xu et al. [2014]). As a result, plant genomics research faces challenges related to genetic complexity and computational time to perform classification and regression. These differences lead to looking for different statistical approaches and tuning parameters. This research aims to address the challenge to identify important SNPs from plant whole-genome SNP data. This study uses 2 the model organism, Arabidopsis thaliana due to its faster growth rate, high genetic diversity, fully sequenced genome, and well-characterized SNP data sets available. It is a small flowering plant that is a member of the mustard (Brassicaceae) family, which includes cultivated species such as cabbage and radish. 1.2 Scope One possible way to solve the dimensionality problem and reduce the number of loci associated with the trait (increases the detection power) is to use feature selection. Supervised feature selection is typically used for classification or regression applications on whole-genome SNP data to find potentially important SNPs (or features). It attempts to choose a subset of features that can classify data into separate groups or measure up to the regression targets (Li et al. [2017]). A previous study from our group (Puliparambil et al. [2022]) investigated the curse of dimensionality issue on high-dimensional scRNA-seq data. Wrapper methods (forward, backward, and step-wise selection), filter methods (Analysis of variance (ANOVA), Pearson correlation, variance thresholding), and embedded methods (Least absolute shrinkage and selection operator (LASSO), Ridge, decision tree) are the three main categories of feature selection (McCombe [2019]). Various studies have been published using the methods described above as filtering (Tsamardinos et al. [2019]), wrapper (Kavakiotis et al. [2017]), and embedding (Li and Huang [2018]) for wholegenome SNP data. In addition, there have been several existing studies conducted using various algorithms including Recursive Feature Elimination 3 (Jeon and Oh [2020]), Genetic Algorithm (Mirjalili and Mirjalili [2019]), and Particle Swarm Optimization (Kennedy and Eberhart [1995]) to identify important features in high-dimensional data; but they are time-consuming, require high computational resources, and have an exhaustive iterative process. To tackle the high-dimensionality challenge using plant whole-genome data, this study is narrowed down to utilize five advanced penalized methodologies (such as Ridge, LASSO, Elastic Net, Group LASSO, and SGL) and a deep learning method (BNN). The penalized models used in this study can effectively handle a large number of predictors, resulting in more accurate and efficient results. Furthermore, BNN can capture complex non-linear relationships between genetic markers and phenotypes and reduce over-fitting. The following section provides a brief introduction to the machine and deep learning models used to develop two classifiers or regressors. 1.3 Research Models: An Overview 1.3.1 Penalized Models The penalized models (Ridge, LASSO, and their variants) generate sparse solutions which aid in feature selection. These are predictive models based on the expression of a small number of SNPs, enabling feature selection for high dimensional data (Ma and Huang [2008]). The magnitude of the penalty in penalized models can be tuned to choose the most important SNPs and other predictors. Hard thresholding reduces the dimensionality of a model by excluding variables whose coefficients are close to zero (Liu and Foygel Barber 4 [2020]). Ridge (Hoerl and Kennard [1970]), one of the penalized machine learning techniques, can force the least important coefficient weights close to 0 in the regularization paths. It is possible to choose a small subset of predictors from a large set of predictors, that is, reduce coefficient values exactly to 0 known as soft thresholding (Liu and Foygel Barber [2020]). LASSO (Tibshirani [1996]), Group LASSO (Yuan and Lin [2006]), Sparse group lasso (SGL) (Simon et al. [2013]), and Elastic net (Zou and Hastie [2005]) are some of those soft thresholding regularization algorithms with a varied penalty term. For instance, LASSO has l1 penalty which is the absolute value of the magnitude of coefficients. This study aims to combine five penalized models in two phases to develop a penalized-based workflow- PentaPen. PentaPen is a classifier and regressor developed for SNP identification to minimize the variance in the data. Additionally, there are some studies on penalized techniques for highdimensional data been conducted recently (Gao et al. [2014], Liu et al. [2013]). For instance, various studies (Van Wieringen et al. [2009], Bøvelstad et al. [2007]) find that Ridge outperforms LASSO for prediction performance. Whereas using a simulation study (Ogutu et al. [2012]), it was found that LASSO outperforms Ridge. However, to our knowledge, no detailed analysis has been done of how the penalized models compare favorably against one another for whole-genome SNP data. Our research intends to bridge this knowledge gap, offer a thorough pathway for comparing these models’ performance, report the computational time, and report the model complexity by the penalized models and PentaPen. 5 1.3.2 Deep Learning 1.3.2.1 Hypothesis test for preliminary screening Hypothesis testing is a widely used statistical method in genetics research for feature selection (Huynh-Thu et al. [2012]). Researchers often use chi-square hypothesis testing to select categorical/binary variables and ANOVA (Fisher [1954]) to select continuous variables to identify genetic variants that are associated with specific diseases or traits (Galesloot et al. [2014], Zhuang et al. [2012]). In this procedure, a subset of genetic markers, such as SNPs, that are most strongly associated with the phenotype is chosen. Feature selection is important because it can help reduce the dimensionality of the data and make it easier to find the most useful genetic markers (Wu et al. [2019]). In this study, we use chi-square and ANOVA tests to select preliminary features of the binary/categorical and continuous phenotype respectively from the original data to train deep learning models. Performing preliminary feature selection is essential to reduce the number of features as this allows neural networks to provide more accurate predictions. 1.3.2.2 Models Researchers used different deep-learning techniques for prediction and feature selection. For instance, various studies on high-dimensional data were conducted using Bayesian Neural Networks (BNNs). Neal and Zhang [2006] proposed a new method for high-dimensional classification that combines BNNs and Dirichlet diffusion trees. The authors aimed to improve classification accuracy in high-dimensional data sets by using these two techniques 6 together. The BNN provided a flexible and powerful model for classification, while the Dirichlet diffusion tree provided a way to handle the curse of dimensionality in high-dimensional data sets. Results showed that BNN applied with Dirichlet diffusion trees outperformed other deep learning models. The purpose of Gianola et al. [2011]’s study was to investigate the potential of some Bayesian Artificial Neural Network (ANN) architecture in predicting complex quantitative traits in two different data sets, Jersey cows and wheat. The results indicated that adding non-linearity from ANN along with Bayesian uncertainties improved the predictive ability in both data sets. Due to the strengths of Bayesian combined with deep learning models, the objective of this research is to develop a BNN-based workflow- BayesDL which makes predictions on selected SNPs using Chi-square/ ANOVA. BayesDL is a cascaded classifier and regressor of ANNs followed by Bayesian inference. BayesDL is developed using Stan (Carpenter et al. [2017]), a probabilistic programming language. Stan provides full Bayesian inference for various models, including linear and non-linear regression, time-series analysis, general linear models, generalized mixed models, hierarchical models, and more. The goal of Stan is to provide a simple platform for specifying complex models and doing inference using methods like variational inference (VI) and Markov Chain Monte Carlo (MCMC) algorithms. Additionally, we aim to use Stan to perform MCMC diagnostic and compare BayesDL’s prediction performance with the Convolutional Neural Network (CNN). Further, BayesDL aims to output important SNPs used for follow-up biological analysis. 7 1.4 GWAS Software: A brief overview GWAS is a study design used in genetics and genomics research to identify genetic variants (here SNPs) associated with physical traits. GWAS involves analyzing the entire genome of individuals to identify SNPs that are statistically associated with a particular phenotype. GWAS software refers to the computational tools and algorithms used to conduct association studies. The most popular GWAS software are PLINK (Purcell et al. [2007]), BOLTLMM (Loh et al. [2015]), FaST-LMM (Lippert et al. [2011]), GCTA (Yang et al. [2011]), TASSEL (Bradbury et al. [2007]), and GAPIT (Lipka et al. [2012], Tang et al. [2016]). Some researchers (Fu et al. [2021], Sardos et al. [2016]) worked on comparing important SNPs output from various GWAS software. One such study (Lipka et al. [2012]) evaluated the performance of several GWAS software tools, including GAPIT, on a maize dataset. The study found that GAPIT produced highly precise and computationally efficient results compared to other tools, such as TASSEL and PLINK. But there has not been enough work to find the shared SNPs from penalized or deep learning methods compared with GWAS software. The GWAS software is designed to handle large-scale genetic data, provides a comprehensive set of algorithms, incorporates statistical models, and includes functionalities for data preprocessing. The limitations of GWAS may lead to false-positive or false-negative results. 8 1.5 Research Objectives, Questions, and Contributions Given the limitations of previous research using penalized and deep learning methodologies, I aim to address the challenges and achieve the following goals displayed in Figure 1.1. The figure represents a block diagram for mapping the research objectives, questions, and expected thesis contributions. 1.6 Hypotheses There are three main hypotheses for this study, which are mentioned below: 1. PentaPen will outperform individual penalized models in terms of evaluation metrics. 2. BayesDL will outperform the deep learning model, CNN in terms of performance metrics for prediction. 3. The identified SNPs from PentaPen and BayesDL will demonstrate limited overlap with each other and with those obtained from existing GWAS software, due to the different models used. 9 Figure 1.1: Mapping research objectives, questions, and contributions. This diagram illustrates the formulation of research objectives, and research questions, and identifying contributions to a study. 10 1.7 Experimental Data To evaluate the performance of the developed workflows, two data from the model plant Arabidopsis thaliana, AtPolyDB and F1, are used for this study. Using the Arabidopsis data set for performance evaluation of the SNP analysis algorithm is justifiable as it is a widely studied model organism with a fully sequenced genome, high genetic diversity, and well-characterized SNP data sets. Developing an algorithm specific to the Arabidopsis genome can lead to more accurate and efficient SNP analysis. They are obtained from easyGwas website 1 . The easyGwas website is a repository of whole-genome SNP data that has been systematically processed is ready to be analyzed and is thoroughly documented. On the easyGwas website, there are presently 12 data sets including estimates for SNP counts as well as related publications for their exploratory analysis. The AtPolyDB dataset has 1307 samples with 214051 SNPs (or features) and the F1 data set has 372 samples with 204753 SNPs. Atwell et al. [2010] used the AtPolyDB dataset for the application of GWAS to study 107 phenotypes. Seymour et al. [2016] utilized the F1 data set to study the inheritance in Arabidopsis thaliana hybrids. Both data sets contain three files: (a) the PED file stores genotypic data with 1307 and 372 samples followed by total fields as 6 + 2 × p (p is the number of SNPs), (b) the PHENO file stores the phenotypic data having 1307 and 372 instances with 109 and 3 columns, and (c) MAP file has 214051 and 204753 rows, which contains the information about every single SNP, with four fields each row. Each row corresponds to 1 https://easygwas.ethz.ch/data/public/dataset/view/1/ and https://easygw as.ethz.ch/data/public/dataset/view/42/ 11 one SNP in the PED file. These are the Chromosome code, SNP(Variant) identifier, position in morgans or centimorgans, and base-pair coordinate. Table 1.1 represents the phenotypes selected for this study. The phenotypes from the AtPolyDB data set were chosen at 220 Celsius to consider all the phenotypic properties of the species during the summer season. These phenotypes are commonly used in genetic studies of A. thaliana because of their easily observable and measurable characteristics of seed development and flowering time. These properties can affect the yield and quality of crops. Three different variable types of phenotypes are selected for this study to increase the reliability of the developed workflows. Table 1.1: Phenotype data used in this study. Overview of Phenotype Data from AtPolyDB and F1-hybrids data sets, categorized by Variable Type. Variable Dataset Phenotype Type Number Phenotype Descripof sam- tion ples AtPolyDB Anthocyanin 22 177 Visual (AT P 172) presence anthocyanin Binary AtPolyDB Width 22 Continuous 175 Plant diameter 372 Plants (AT P 166) F1- DTF hybrids (AT1P6701) quently were subse- phenotyped for days to first open flower Categorical AtPolyDB Germ 22 177 Days to germination (AT P 163) 12 1.8 Performance Metrics Various quality metrics are used as there are three types of phenotypes used in this study as described in Section 3.1.2.7. Classification metrics such as Precision, Recall, F1 score, and AUC are used to evaluate the performance of classification models for the binary phenotype. Regression metrics like R-squared and RMSE are utilized to assess the goodness-of-fit of regression models for continuous phenotypes. For categorical phenotype, Accuracy is the multi-class classification metric used in this study. 1.9 Thesis Outline The remainder of this thesis is structured as follows. The literature review is presented in Chapter 2. Chapter 3 outlines the research based on penalized and deep learning research methodologies. The findings are presented in Chapter 4. The biological interpretations, conclusion, and potential future work for this research are presented in Chapter 5. The Appendix contains several Tables of results used for the comparison and interpretation of two methodologies. 13 Chapter 2 Literature Review The methodologies for GWAS, some common machine learning, deep learning, statistical approaches for SNP data, and related studies are briefly explained in this chapter. The methodology and data sets used in the published studies vary, as do the issues with whole-genome SNP data that the methods attempt to solve. 2.1 From DNA to SNP Deoxyribonucleic acid (DNA) comprises an extremely long chain of connected single units called nucleotides: Adenine, Thymine, Guanine, and Cytosine. Species share 99% of their make-up and only 1% is responsible for the diversity observed in the individuals (Lewontin [2003]). For example, it is evident from Figure 2.1 that in a particular location says one of the 95% of individuals has A nucleotide while one of the 5% species’ individuals has a T 14 nucleotide, each of these is called a variant. Given this location has multiple forms it is called the SNP. These are the key to understanding the genetic causes of an organism’s traits (Bush and Moore [2012]). SNPs can have functional changes such as amino acid changes that cause the alteration to the mRNA transcript stability and changes to transcription factor binding affinity (Robert and Pelletier [2018]). Figure 2.1: A graphical explanation of Single Nucleotide Polymorphism (SNP). The diagram illustrates the concept of SNP, a common genetic variation characterized by a single base pair change in the DNA sequence. 2.1.1 GWAS Analysis Figure 2.2 displays the steps to be followed for GWAS analysis (Uffelmann et al. [2021]) which are described below: (a) Data is collected to find an appropriate number of samples of individuals who differ in the trait of interest and the genotypic information is 15 obtained, Figure 2.2: Steps for GWAS Analysis. This diagram outlines the sequential steps involved in conducting a GWAS analysis, starting from data collection (a) to obtaining final SNP results (g) for further statistical analysis. (b) for genotyping, microarrays can be used to detect common variations or next-generation sequencing techniques can be used for whole-genome sequencing (WGS) or whole-exome sequencing (WES), (c) the wet-laboratory procedure, such as genotype calling and DNA switches include the isolation of DNA from all the subjects using a DNA chip (a molecular kit) to identify the SNP alleles (a variant form of a gene) they have at each SNP genome position, (d) to maximize detection sensitivity for variant calling, pipelines should include various tools (FreeBayes, GATK, Platypus, Samtools/mpileup) for each kind of variation (Koboldt [2020]); pipelines for SNP calling 16 rely on a reference genome. The most popular approach for this is to select one genome assembly as the standard against which all others will be measured (Leggett and MacLean [2014]), (e) the dry-laboratory procedures on called genotypes, such as eliminating undesirable SNPs and individuals, identifying population strata in the sample, and calculating principal components involves quality control, (f) look through all the SNPs that are examined to find SNPs where the population has different allele frequencies and SNPs with genetic differences will be marked, and (g) the final SNP results will then be used as input data to carry out GWAS with traditional statistical methods. Anonymized individual ID numbers, kinship relationships between individuals, sex information, phenotype data, covariates, genotype data for variants, and batch genotyping information are all included in the input files for a GWAS. 2.2 GWAS Program Association tests can be carried out in multiple ways using statistical models. The simplest is the generalized linear model (GLM) usually used for quantitative traits and genotype classes. For control traits, logistic regression is generally preferred to predict the probability of having case status given a genotype class. This method is preferred as it allows adjustment of covariates and odds ratio as a measure of effect size (Bush and Moore [2012]). 17 Regression analysis is performed for every SNP in the dataset, taking each subject as a data point. The regression model of the GWAS (Cantor et al. [2010]) can be represented as: Z = Xθ + ϵ (2.1) where X is a n × (p + 1) matrix of predictors, a (p + 1) × 1 vector θ consists of their coefficients, and matrix Z is the n × 1 matrix of a phenotype. Here, n denotes an individual and p denotes a predictor (SNP). In ordinary linear regression, the variation in a trait can be fully described by two parameters - a common variance σ 2 and a mean vector in the design matrix Z that is unique to each individual in the sample (n). The residuals, or deviations (ϵ) from predicted trait values, are assumed to be normally distributed. A regression line is fitted to the data that best predicts the relationship between several alleles or SNPs and a phenotype. The p-value is a statistical measure used to assess the significance of the association between the SNPs and phenotype. It measures the likelihood that the association found in the distribution of data points was due to random chance, given that there is no true association. The stronger the data points cluster around the regression line, the less likely the association is due to random chance (Uffelmann et al. [2021]). A smaller p-value indicates that the observed association is less likely to be due to random chance. For each SNP, the p-value and the slope (effect) of the regression line are recorded. However, working with millions of SNPs can be computationally expensive. After recording the pvalues, a significance test is performed to determine whether the association is statistically significant. The null hypothesis is that there is no significant association between the SNPs and phenotype. A p-value threshold of 0.05 is commonly used to determine whether the association is significant or not. 18 If the p-value is less than 0.05, it is concluded that the association is not due to random chance and is considered to be present. Visualization is done using the Manhattan Plot (which resembles the Manhattan skyline) where each SNP position on a chromosome is represented on the x-axis and its associated p-value on the y-axis is precisely negative of the log of a pvalue. Each dot represents each SNP position in each chromosome and the dot’s height indicates a significant difference. Hence, dots above a certain horizontal line (threshold) represents SNPs that are significantly associated with the chosen physical trait (Ikram et al. [2010]). 2.2.1 Advantages and Limitations of GWAS The GWAS program is advantageous as it allows for adding covariates (other effects that may affect the parameter of interest) along with genotypes. The GWAS has limitations due to genetic confounding and complex genetic architectures (Korte and Farlow [2013]). The heterogeneity of genetic origins among the individuals in association studies may lead to false-positive or false-negative results (Hu and Ziv [2008]). This can be due to the lack of replication of the association study. Since many of the phenotypes of interest in living beings may be complex (combination of genetic and environmental factors), GWAS may be unable to find the causative loci we seek. One possible solution is to reduce the number of loci associated with the trait improving the detection power. This can be done using the LASSO (Korte and Farlow [2013]). However, with p = 0.05, there is still a chance of producing a false positive result, that 5% of significant results are still due to random chance. 19 Working with millions of SNPs leads us to produce thousands of false positives, muddling the results and diminishing the study’s statistical power. Furthermore, the high dimensionality of current whole-genome SNP data presents significant hurdles for computationally intensive GWAS techniques like permutation testing (Li et al. [2018a]). Some of the approaches that can be used to correct the multiple testing (more false positives over the entire GWAS analysis) are the Bonferroni correction and Permutation testing (Bush and Moore [2012]). 2.2.2 GWAS Software PLINK (Purcell et al. [2007]), BOLT-LMM (Loh et al. [2015]), FaST-LMM (Lippert et al. [2011]), GCTA (Yang et al. [2011]), TASSEL (Bradbury et al. [2007]), GAPIT (Lipka et al. [2012], Tang et al. [2016]), and others are some of the most extensively utilized GWAS software. They are all based on standard statistical models, and many of them can be used to choose LMM, GLM, or other models. PLINK is a free, open-source toolkit for population-based linkage analysis and GWAS. It provides a range of fundamental yet fast computations for huge biological data sets. It was one of the first GWAS software and was created for human genomes. PLINK is now regarded as a standard GWAS approach in several research disciplines. BOLT-LMM uses a Bayesian mixed model that is said to improve statistical power and processing speed for larger data sets. Loh et al. [2015] applied BOLT-LMM to the Women’s Genome Health Study (WGHS) and 20 collected data on nine phenotypes in 23,294 samples. FaST-LMM proposes to scale linearly with data size in both runtime and memory by using a factorized log-likelihood function in the LMM. It may be a potential software for achieving satisfactory results with enormous data. GCTA is a technique for analyzing complicated traits across the genome. It solves the “missing heritability” problem of human genomes by using LMM to fit the contribution of all SNPs as random effects. Trait analysis by aSSociation, Evolution, and Linkage (TASSEL) and Genome association and prediction integrated tool (GAPIT) were created with GWAS of agricultural plants in consideration. TASSEL was originally designed and tested for maize to handle various insertions and deletions. Many other programs did not take these polymorphisms into account previously. GAPIT was created after TASSEL used comparable statistical methods to TASSEL. It can do both GWAS and Genomic Selection (GS) analyses (Goddard and Hayes [2007]). GS is a marker-assisted selection method in which breeding values are anticipated using genetic markers found throughout the genome (usually SNPs). GAPIT is regularly updated to provide the most recent techniques. As a result, it is said to produce the most precise and computationally efficient outputs. Multiple loci mixed model (MLMM), fixed and random model circulating probability unification (FarmCPU), and Bayesian-information and linkagedisequilibrium iteratively nested keyway (BLINK) were the three multi-locus test techniques that were implemented by version 3 of GAPIT (Wang and Zhang [2021]). Additionally, two GS techniques built on CMLM—compressed BLUP (cBLUP) and SUPER BLUP (sBLUP) were implemented. These new 21 solutions increase GWAS statistical strength and GS prediction accuracy, processing speed, and the ability to evaluate large amounts of genomic data. 2.2.2.1 Bonferroni Correction A multiple-comparison method known as the Bonferroni correction is used when several statistical tests are run together (Chen et al. [2021], Kaler and Purcell [2019]). It is a common approach to setting the local significance level as the global significance level (α) divided by the number of SNP variables (p). This was used in GWAS to reduce the chances of obtaining false-positive rates and to identify the important SNPs from the outputs. In this study, the GWAS software output was adjusted using the Bonferroni correction to identify the important SNPs from the experimental data sets. This correction was needed because the software output contained a different number of SNPs, making it an incompatible comparison with the output of the developed workflows. 2.2.3 Genotypes There are mainly three common SNP data formats for the genotypes. They are categorized as PLINK, Hapmap, and VCF. PLINK contains two types of files ped and map files. The samples are stored in rows while SNPs are stored in columns for the ped file. Table 2.1 represents the general structure of the ped file. It has a family ID, individual ID, paternal ID, maternal ID, and gender(1=male; 2=female; 0=unknown) as the first five columns. The sixth column in the ped file is generally re22 served for the phenotype data. The rest of the columns are SNPs for each chromosome2 . Table 2.1: Data structure for ped file. Overview of PED File Structure: Family ID, Individual ID, Parental IDs, Gender, Phenotype, and SNP Information famid iid fid mid sex pheno SNP1.1 SNP1.2 ... 9381 9381 0 0 0 0 T T ... 9380 9380 0 0 0 0 C C ... 9378 .. . 9378 .. . 0 .. . 0 .. . 0 .. . 0 .. . T .. . T .. . ... .. . The map file has rows as SNPs and columns as SNPs’ locations. Table 2.2 displays the widely used map file format. The first two columns are chromosomes and the SNP identifier of respective chromosomes, while the other two columns give the genetic distance (morgans) and base-pair position (bp units). Table 2.2: Data structure for the map file. Overview of MAP File Structure: Chromosomes, SNP Identifiers, Genetic Distance, and Base-Pair Positions Chr SNP GD BP 1 rs6681049 0 6681049 1 rs4074137 0 4074137 1 .. . rs6704013 .. . 0 .. . 6704013 .. . The other format is Hapmap text-based file. The Hapmap file format is 2 https://zzz.bwh.harvard.edu/plink/data.shtml 23 a table with 11 columns and one column for each genotyped sample. Each further row has all of the information associated with a specific SNP, with the first row containing the header labels of the samples (Chang [2020]). A chromosome-specific or generic Hapmap file or a generic file can be acquired to get the HapMap data format (Richard et al. [2003]). The attributes stored in any HapMap file are displayed in Table 2.3. Table 2.3: Overview of the attributes of HapMap- a genotype file. Attributes Meaning rs# SNP identifier alleles SNP alleles chrom the chromosome of each SNP pos position of SNP on that particular chromosome strand SNP’s orientation(forward (+) or backward (-)) within the DNA strand assembly# NCBI reference sequence assembly center genotyping facility that generated the genotypes protLSID HapMap protocol identifier assayLSID HapMap assay identifier used for genotyping panelLSID panel of individuals genotyped identifier QCcode quality control for all entries; Parental#/ID# subsequently, the list of sample names Originally the 1000 Genomes Project created the VCF (Variant Call Format), a standardized format. Meta-information, header, and data lines comprise the VCF format. The meta-information is organized into rows, with the first one being required and containing the VCF format version. More, but not essential, lines indicating the file’s origin can be added after the first row, 24 such as the file’s creation date, source, reference, phasing, etc. The following entries are found in the information lines (##INFO) as ID, Number, Type, and Description. The header line comes after the meta-information lines, which contain all of the information and format data. The header line must include a minimum of eight fixed columns that are tab restricted. It also has a column named FORMAT, followed by the sample IDs if it has genotypic data. The following is the syntax of the information line: ##INFO = After the information lines, a few lines indicate the format of each variable of the data. The syntax for the ##FORMAT line is as follows: ##FORMAT= The header line after the meta-information lines contains the following columns and their respective data information as displayed in Table 2.4. The data lines tab is limited after the header line, with ’.’ representing missing data (Lyon et al. [2021]). 2.3 Common Statistical and Machine Learning Algorithms The popular Machine Learning (ML) and statistical algorithms (aside from penalized models used in this study) that have been discussed in the academic literature concerning whole-genome SNP data are covered in this section. 25 Table 2.4: Columns of VCF (Variant Call Format)- a genotype file . Columns Meaning #CHROM an alphanumeric string of chromosome POS an integer displaying the position in the chromosome ID an alphanumeric string of the identification REF reference base(s) ALT alternate non-reference base(s) QUAL the quality score for the assertion made in ALT FILTER PASS passed in all filters otherwise should contain the reason INFO all about INFO described earlier separated by a semicolon FORMAT all the format IDs separated by a colon IDs a tab-separated list with sample identification We have already seen that the high dimensionality of data sets like wholegenome SNP data presents a challenge. There are several dimensionality reduction methods available. There are various dimensionality reduction techniques such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), linear discriminant analysis (LDA), Independent Component Analysis (ICA), and uniform manifold approximation and projection (UMAP) used by researchers. Nahlawi [2010] proposes a comparison of various dimensionality reduction algorithms for genetic feature selection, which identifies the most informative subset of genetic features from a large and complex set of genetic data. The algorithms compared in the study included popular dimensionality reduction techniques such as PCA, ICA, and Fast Orthogonal Search. Yan and Wang [2022] used popular dimension reduction methods such as PCA, LDA, t-SNE, UMAP, and ICA to reduce the dimensionality of large and complex omics data, such as genetic 26 and genomic data while preserving the most relevant information. However, these methods are not employed in this thesis because of their computational complexity. Various ML models such as Random Forest (RF), Gradient Boosting Machine (GBM), Support Vector Machine (SVM), and Gradient Boosting Tree (GBT) are useful for searching SNPs in high-dimensional SNP data. Szymczak et al. [2009] applied RF and GBM to find important SNPs for GWAS. Mieth et al. [2016] proposed a two-step novel algorithm whose first step was to use SVM for training for determining a subset of SNPs. Behravan et al. [2018] utilized GBT followed by SVM to identify important SNPs. 2.4 Penalized Methodologies in the existing literature: A Brief Overview In this section, we discuss the evidence that the penalized approaches for whole-genome SNP data have been used in the literature. It is included for discussion here if a published study analyses at least one penalized model. Penalized models are increasingly common for dealing with high-dimensional data. The basic concept of any penalized technique is to add a penalty term to the loss function of the model that shrinks the coefficients toward zero. This helps to select important variables and avoid over-fitting. Several types of penalized models have been implemented in this study, such as LASSO, Ridge, Elastic Net, and Group LASSO, and their variations. LASSO and Ridge are two of the most popular methods that have been widely used in genomics research. LASSO imposes an l1 penalty that sets 27 some coefficients to exactly zero, thereby performing the variable selection. Ridge imposes an l2 penalty that shrinks the coefficients towards zero but does not set any of them exactly to zero. Elastic Net is a combination of both LASSO and Ridge. Group LASSO is another method that performs variable selection at the group level, meaning that it selects entire sets of variables together. Guo et al. [2019] aims to improve the power to detect genetic variants associated with quantitative traits by combining the SGL and the linear mixed model (LMM). This is because the SGL can identify a small set of relevant variables, while the LMM can account for the effects of fixed and random factors on the outcome variable. SGL-LMM creates a fixed zero effect to learn the parameters of random effects using LMM and then uses SGL regularisation to estimate fixed effects. Li et al. [2015] proposed a Bayesian Group LASSO approach for variable selections in non-parametric varying-coefficient models. Group LASSO results can be viewed as posterior mode estimates from a Bayesian perspective. However, this thesis neither focuses on applying variations of Group LASSO (except SGL) nor combining any method with LMM. In the next section, there is a discussion of a comparison study of the existing penalized methodologies. 2.4.1 Comparative Studies of penalized methodologies Waldmann et al. [2013] used the LASSO and the Elastic net to perform GWAS on two simulated data sets. They compare the performance of the 28 LASSO and the Elastic net using various performance metrics, such as the prediction accuracy, and the false positive rate, concluding that both methods should be used for analysis. In their study, Okser et al. [2014] illustrated the similarity and differences in the behavior of various common regularized models such as LASSO, Elastic Net, and regularized-least squares (RLS) wrappers on two data sets. The findings of their study concluded that LASSO and Elastic Net showed similar prediction behavior but they suggest using a large number of variants for better performance when compared with the greedy RLS wrappers. They also found that these methods are well-suited for important SNP identification. Srivastava and Chen [2010] compare Stochastic Search Variable Selection (SSVS), LASSO, and Elastic Net using genotype data of 60 unrelated individuals from the CEU population in the Hapmap project. Their study provides that all three methods handle data with more features and fewer samples. The results of their study demonstrate that SSVS outperforms LASSO and Elastic Net based on the quality metrics used. To ascertain the efficiency in identifying actual causative SNPs, Zuber et al. [2012] undertake a thorough comparison research involving five advanced regression approaches to determine the effectiveness among them in finding causal SNPs. These approaches include boosting, LASSO regression, regression with NEG prior, regression with MCP penalty, variable shrinkage by CAR (correlation-adjusted marginal correlation) scores, and a univariate strategy (marginal correlation). They found that the CAR-based algorithm outperforms all competing algorithms in terms of correctly recovered causal SNPs and SNP ranking. 29 2.4.2 Algorithms developed based on penalized methodologies Several alternative algorithms have been created by combining the various statistical and machine-learning methodologies, some of which are covered in the previous section. A few of these techniques are discussed below. Hoffman et al. [2013] develop and evaluate a combined algorithm and heuristics framework incorporating penalized maximum likelihood penalty. PUMA (Penalized Unified Multiple Analysis) is a unified framework that aims in solving problems of proposed penalized multiple regression (PMR) algorithms. The challenges include computational speed, poor performance on genome-scale simulated data, and identification of too many associations for real data to be biologically plausible. This framework incorporates the penalized maximum likelihood penalties for GWAS analysis (i.e., LASSO, Adaptive LASSO, NEG, and MCP) and a penalty that has not been used for GWAS analysis before (i.e. LOG). They found that the developed framework has high performance than existing PMR methods. It also shows reliable results to increase the detection power of associations between SNPs and phenotypes. Zuber et al. [2012] developed a novel multivariate algorithm for largescale SNP selection utilizing CAR score regression, a promising new method for ranking biomarkers. The algorithm is likely based on statistical methods, such as penalized regression, or machine learning. The results depicted the CAR-based algorithm’s superiority over existing models for finding casual SNPs. 30 Motivated by the researchers’ development of algorithms like PUMA to detect association, this thesis aims to develop a well-established workflow combining penalized models to identify important SNPs. 2.5 Neural Networks in the Existing Literature: A Brief Overview Various deep-learning methods like Artificial Neural Networks (ANNs), Deep Neural Networks (DNNs), and Convolutional Neural Networks (CNNs), and their variations have been applied to whole-genome SNP data to identify genetic variants such as SNPs, predict the phenotypic outcomes of SNP patterns, and understand the underlying biological mechanisms of diseases and traits. The evidence of the neural networks published in the previous studies using the whole-genome SNP data is discussed in this section. Though the use of deep learning in the current research is scarce, we include the possible studies involving neural networks. Badré et al. [2021] compared various computational models such as Best Linear Unbiased Prediction (BLUP), DNNs, and other statistical models to estimate breast cancer polygenic risk scores (PRS). Their study findings concluded that DNNs outperformed various machine learning and statistical algorithms such as BLUP. Further interpretation of DNNs identifies important features for DNN predictions. These variants were also found to be associated with phenotype through non-linear relationships. Liu et al. [2019] proposed a deep-learning framework using CNNs to predict quantitative traits from genetic variants such as SNPs. Additionally, 31 they investigated genetic contributions to traits using saliency maps. They also found that this deep learning model achieved more accurate results by bypassing the imputation of missing values and treating them as new genotypes. In this thesis, the existing deep learning models like DNN and CNN, and their variations are not used for SNP identification. The next section provides a further comparison of various deep learning models with existing linear and non-linear models. 2.5.1 Comparison studies involving neural networks Romagnoni et al. [2019] utilized various GBT models such as XGBoost, LightGBM, and CatBoost, together with ANNs using one or more hidden layers to identify and classify genetic markers such as SNPs. Their study found that the non-linear methods such as GBT or ANNs are complementary to each other for identifying genetic markers and have similar prediction performance. To predict phenotypic information from SNP data, Liu et al. [2019] suggested an independent deep CNN model. They were the first to choose significant biomarkers (SNPs) from their training model using a saliency map deep learning visualization technique. The proposed framework was compared with various traditional statistical techniques, including Bayesian ridge regression (BRR), Bayesian Lasso, and BLUP. They found that their deep-learning framework outperformed these traditional models in detecting biomarkers and their interactions. 32 2.5.2 Algorithms developed based on neural network Although research on neural networks is scarce, some studies focus on combining statistical and deep learning algorithms to form an algorithm to increase model performance or perform feature selection. This section discusses different algorithms applied together to improve the performance or perform comparative studies. Waldmann [2018] aims in solving the high-dimensionality problem using Approximate Bayesian Neural Networks (ABNNs) to identify important SNPs. They evaluate the performance of ABNNs in the context of genomic prediction. The ABNN model was implemented in MXNET; it was demonstrated that the model produce greater prediction accuracy than Bayesian LASSO and genomic BLUP. This model could be used for providing information on important SNPs. The performance of a non-parametric Bayesian approach in the form of a Bayesian neural network (BNN) for detecting causal SNPs in genetic association studies was evaluated by Beam et al. [2014]. BNNs are a type of machine learning model that incorporate Bayesian statistical methods to make predictions based on large and complex data sets. Using real and synthetic data, they found that BNNs outperform commonly used approaches for finding SNP interactions across a wide range of potential genetic links. An algorithm for creating a single hidden layer feed-forward neural network was described in the study by Setiono and Hui [1995]. Using the quasiNewton method, this algorithm stands out because it minimizes the series of error functions connected to the expanding network. The algorithm is quite 33 effective and reliable, according to experimental data. Motivated by the algorithms developed by Waldmann [2018], Beam et al. [2014], Setiono and Hui [1995] a cascaded workflow of an ANN followed by Bayesian inference, is developed in this thesis using RStan for important SNP detection. The prediction performance of BNN-based workflow is compared with CNN’s performance. The quasi-Newton method is used for optimization while predicting the genotypes in the developed workflow. 34 Chapter 3 Research Methodology The chapter revolves around the two major workflows proposed in this research using Machine Learning and Deep Learning methodologies. This chapter is divided into two main sections: Machine Learning and Deep Learning for GWAS. The corresponding subsections describe the experimental research design, theory of the models, and proposed workflow. 3.1 Machine Learning for GWAS In this section, the focus is on research that employs penalized methodologies. The section covers a range of topics, including the design of the research study, the underlying theory behind penalized classification and regression models, K-fold cross-validation, and a workflow based on penalized models (PentaPen). 35 3.1.1 Experimental Research Design In this study, we develop an improved penalized-method-based workflowPentaPen- to find important SNPs combined with different penalized models. “Penta” and “Pen” are abbreviated for “five” and “penalized models” respectively. PentaPen is a classifier and regressor which is developed for SNP identification. It aims to minimize the value of the loss function while simultaneously optimizing performance metrics. Firstly, all the SNPs of wholegenome SNP data are utilized for training Ridge, LASSO, and Elastic Net. The union of the output SNPs from these three models is taken as the selected SNPs known as SNP Pooling. Secondly, the selected SNPs are sent to train Group LASSO and SGL, and the union of the output SNPs from the two is the final output of the PentaPen. Finally, an aggregated model is developed by combining the predictions of all five penalized models; this model is used to calculate the performance metrics of PentaPen. Since R is the programming language used in this study, pre-processing is needed to make the data compatible with penalized method packages in R. After loading genotype (.ped) and phenotype (.pheno) files obtained from the easyGwas website in R and the data is pre-processed by converting the chromosomal nucleotide to numeric values and creating a design matrix of dimension p × n. The matrix is transposed to form a matrix of the same dimensions as the genotype data. Null values in the phenotype data are removed; they are imputed with the mean across all columns for the SNPs in the genotype data. The process above results in the pre-processed data. In pre-processing SNP data, we choose not to remove SNPs in Linkage disequilibrium (LD) (Korte and Farlow [2013]) and rare variants. Excluding 36 LD and rare variants can lead to losing important genetic information in certain populations, so keeping them helps provide a complete picture of genetic architecture. Although we do not check for the potential interaction effects (LD) between SNPs, keeping SNPs as is can improve model performance. Rare variants of SNPs were not checked for in our data set, but including them is beneficial as they can have a larger effect size than common variants and contribute to phenotypic traits. Although haplotype information could provide better results for imputing SNPs, it can be complex and requires more computational resources and time. According to Shi et al. [2018], IMPUTE2, a haplotype imputation software, was the most time-consuming. Additionally, haplotype information is not always available or accurate for plant genome data. Imputing with the mean is preferred for efficiency and good results when the missing proportion is low. Although, there were no missing values in the data used in this study, imputing by mean would be better for missing at random (MAR) data, and generally, the whole-genome SNP data is MAR data. After pre-processing, the resulting clean data are split into training and testing folds. Further, for 5-fold cross-validation (CV), the models are fitted using R-packages- glmnet (Friedman et al. [2010]) for Ridge/ LASSO/ Elastic net, gglasso (Yang and Zou [2015]) for Group LASSO, and SGL (Simon et al. [2018]) for SGL. The data type for chosen phenotypes was binary, categorical, and continuous; hence, this study carries out both regression and classification using the stated methodologies. All the feature selection methodologies are utilized using the same R packages for regression and classification. The family argument is used as a binomial in the glmnet package for classification whereas the default Gaussian family is used for regression. The loss function for Group LASSO and SGL is considered as 37 logit for classification. However, the default least squares loss function is used for regression. These steps are followed to train the models in PentaPen. PentaPen is implemented in-parallel across 5-folds using the fitted model to get output as follows from each penalized method: (a) the SNPs’ coefficients higher than the mean absolute values are recorded to output the important SNPs based on their coefficients and (b) the prediction for the phenotype is carried out on both training and testing sets using the best λ reported using CV. The steps to choosing the best λ are explained in Section 3.1.2.6.1. Further, to compare the performance of different models, we use different metrics which are introduced in sections 3.1.2.7.1, 3.1.2.7.3, and 3.1.2.7.2 with the help of an optimal cutoff. The workflow allows the comparison of single penalized models among each other to gain insights into which model is most effective in reducing over-fitting and improving the model performance. The purpose of comparing the performance of each model among themselves is to give a deeper understanding of the strengths and weaknesses of each penalized model and to identify a more reasonable number of SNPs. Additionally, to evaluate the performance of PentaPen, the following results are interpreted: the comparison is made between (a) the performance of the workflow against all the five penalized models using all SNPs, (b) the computational time with every single model (Ridge, LASSO, and Elastic net), and (c) the number of SNPs selected by each penalized model. Finally, the important SNPs from PentaPen are validated to locate genes. They are also compared with the SNPs output set of GAPIT and TASSEL. The proposed workflow aims to enhance the confidence of the selected SNPs by leveraging the beneficial properties of five penalized methodologies. As a result, combining multiple penalized models can improve performance 38 by reducing over-fitting as compared to using one model. The workflow for SNP identification can also increase the confidence in choosing an SNP set as it is more probable to select informative SNPs than using a single penalized model. Hence, the union of SNPs from Group LASSO and SGL allows for further analysis and selection of a reduced number of SNPs since SGL’s sparsity group-wise and within-group results in too few or no SNPs. The schematic diagram in Figure 3.1 illustrates the steps in this study. 3.1.2 Methods In this section, I introduce the five penalized models that are combined to develop PentaPen. The models that are used for detecting potentially important features (SNPs) are Ridge, LASSO, Elastic Net, Group LASSO, and SGL. The loss function of models (Ridge, LASSO, and Elastic Net) used before SNP Pooling was the same while their penalties changed from l1 , l2 , and both respectively. After SNP Pooling, the models (Group LASSO and SGL) chose features based on distinct groups. This study uses K-fold CV to find an optimal λ throughout. 3.1.2.1 Ridge Whittaker et al. [2000] created the first proposal for using Ridge regression for prediction in quantitative genetics. The Ridge has an l2 penalty (||θ||) calculated by Euclidean distance metric as follows: l2 -norm = ||θ|| = p X θ2 . (3.1) i=1 39 Figure 3.1: Study Design for penalized models flowchart. This flowchart illustrates the study design for penalized models starting from Pre-processing to important SNP identification and validation. The dashed box indicates the parallel computing across 5-fold. For p predictors, matrices Xn×(p+1) and Zn×1 are represented for SNPs and phenotype respectively. The objective function of Ridge (Hoerl and Kennard [1970]) with p predictors consists of the loss function ( n1 ||Z − Xθ||2 ) and 40 penalty term (λ||θ||). Mathematically, it is represented as: 1 argmin( ||Z − Xθ||2 + λ||θ||) n θ (3.2) where n1 ||Z − Xθ||2 is the loss function, λ||θ|| is the penalty term, and λ ≥ 0 is the tuning parameter to be estimated using cross-validation. 3.1.2.2 LASSO The Least Absolute Selection and Shrinkage Operator (LASSO) has an l1 penalty (|θ|) calculated by the Manhattan distance metric represented as: l1 -norm = |θ| = p X |θ|. (3.3) i=1 Suppose the p predictors (SNPs) are arranged in the design matrix Xn×(p+1) and phenotype is arranged as matrix Zn×1 , then the objective function of the LASSO model (Tibshirani [1996]) for p predictors is: 1 argmin( ||Z − Xθ||2 + λ|θ|), n θ (3.4) where n1 ||Z − Xθ||2 is the loss function, λ|θ| is the penalty term, and λ ≥ 0 is the tuning parameter to be optimized using CV. Unlike Ridge, LASSO can produce more decipherable models by making certain coefficients exactly equal to zero. 3.1.2.3 Elastic Net The Elastic net, proposed by Zou and Hastie [2005], combines l1 and l2 penalties. As a result, the Elastic net benefits from both the Ridge and 41 LASSO model’s characteristics. The objective function of Elastic net for p predictors is: 1 argmin( ||Z − Xθ||2 + λ[α|θ| + (1 − α)||θ||]), n θ (3.5) where parameter α determines the mix of penalties and is chosen on qualitative grounds. A sparse model is produced by the l1 norm portion of the penalty, while very large coefficients are shrunk by the l2 norm portion of the penalty (Mahdi et al. [2021]). 3.1.2.4 Group LASSO The Group LASSO model separates the predictor variables into g distinct groups. This can be useful for handling categorical data, where the groups may represent factor-level indicators. Vector θ should be approximated so that it only needs a few of those groups as opposed to just sparsity. Yuan and Lin [2006] proposed the objective function of the Group LASSO model for a symmetric and positive definite kernel matrix to estimate θ̂ as: argmin(||Z − θ g X i=1 2 (Xi θi )|| + λ g X (||θiT Ki θi ||1/2 )), (3.6) i=1 where Xi is a n × (pi + 1) sub-matrix of X (columns corresponding to SNPs in group i) and θi is a sub-vector of coefficients of length pi + 1; Ki = pi Ipi denotes the kernel matrix; the regularisation parameter is denoted as λ ≥ 0. 42 3.1.2.5 Sparse Group Lasso (SGL) A sparse set of groups is produced by Group LASSO where including a group in Group LASSO keeps all coefficients in the group as non-zero. For instance, when predictors are SNPs and the important SNPs have to be discovered then both sparsity of groups and within each group should be considered. Simon et al. [2013] proposed the SGL estimator model as: argmin(||Z − θ g X 2 (Xi θi )|| + (1 − α)λ i=1 g X ||θiT Ki θi ||1/2 + αλ|θ|). (3.7) i=1 where λ ≥ 0 is the tuning parameter and α ∈ [0, 1] is a convex combination and α = 0 gives Group LASSO model and α = 1 gives a LASSO model. Similar to Group LASSO, SGL chooses features on a group basis, and specific features group-wise and within each selected group. 3.1.2.6 K-Fold Cross-Validation The study employs K-fold CV to determine an optimal value of λ, which is used to evaluate the performance of the penalized models during validation and training. The purpose of it is to find a best-fit model for the database of the study and flag problems like over-fitting (Berrar [2019]). A method to train the model using a subset of the data and employing the complementary subset of the data for the model validation is known as cross-validation (CV). K-Fold CV (k-fold CV) is one of the methods of CV. In k-fold CV (Jung and Hu [2015]), all observations are randomly divided into k parts/folds of approximately equal size. At every iteration, there is a different subset reserved for testing. 43 3.1.2.6.1 Best λ To tune the hyper-parameter (λ) in k-fold CV, a grid of values for λ is used. The validation performance metric is calculated for each λ within each fold. Further, the overall CV performance metric is calculated for each λ. The λ with the optimized metric is located and is said to be the minimum CV λ or best λ. Model performance in this study is assessed using a 5-fold CV. The best λ considered is the λ within 1 standard deviation above the minimum λ. 3.1.2.7 Performance Metrics This section covers the various performance metrics used for evaluating the models’ performance. It can be noticed from Section 1.7 that three types of phenotypes are used in this study: binary, continuous, and categorical. Hence, there will be three types of evaluation metrics as described below to perform classification, regression, and multi-class classification. 3.1.2.7.1 Classification The binary phenotype, Anthocyanin, in our study, is imbalanced where the probability of class 1 is 36% and the probability of class 0 is 64%. To predict the presence of anthocyanin, we need to weigh the positive class or class 1 more than class 0. Hence, the cost-efficient evaluation metrics, F1 score, Precision, and Recall are used (Brownlee [2020]). Further, AUC is used to measure the classifier’s ability to distinguish between both classes. Precision measures the accuracy of positive predictions. It is the proportion of positive predicted cases that are truly positive. Mathematically, 44 the precision can be expressed as follows: Precision = TP , TP + FP (3.8) where TP and FP are True Positive and False Positive respectively. Recall measures the completeness of positive predictions. It is the proportion of all positive cases that are correctly predicted positive. It calculates the number of Actual Positives that the model detects by labeling it as Positive (True Positive). When there is a high cost associated with False Negative (FN), Recall is the model metric used to select the better model compared to other models. This metric can be mathematically written as: Recall = TP . TP + FN (3.9) The F1 score is the weighted average of precision and recall. The F1 score have the following formula (Goutte and Gaussier [2005]): F1 score = 2 ∗ Precision * Recall . Precision + Recall (3.10) Sensitivity (or true positive rate or Recall) is the probability that a test will result in a true positive outcome. The likelihood of a negative test provided itact negative is referred to as specificity (or true negative rate). The true negative rate (TNR) and the false positive rate (FPR) are calculated as follows: TN , and TN + FP FP FPR = 1 − Specificity = . TN + FP Specificity (TNR) = (3.11) (3.12) A graphical representation of the trade-off between sensitivity and specificity at each “cut-off” for any diagnostic test is provided by the receiver operating characteristic (ROC) curve (Park et al. [2004]). 45 The optimum “cutoff” value with the highest sensitivity and specificity is generally accepted as the one that maximizes the area under the ROC curve (AUC), providing the best balance between sensitivity and specificity. AUC represents the discriminatory power of a diagnostic test (Faraggi and Reiser [2002]). AUC’s computing cost, however, is greater than that of Accuracy and F1 score (Hossin and Sulaiman [2015]). 3.1.2.7.2 Regression The performance metrics R-squared and Root Mean Square Error (RMSE) were averaged across 5-folds for the continuous phenotype in this study to evaluate regression model performance. R-squared measures how much of the variation in the dependent variable the model can account for. Mathematically, R-squared is represented as: R2 = 1 − SSError . SST otal (3.13) Error where SS is the proportion of total variation that cannot be explained SST otal by the model. Mean Squared Error (MSE) denotes an absolute measure of the fit’s goodness. MSE is a measure of the average of the squares of the errors between the predicted and actual values in a regression analysis. It is determined by adding together the squares of the differences between the real output and the predicted output and dividing the result by the total number of data points. The square root of MSE is called Root Mean Square Error which is abbreviated as RMSE. It can be defined in terms of the sum of the square of errors as: s RM SE = SSError N −p−1 (3.14) 46 where SSError is the unexplained variability of the dependent variable. 3.1.2.7.3 Multi-class classification This study uses Accuracy aver- aged across 5 folds to evaluate the multi-class classification model’s performance for the categorical phenotype. Accuracy measures the true positive and true negative outcomes ratio for the entire data. It estimates the number of correct predictions over the total number of predictions. The mathematical way of representing accuracy is (Goutte and Gaussier [2005]): Accuracy = 3.1.3 TP + TN . TP + TN + FP + FN (3.15) PentaPen: A Comprehensive Workflow Workflow 1 summarizes the steps to develop PentaPen, using five penalized models to detect potentially important SNPs. The data are fitted to train and validate the models as explained in Section 3.1.1. Note that the groups of SNPs that are used in Group LASSO and SGL are created using hierarchical clustering (Nielsen and Nielsen [2016]). As described in Section 3.1.1, the evaluation metrics, predictions, and selected SNPs from the models are recorded across the 5-fold. The reasons for using SNP Pooling are as follows: the results of SGL and Group LASSO exhibit potential over-fitting and poor model performance when using all SNPs. Training Group LASSO and SGL using all the SNPs require extensive computational resources (the system runs out of memory) for some phenotypes (preliminary results not achievable and not shown here). 47 Workflow 1 PentaPen: Proposed Workflow 1: Input: Genotype .ped file and Phenotype .pheno file 2: Pre-process the data 3: for phenotype in phenotypes do 4: Initialize the folds to 1 5: while folds <= 5 do 6: Split the data into training (80%) and testing (20%) folds. 7: Train and validate Ridge, LASSO, and Elastic Net models using glmnet. 8: 9: for each model do Predict the phenotype value for both training and testing folds. Find λ within 1 standard error of the minimum obtained by an inner 5-fold CV. 10: Record the appropriate performance metric using the optimal cutoff. 11: Record the potentially important SNPs from the model. 12: end for 13: Filter SNPs by taking the union of SNPs from the Ridge, LASSO, and Elastic net. Create groups of SNPs using Hierarchical Clustering. 14: Utilize filtered SNPs and formed groups to train and validate Group LASSO and SGL using R functions grplasso and SGL. Repeat steps 9 and 10. 15: Take the union of the potentially important SNPs from both Group LASSO and SGL. Increment the fold number. 16: end while 17: end for 18: Output: The top 10 important SNPs for each phenotype. 48 A few improvements of PentaPen over the existing literature based on the results are stated as follows: Combining multiple penalized models leads to more reliable results by mitigating the limitations and biases of individual penalized models. It also identifies a potentially reasonable number of SNPs as the same SNPs are likely to be identified by multiple methods, thereby reducing the risk of false negatives. It improves the model’s performance by addressing issues like over-fitting (Hawkins [2004]). It also improves SNP identification (due to SGL’s sparsity). 3.2 Deep Learning for GWAS This section is centered around the use of deep learning models. The various topics covered in this section include the study’s design, which is based on utilizing neural networks. Additionally, the theory behind hypothesis testing and neural networks is discussed. The chapter also proposes a workflow based on a Bayesian neural network (BNN). 3.2.1 Experimental Research Design In this study, we propose BayesDL (a BNN-based workflow) to identify important SNPs. BayesDL is a cascaded classifier and regressor of an Artificial neural network using Bayesian inference. The model is fitted using R and Stan. Firstly, the whole-genome SNP data undergoes preliminary feature (SNP) selection using hypothesis testing procedures to reduce the size of the data. It is needed to make the data compatible with neural network (NN) architectures (especially BNN which also includes probabilistic distributions) 49 as they require extensive computational resources (more than 32GB RAM when using all SNPs as input layer). Secondly, the BNN model is defined using Stan; the defined model is further used for SNP identification and test performance evaluation. Finally, the existing CNN model is utilized for comparison based on test performance metrics which aids in determining the superiority of BayesDL. The final output (important SNPs) from BayesDL is used for further biological analysis. The following paragraphs provide steps followed to perform preliminary feature selection and define the research work required to train and validate the NN-based models. Before conducting analysis, it is necessary to pre-process the SNP data to ensure compatibility with the NNs used in this study. To achieve this, several steps are taken which are similar to the steps mentioned in Section 3.1.1. After completing the pre-processing of the data, preliminary feature selection is carried out using hypothesis tests (see Section 3.2.2.1). Although the independent variables (SNPs) are categorical, the dependent variable (phenotype) can be either categorical/binary or continuous. Due to this, two hypothesis tests are utilized to select the SNPs that are highly associated with the phenotype. The chi-squared test (explained in Section 3.2.2.1.1) is utilized for categorical phenotypes (with two or more categories), while ANOVA (see Section 3.2.2.1.2) is used for continuous phenotypes. The subsection below describes the stages in which preliminary feature selection is performed. 50 3.2.1.1 Preliminary feature selection The preliminary feature selection from the original data is done in the following steps using an R-package, stats (Team et al. [2020]): 1. Perform Chi-square and ANOVA tests to evaluate the association between the response variable (phenotype) and each predictor variable (SNP) under the null hypothesis. 2. Use the test statistic of Chi-square (Eq. 3.16) and ANOVA (Eq. 3.17) to find the p-value of each association. In R, the summary() function and chisq.test()$p.value are used for ANOVA and Chi-squared respectively. 3. Find the p-th quantile (using quantile() function in R) and use the significance threshold appearing within the range of 0.5−2.5% quantile. 4. Record the number of SNPs for various thresholds of the significance level (ranging from 0.1 to 1e − 17). The number of selected features for each phenotype, based on six different thresholds, can be found in Table 3.1. The preliminary feature screening is preferred due to the heavy computational requirements (RAM of at least 32GB) of NNs, especially BNNs. Additionally, this is preferred as NNs perform well when the number of samples (n) is larger than the number of predictors or SNPs (p). 51 Table 3.1: The number of SNPs selected for the corresponding significance level threshold. The chi-square test and ANOVA are used to record these thresholds. The marked number of SNPs are finally used as input of Neural networks. Tests Phenotype Significance level 0.10 0.05 0.01 1e−3 1e−4 1e−17 Anthocyanin 26503 15590 4580 729 118 − Germ 27930 16246 5007 991 238 − Width 26225 13857 3182 408 96 − DTF 83074 69127 47280 29423 19154 96 Chi- square ANOVA Further, the filtered data obtained from preliminary feature selection is split into 50% training and 50% testing folds which are swapped later. Since using a 5-fold CV approach is computationally expensive for NNs specially BNNs, hence, performing a 5-fold CV is out of the scope of this study. In addition, splitting the data in half avoids the model from memorizing the training data, hence, reducing the risk of over-fitting. This splitting also ensures the model’s consistency. Both the NNs are trained and validated on a train-test split and vice-versa for further analysis. The proposed BNNbased model is created using the RStan library by defining various blocks like data, model, and parameter blocks in the Stan file. These blocks are described in detail in Section 3.2.3.1. This model is first trained using the training set and the appropriate performance metrics (see Section 3.1.2.7) of the model are recorded using the testing set in R. Further, the role of the training and testing set is swapped and the BNN model is trained and validated again. Finally, BayesDL output the important SNPs by following 52 the steps described in Section 3.2.3.3. The top 10 SNPs are validated to identify genes and compared with the output of GAPIT and TASSEL. The superiority of BayesDL is confirmed by comparing the performance of the BNN model with an existing NN model. We choose the CNN model for comparison; it is employed using the Keras library to perform a classification or regression task. In the training process of CNN, Convolutional 1D layers are utilized and the performance metrics are generated using a testing set. Similar to the BNN model, the train and test sets are swapped, and the performance metrics based on the test set are recorded. The recorded metrics from the two test sets of the BNN model and CNN model are used for comparison. To make comparisons compatible this study uses the same architecture and activation functions for both the NN models: (a) The architecture of the NN models contains an input layer with p predictors, 2 hidden layers with 50 neurons in each layer, and an output layer. For CNN, an additional parameter is added in the input layer to specify the shape of the input as a p − dimensional vector. To choose the optimal number of hidden layers and neurons, a common approach was employed. Initially, a relatively simple architecture with a small number of hidden layers and neurons was used, and the complexity was gradually increased based on the test performance. The selected architecture provides a good balance between efficiency and complexity as the input data is not too large. (b) The hyperbolic tangent is used in hidden layers for both classification and regression. The softmax activation function is used in the output layer for classification. For regression, the identity or linear function is used as the activation function of the output layer as this function aids to output continuous (or real) values. (c) While 53 the model compilation of CNN, the Adam optimizer is used for all the phenotypes while binary cross-entropy, MSE, and categorical cross-entropy are the loss functions used for the binary, continuous, and categorical phenotype. BayesDL, used for SNP identification, increases confidence in the selected SNPs by leveraging Bayesian methods and deep learning advantages. Also, by providing evidence against the hypothesis that apriori SNPs are insignificant, the proposed workflow aids in increasing confidence in the selected set of SNPs. Unlike a CNN model, which uses a single value for the weights, BayesDL accounts for the uncertainty of those weights by using probabilistic distribution, thereby improving the confidence in the chosen SNP set. Additionally, using BayesDL is beneficial when working with small sample sizes, as it helps to reduce over-fitting, a common issue evident in CNN. Using the BNN model also results in less number of False negatives as compared to that when using CNN. The steps in this study are depicted in the Figure 3.2 schematic diagram. 3.2.2 Methods In this section, the focus is on the theory behind the methods used for NNbased research. The discussion covers several topics, starting with the theory of hypothesis tests that are utilized for preliminary feature selection. This is followed by an exploration of the different types of neural networks, including feed-forward and Bayesian neural networks. The section also delves into the theory behind Monte Carlo Markov Chain (MCMC) and Stan, which are important tools for Bayesian inference. Finally, the use of the Coefficient of Variation (CoV) for post-feature selection using the BNN model is explained 54 in detail. Figure 3.2: Study Design for Deep Learning flowchart. This flowchart depicts the experiment design for deep learning, outlining the key steps involved in building and training neural networks. The flowchart follows the preprocessing till important SNP identification and validation. 55 3.2.2.1 Leveraging Hypothesis Testing for Preliminary Screening A statistical technique called hypothesis testing is used to assess the relationship between genetic variations (SNPs) and a certain disease or trait. It is a key technique for feature selection in genetic research because it enables researchers to identify a subset of genetic markers that are most useful in predicting the desired outcome. Several statistical methods, including machine learning algorithms, logistic regression, and linear regression, can be used to evaluate hypotheses for feature selection (Jin et al. [2018]). The goal of hypothesis testing for feature selection is to identify the most important SNPs that are associated with the phenotype of interest and can be used for further analysis or prediction. To find the genetic variants most likely to be associated with the characteristic of interest, genetic association studies sometimes employ hypothesis testing such as a t-test for feature selection (Zhou and Wang [2007]). Due to the significant computational resources (32GB RAM) required by NN models, in this research, preliminary SNPs from whole-genome SNP data are chosen using the Chi-square and ANOVA tests for binary/categorical and continuous phenotypes, respectively, to construct the input set for these models. The NN models work best with smallp-large-n data, and doing preliminary feature selection solves the problem of high-dimensional data. 3.2.2.1.1 Chi-square Test In 1900, Karl Pearson initially developed a chi-square test for testing the goodness of fit and later worked on the family of chi-square tests. The Chi-square method is a commonly used approach for selecting features when both independent and dependent variables are categorical (two or more than two). It involves evaluating SNPs on an in56 dividual basis with respect to specific classes. However, in order to apply this method to continuous-valued features, the range of values must first be divided into intervals. The Chi-square method for categorical features works by comparing the observed frequency of a class to the expected frequency of that same class, allowing for meaningful feature selection. Let Oi be the number of observed frequencies from the Ci class among the N samples. Oi should occur with expected frequency Ei = |Ci| . Mathematically, the ChiN squared test statistic of an SNP is represented as (Jin et al. [2006]): χ2 = C X (Oi − Ei )2 i=1 Ei (3.16) The null hypothesis of a chi-square test is that the observed frequencies are close to the expected frequencies in a given data set. In other words, the test is used to determine if there is a significant association between two categorical variables. The larger the χ2 value, the more informative the corresponding SNP. 3.2.2.1.2 ANOVA Fisher [1954] developed ANOVA which is a statis- tical technique used to determine how a quantitative dependent variable changes based on one or more independent variables that are categorical in nature. The null hypothesis of an ANOVA test is that there is no significant difference between the means of three or more groups with respect to response in a given data set. In other words, the test is used to determine if there is a significant association between the independent variable (group or category) and the dependent variable (continuous variable). The F-statistic is used to measure the significance of each predictor variable (or feature) in explaining the variation in the response variable. Mathematically, the F- 57 statistic is represented as (Shakeela et al. [2021]): F = SSB/(p − 1) SSE/(n − p) (3.17) where SSB is the sum of squares between groups, SSE is the sum of squares within groups, p is the number of predictors (or features), and n is the sample size. A larger F-value indicates a stronger association between the predictor variables (or features) and the response variable. 3.2.2.2 Neural Networks The neural networks (NNs) that are used for predictions are CNN and BNN. Both networks are fully connected however weights in Bayesian are assigned as probability distributions instead of a single value. To calculate the degree of uncertainty in weights and predictions, probability distributions of weights are utilized. Unlike BNN, CNN uses optimization algorithms to find the optimal set of weights and biases to optimize the loss. Instead, BNN aims to find the posterior distribution that best fits the data which is accomplished by MCMC or variational inference (VI) explained following subsections. BNN is further used for the identification of important SNPs. 3.2.2.2.1 Convolutional neural network The deep learning technique is also known as “multilayer perceptrons” (MLPs) and consists of several layers connected by neurons (Arora et al. [2015]). Figure 3.3 is a visual depiction of a fully connected feed-forward NN, CNN, consisting of several layers (an input layer, m hidden layers, and an output layer). The lines in the figure display the linked neuron which combines the weighted combinations of the inputs to generate the output parametric function. These are nonlinear 58 activation functions that combine linear transformations with a bias that symbolizes the activation threshold of the neuron. Figure 3.3: Fully Connected Feed-forward Neural Network Architecture. This diagram illustrates a neural network with an input layer, m hidden layers, and one output layer. The input, hidden layer, and output weights can be represented as θI , θHi , and θo respectively. bI , bHi , and bo can represent the biases of input, hidden, and output layers respectively. All the layers in this architecture, except the output layer (only when the response variable is standardized), contain bias. The width of the model is the number of neurons in each layer, and the depth of the model is the number of layers in the network. The output of the entire network is determined by the weights and biases connecting the neurons, but it also depends on the network’s width and depth (Uppu et al. [2016]). A group of neurons, 59 xj (j = 1, · · · , p), representing the input features (SNPs) make up the first layer, also referred to as the input layer. For the first layer and i − th neuron, each hidden layer neuron transforms the values from the preceding layer Pp using a weighted linear summation ci = j∈input θji xj + b0 , where θji are corresponding weights of the particular SNP node and b0 is the bias or the intercept of the regression equation; it is followed by the activation function g(ci ) resulting in the output zi (LeCun et al. [2015]). In matrix notation, it can be represented as: Zi = g(θi X) (3.18) where X is a design matrix and θi consists of all the weights and biases. The data set D = {xi , zi } is divided into training and testing sets for implementation. The disadvantage of neural networks’ extreme flexibility is that they are particularly prone to over-fitting. Over-fitting occurs when the learning algorithm performs a good enough job of optimizing the objective function to tune the model parameters for performance on the training set that the performance on new cases declines. When we select network weights that perform well on training data but badly on test data, over-fitting occurs. The issue gets worse as the network’s layers are increased. Over-fitting is mitigated by Bayesian NNs as it can provide estimates of the posterior uncertainty of the model, which can be used to identify when the model is uncertain about its predictions and when it may be over-fitting to the training data (Hernández-Lobato and Adams [2015]). 3.2.2.2.2 Bayesian neural network In Bayesian inference, the objec- tive is to determine the conditional distribution of the weights given the 60 training data. The conditional distribution is represented as p(θi |Dtrain ) which is also known as posterior distribution. Figure 3.4 describes the BNN consisting of an input layer with p predictors, 2 hidden layers with 50 neurons each, and an output layer that predicts a continuous, binary, or categorical phenotype. The optimal number of hidden layers and neurons is chosen by following a common strategy to start with a relatively simple architecture, with a small number of hidden layers and neurons. Gradually the number of hidden layers and neurons is increased, based on the performance of the test data until the metrics’ value except RMSE starts to lie within the range of 0.80 to 1. In this study, architectures with 2 hidden layers and 50 neurons in each layer provide a good balance between efficiency and complexity as the input data is not too large or complex. The input, hidden, and output layers’ weights and biases have posterior distributions that depend on the prior distributions as well. Given the training data Dtr , Bayes’ rule provides the posterior distribution as follows: p(θi |Dtr ) ∝ p(Dtr |θi )p(θi ), where p(Dtr |θi ) is the likelihood of training data provided by the model with parameters θi and p(θi ) is the prior distribution over the parameters (weights and biases). The Bayesian model average (BMA) is then used to provide predictions for a new test example (xnew ) given by: Z p(znew |xnew , Dtr ) = p(znew |xnew , θi )p(θi |Dtr )dθi (3.19) θi where the predictive distribution for given values of parameters θi is p(z|xnew , θi ). The parameters from many models are averaged together to provide the BMA estimate for thetas (or unknown parameters) (Hoff [2009]). Unfortunately, for neural networks, the BMA integral in equation 3.19 cannot be evaluated in closed form; hence, the approximate inference (for instance 61 MCMC; see Section 3.2.2.3) must be used. Furthermore, a complex posterior p(θi |Dtr ) with high dimensions makes approximation in equation 3.19 difficult. Wilson and Izmailov [2020] gives a thorough explanation of Bayesian deep learning. BNNs use BMA to average the predictions of multiple models, each with different values of the model parameters. This can help to reduce the variance of the model and improve its generalization performance, hence, reducing over-fitting (Lakshminarayanan et al. [2017], Neal et al. [2011]). Figure 3.4: Bayesian neural network with an input layer, 2 hidden layers with 50 units each layer, and one output layer. The neural network’s neurons are interconnected by lines, and each line has a weighted distribution in each layer. Additionally, the biases associated with each layer also have distributions. 62 3.2.2.2.3 Activation functions Hyperbolic tan (tanh) is used as the activation function for all the hidden layers. Tanh is symmetric around 0, hence scales the output values in the range [−1, 1]. The non-linear property of tanh captures the non-linear patterns in the data. For classification, the softmax function is used as the activation function for the output layer as it normalizes the output, ensuring that the sum of the probabilities for all classes is equal to 1. This can prevent the NN from biasing towards certain classes and help to improve its generalization performance. For regression, the linear or identity function is used as the activation function. The prior distribution for output weights is a normal distribution that is used to sample from the posterior distribution of the weights during the sampling procedure. Therefore, the output of the BNN for regression would then be computed as a linear combination of the input features and the sampled weights. 3.2.2.2.4 Priors The architecture of BNN used in this study is described in Figure 3.4. Since the weights and biases have distributions in the BNN model, hence the prior distribution for these parameters is the standard normal distribution (mean = 0, SD = 1). The standard normal distribution is one of the most widely used uninformative and conjugate prior. Moreover, the use of a standard normal prior can also help to regularize the model, since the prior effectively adds a penalty term to the likelihood function that encourages smaller weights and biases (also known as weight decay), thereby reducing the risk of over-fitting (Bishop and Nasrabadi [2006]). The BNN model is fitted using the programming language, Stan as explained in Section 3.2.2.3.1. A similar architecture as described above is used for CNN to fit the data with weights and biases being specific values instead 63 of having prior distributions. 3.2.2.3 Posterior Sampling Sampling can be used to approximate the integral in equation 3.19 by takP ing an average over N samples as: p(znew |xnew , Dtr ) ≈ N1 N i=1 p(y|xnew , θi ), where θi ∼ p(θi |Dtr ) are samples taken from the posterior. If simulated over a large number of times, a Markov chain created by Markov Chain Monte Carlo (MCMC) algorithms will produce approximations of samples from the posterior. However, in this study, we focus on the no-U-turn sampler (NUTS) (Hoffman et al. [2014]), an adaptive variant of Hamiltonian Monte Carlo (HMC) (Neal et al. [2011]), which is a generalization of the Metropolis algorithm(MA). HMC is a technique that generates samples that are asymptotically exact provided that the unnormalized posterior density p(Dtr |θi )p(θi ) and its gradient is available (Betancourt [2017]). The MA basic steps are to start somewhere, then propose a value nearby the current value of the iteration from the (symmetric) distribution called the proposal distribution, and then compute the acceptance ratio (or the posterior likelihood ratio = r). 3.2.2.3.1 Software Used A probabilistic programming language called Stan provides full Bayesian inference for a variety of models (Carpenter et al. [2017]). It works by allowing users to construct a probabilistic model using a high-level syntax, and then perform Bayesian inference on that model using MCMC or Variational Inference (VI) (Blei et al. [2017]). MCMC is a simulation-based approach for roughly approximating a model’s 64 posterior distribution. The goal of MCMC is to produce a series of samples from the posterior distribution where each sample depends on the sample before it. The NUTS, an effective and automatically tuned variant of the HMC method, is used in Stan to perform MCMC. HMC is a gradient-based MCMC technique that effectively explores the posterior distribution without getting stuck in local modes by using gradient information. A self-tuning variation of HMC called NUTS dynamically modifies the step size and direction while sampling actually occurs (Hoffman et al. [2014]). Conversely, VI uses a simpler distribution, like a Gaussian, to approximate the posterior distribution and then optimizes its parameters to reduce the Kullback-Leibler divergence from the true posterior (Blei et al. [2017]). Additionally, Stan offers powerful tools for model verification, diagnostics, and visualization, enabling users to comprehend the outcomes of their investigation on a deeper level. 3.2.2.4 SNP Identification using Post-Feature Selection In this study, the coefficient of variation (CoV) is used for SNP identification using the corresponding input weights of the BNN model. It is a useful tool for feature selection in machine learning and statistics. The CoV is a statistical measure that expresses the ratio of the standard deviation (SD) to the mean of a data set. Mathematically, it is represented as (Brown [1998]): CoV = SD mean (3.20) It is commonly used in fields such as biology, engineering, and finance to measure the variability or dispersion of a data set relative to its mean value 65 (Bedeian and Mossholder [2000]). A low CoV indicates that the data points are clustered tightly around the mean, while a high CoV indicates that the data points are spread out widely around the mean. The basic idea is that it can be used to measure the variability of a feature across different samples or observations. If a feature has high variability, it may not be informative for the prediction task and can be safely removed from the feature set. Conversely, if a feature has low variability, it may be informative and should be kept in the feature set (Ertuğrul and Tağluk [2017]). Hence, the minimum CoV is used to define the best feature. One of the advantages of using CoV is that it is a relative measure that is independent of the scale of the data set. One of the disadvantages of using CoV is that it is misleading in cases of negative values and zero. To address the limitations of using CoV, we have implemented a solution in this study. Specifically, we have excluded weights with a mean of zero and calculated the absolute mean to mitigate any misleading results that could arise from negative values or zero. So, mathematically, the CoV of input weights can be represented as: CoV = SD |mean| (3.21) where mean ̸= 0. 3.2.2.5 Performance Metrics As discussed in Section 3.1.2.7 various evaluation metrics are used as there are three types of phenotypes used in this study. For binary phenotypes, classification metrics such as Precision, Recall, F1 score, and AUC are commonly used to evaluate the performance of classification models. On the other hand, 66 for continuous phenotypes, regression metrics like R-squared and RMSE are utilized to assess the goodness-of-fit of regression models. Accuracy is the multi-class classification metric used for the categorical phenotype. 3.2.3 BayesDL: A Comprehensive Workflow Workflow 2 summarizes the steps to develop a Bayesian neural network or the proposed workflow. Load .ped and .pheno data files in R and pre-process both loaded data following the steps from Section 3.1.1. The pre-processed data is used for preliminary feature (SNP) selection which is performed using Chi-square and ANOVA tests as explained in 3.2.1.1. The data that has been filtered is divided equally into two sets - the training set and the testing set. These sets are then interchanged so that the set that was originally used for testing is now used for training and vice versa. This data set is used to train and validate the developed model. To develop the BNN model, two separate .stan files are created for classification and regression because the study includes binary, continuous, and categorical phenotypes. This BNN model is developed in Stan while R is used for prediction/ classification, sampling, and SNP identification. The following subsections describe the tasks carried out in Stan and R to develop BayesDL, and the steps followed for SNP identification. 67 Workflow 2 BayesDL: Proposed Workflow 1: Input: Genotype .ped file and Phenotype .pheno file 2: Pre-process the data 3: if categorical then 4: feature selection using Chi-square test 5: else 6: feature selection using ANOVA 7: end if 8: Split the data into training (50%) and testing (50%) folds. 9: Devolop Stan model using various blocks. 10: if categorical then 11: stan class = stan model(“nn class.stan”) 12: Define a function to compile stan class. 13: else 14: stan reg = stan model(“nn reg.stan”) 15: Define a function to compile stan reg. 16: end if 17: Call Stan’s optimizing and NUTs sampler to obtain point estimates and posterior samples from the compiled model. 18: for phenotype in phenotypes do 19: Fit train and test sets in the compiled model using 2 hidden layers with 50 neurons in each layer. 20: Compute and record test performance metrics. 21: Repeat the above process by reversing the roles of train and test sets. 22: end for 23: Rank the input SNPs by generating the samples of weights (weights ranked in increasing value of CoV) corresponding to predictors using the sampling function. 24: Output: The top 10 important SNPs for each phenotype. 68 3.2.3.1 Stan: Model Specification Since Stan consists of separate blocks for defining the model, to develop BayesDL, we use seven blocks for classification and six blocks for regression. The regression model has data, parameters, function, transformed parameters, model, and generated quantities block except for the transformed data block. The data block defines the parameters as the number of training and testing samples and predictors, feature matrix, outcome integer labels for classification/ vector for regression, the number of hidden layers, and the number of nodes in each hidden layer. The transformed data block for classification consists of an integer value of the number of output labels. All the weights and biases of each layer along with the row vector of the number of labels for classification and a real number σ for regression are defined in the parameter block. The function block comprises m hidden layers, each of which has the hyperbolic tangent (tanh) activation function. The last output layer is output by this prediction function block. For the final output of the output layer’s vector, use the function from the function block to fit the necessary data and parameter values in the transformed parameter block. The model block is used to define the prior distributions and the distribution for the output phenotype. All the weights and biases have a standard normal distribution for both regression and classification. For classification, the distribution of the outcome labels is modeled using a GLM with a categorical likelihood and a logit link function. This function is a softmax function 69 supported by Stan. For regression, the outcome is distributed normally with the mean of predicted output values and standard deviation as σ. Hence, this function is the identity/ linear activation function supported by Stan. Finally, the generated quantities block contains all the test outputs generated using the function model. The softmax and normal variate are used to generate the output for classification and regression respectively. The two files are saved as nn reg.stan and nn class.stan which are further read in R using the RStan library. 3.2.3.2 R: From Sampling to Inference Now the output files from the above section are used in R to specify the Stan BNN model using the stan model object (Gelman et al. [2015]). The steps described below are followed to deploy the defined model. Create a function that builds the appropriate model and produces the desired result. This is done by defining the stan data in the function and calling the optimization and sampling methods. The optimization helps to obtain Precision, Recall, F1-score, AUC for classification, Accuracy for multi-class classification, and R-squared, RMSE for regression. The sampling method aids in drawing posterior samples from the model and selecting important features or SNPs using the corresponding weights of each predictor. After developing BayesDL, the model was trained on the train and test sets using two hidden layers, each consisting of 50 neurons, to make predictions. Record the two test performance metrics from the optimizing method for each phenotype. The sampled weights corresponding to each node be70 tween all the layers are recorded from the sampling method and are further utilized for important SNP selection using the steps in the following section. 3.2.3.3 SNP Identification The following steps are followed to rank the SNPs according to their significance for both regression and classification: 1. Extract all the sampled weights from NUTs sampling corresponding to the predictor. 2. Compute the posterior SD and posterior mean of the weights corresponding to each predictor across all samples or observations. 3. Remove all the weights whose posterior mean value is equal to zero. 4. Calculate the CoV for each sampled weight using equation 3.21. 5. Rank the weights based on their CoV values. Weights with lower CoV values are ranked higher and considered more informative. 6. Extract the top 10 SNPs corresponding to the weights ranked based on the lowest CoV. Based on this analysis, the top 10 SNPs are selected from the BNN model and are used for further analysis. The pairs plot, trace plot, and auto-correlation function plot are used for MCMC diagnosis to assess the significance of the top 2 output SNPs using the shinystan R package (Gabry [2020]). The selected top 10 SNPs are used to identify genes and are compared with the output of GWAS software to draw biological interpretations. 71 This chapter provided a detailed description of penalized-based and NNbased research, including the research design for implementing penalized and NN models, the theory of different penalized and NN models used, and a detailed description of the workflow with improvements. 72 Chapter 4 Results This chapter presents detailed results divided according to two major research methodologies: Machine and Deep Learning for GWAS. The following subsections within the two sections provide results based on three types of phenotypes: binary, continuous, and categorical. 4.1 Machine Learning for GWAS This study evaluates the performance of five penalized models and PentaPen by assessing their metrics displayed in Figures 4.1, 4.2, and 4.3. These figures are formed using Tables A.1, A.2, and A.3 which show the exact values of performance metrics in the Appendix A. Further, the findings from these figures are explained in subsections 4.1.1, 4.1.2, 4.1.3, and 4.1.4. 73 (a) Ridge, LASSO, Elastic Net, and PentaPen using all SNPs as input. (b) Group LASSO and SGL using all SNPs as input. (c) Group LASSO and SGL using pooled SNPs as input. Figure 4.1: Comparison of five penalized models among themselves and with PentaPen using all SNPs as predictors for the binary phenotype, Anthocyanin. Comparison of Group LASSO and SGL among themselves using 74 pooled SNPs. The performance metrics are recorded for both training and testing sets. (a) LASSO, Elastic Net, and PentaPen using all (b) Ridge, Group LASSO, and SGL using all SNPs as input. SNPs as input. (c) Group LASSO and SGL using pooled SNPs as input. Figure 4.2: Comparison of five penalized models among themselves and with PentaPen using all SNPs as predictors for the continuous phenotypes, Width and DTF. Comparison of Group LASSO and SGL among themselves using pooled SNPs. The performance metric, R-squared, displayed in this figure is recorded for both training and testing sets. The evaluation using RMSE can be done using Tables A.1, A.2, and A.3 in Appendix. 75 (a) LASSO, Elastic Net, and PentaPen using all (b) Ridge, Group LASSO, and SGL using all SNPs as input. SNPs as input. (c) Group LASSO and SGL using pooled SNPs as input. Figure 4.3: Comparison of five penalized models among themselves and with PentaPen using all SNPs as predictors for the categorical phenotype, Germination Days. Comparison of Group LASSO and SGL among themselves using pooled SNPs. Accuracy is recorded for both training and testing sets. One of the objectives of this research is to systematically compare each penalized model based on input SNPs to gain insights into their strengths and limitations before including them in the proposed workflow. Further, after forming PentaPen, the superiority of that workflow is eval76 uated using (a) the performance of the workflow against all five penalized models using all SNPs, (b) the number of SNPs selected by each penalized method, and (c) the computational time with every single model (Ridge, LASSO, and Elastic net). The important SNPs and computational time for each physical trait observed in A. Thaliana by the penalized models and PentaPen are displayed in Tables 4.1 and 4.2. The subsections are organized by phenotype type to explain the results in detail. Table 4.1: Number of important SNPs selected by penalized models, SNP Pooling, and PentaPen (the proposed workflow). The SNP Pool was used as the input for Group LASSO and SGL in the proposed workflow using penalized models. Phenotypes Ridge LASSO Elastic SNP PentaPen Net Pooling 77387 90565 58 Anthocyanin 77690 77267 Width 74660 2 12 74660 4 DTF 77665 80 102 77671 77 Germ 74606 2 16 74606 5 77 Table 4.2: Computation time (in seconds) of penalized methods and the workflow using penalized methodologies Phenotypes Ridge LASSO Elastic PentaPen Net Anthocyanin 130 25 26 161 Width 109 29 28 142 DTF 104 79 51 169 Germ 113 27 24 144 When comparing the results of PentaPen, with GWAS software, GAPIT, and TASSEL were used. The Bonferroni correction was utilized in GWAS software output to find the important SNPs for AtPolyDB and F1-hybrids data. With the Bonferroni correction, SNPs were selected ranging from 13 − 97. To make the comparison feasible and consistent among all the phenotypes, the top 10 SNPs are compared with the results of the proposed workflow. Tables 4.3 and 4.4 consist of the top 10 SNPs selected from GAPIT and TASSEL analysis for all the phenotypes for this research. The results for comparison are discussed in the subsections 4.1.1, 4.1.2, and 4.1.3. 78 Table 4.3: Top 10 SNPs for all the phenotypes using GAPIT. The highlighted SNPs are shared SNPs with TASSEL except DTF and Germ having no shared SNPs. Anthocyanin Width DTF Germ Chr1 16930622 Chr4 17569323 Chr4 7749720 Chr3 11018956 Chr1 14007403 Chr3 15742125 Chr4 7749144 Chr4 567760 Chr4 1503923 Chr4 17569775 Chr4 7748780 Chr3 11018472 Chr1 14051264 Chr4 17570680 Chr4 5208011 Chr2 8448286 Chr1 14142053 Chr3 11052540 Chr4 5208139 Chr2 8466240 Chr1 14053491 Chr3 3504771 Chr3 7164713 Chr1 17371303 Chr1 14028301 Chr4 17572283 Chr3 7164675 Chr5 7147445 Chr1 14015352 Chr4 17571262 Chr3 17869230 Chr4 6786088 Chr2 18811115 Chr3 3687421 Chr3 5693257 Chr1 6868571 Chr3 2602503 Chr3 16092490 Chr1 1327977 Chr5 6100537 Table 4.4: Top 10 SNPs for all the phenotypes using TASSEL. The highlighted SNPs are shared SNPs with GAPIT with the exception of DTF and Germ having no shared SNPs. Anthocyanin Width DTF Germ Chr1 12663142 Chr1 4636782 Chr4 10151382 Chr3 11018956 Chr1 16930622 Chr1 4637778 Chr4 12166531 Chr5 18308060 Chr1 14028301 Chr3 15742125 Chr4 12166556 Chr5 24425444 Chr1 14053491 Chr4 17569323 Chr3 2176300 Chr3 11018472 Chr1 14142053 Chr1 7431064 Chr3 2176302 Chr1 17371303 Chr1 8722354 Chr4 17570680 Chr1 22979953 Chr5 9414526 Chr1 14007403 Chr1 9804463 Chr2 526009 Chr4 8590325 Chr2 18677631 Chr4 17569775 Chr2 526254 Chr1 21874586 Chr1 13378174 Chr3 11052540 Chr1 23692187 Chr5 24431218 Chr1 19430888 Chr4 17572283 Chr1 5552218 Chr5 24432490 79 We further validate the results of PentaPen to locate genes from the SNPs. This was carried out by locating the corresponding chromosome and its base pair position of the particular SNP. The corresponding gene was then recorded using the Gene Model from the TAIR website. One instance of the identified SNP and its corresponding gene can be represented from the TAIR website 3 . The Tables 4.5, 4.6, 4.7, and 4.8 show the top 10 SNPs selected from the proposed workflow for each phenotype with their Chromosome base pair positions, corresponding gene, and gene function. The findings from these Tables are discussed in the following subsections organized based on phenotypes. 4.1.1 Binary Phenotype The performance metrics of Anthocyanin to perform classification for train and test sets are Precision, Recall, F1-score, and AUC. The performance of the model enhances with the increase in the value of performance metrics. However, the wider gap between the values in the train and the test sets suggests over-fitting in the model. Although there is no generally acceptable gap that shows over-fitting in the model, in this study anything above 10% difference is considered as potential over-fitting due to the lesser samples of the data set (Park et al. [2021]). From Figure 4.1, the observations are discussed while the exact numbers are listed from the Tables A.1, A.2, and A.3 in the Appendix. From Figures 4.1a and 4.1b (Tables A.1 and A.2 in the Appendix), it can 3 https://jbrowse.arabidopsis.org/index.html?data=Araport11&loc=Chr1%3A1 217530..1221059&tracks=TAIR10_genome%2CA11-GL-Oct22%2CA11-PC-Oct22%2CSAL K_tDNAs&highlight=. 80 be observed that among the five penalized models, Ridge, Group LASSO, and SGL have larger Precision, Recall, and F1-score values (0.88 to 0.99 for train and 0.77 to 0.85 for the test). However, the larger gap (0.11 to 0.23) between the train and test set shows potential over-fitting in Group LASSO and SGL. For Ridge, the gap (0.07 to 0.10) does not show much evidence of over-fitting. For LASSO and Elastic net, the Precision, Recall, and F1-score values (0.74 to 0.77 for both train and test) and gap (0.006 to 0.019) between them for train and test set are almost similar. Therefore, we think that both models perform similarly based on the evaluation metrics and no evidence of overfitting. The AUC values were larger (1 for train and 0.895 for test) for Ridge as compared with that for LASSO and Elastic net (having values as 0.942 & 0.977 for train and 0.897 & 0.959 for the test) but the difference was also more (0.10) for Ridge than LASSO and Elastic net (0.018 and 0.045). The test AUC values for Group LASSO (0.499) and SGL (0.516) show that the models are unsatisfied classifiers whereas Ridge’s AUC score depicts that it is a good classifier. Finally, PentaPen has similar Precision (0.849 for the train and 0.75 for the test), Recall (0.806 and 0.701 for the train and test), and F1-score (0.827 and 0.725 for the train and test) to LASSO and Elastic net using all SNPs. It also shows a similar gap (0.09 to 0.10) to Ridge using all SNPs between train and test sets while a smaller gap than Group LASSO and SGL using all SNPs. The proposed workflow with a high AUC score for both its training and testing phases indicates its capability to classify data accurately. Further, the PentaPen reduces potential over-fitting by decreasing the gap when compared with Group LASSO and SGL using all SNPs. From Table 4.1, for instance, for Anthocyanin, Ridge accounts for 77690 81 SNPs out of all the SNPs before SNP pooling. This might be because the Ridge evaluates each SNP individually. For binary phenotype (Anthocyanin), LASSO and Elastic Net record a similar number of SNPs as Ridge because the SNPs may be more correlated to the phenotype. PentaPen produces 58 SNPs for Anthocyanin. Ridge, LASSO, and Elastic Net identifies SNPs in the range of 77000 to 77700. However, due to the sparsity in SGL, PentaPen identifies a reduced number of important SNPs. According to the results from Tables 4.3 and 4.5, the binary phenotype did not show any shared SNPs but had a shared gene between GAPIT and PentaPen’s output. PentaPen (Table 4.5) had one shared gene, AT3G08970, with GAPIT. Tables 4.4 and 4.5 find that the binary phenotype had no shared SNPs or genes when TASSEL and PentaPen’s outputs are compared. This could be due to PentaPen using five different penalized models to find reasonable numbers of important SNPs whereas GAPIT and TASSEL use GLM (Nelder and Wedderburn [1972]). When looking at the two GWAS programs, it showed that TASSEL had 4 shared SNPs (Chr1 16930622, Chr1 14007403, Chr1 14142053, Chr1 14053491) and a shared gene (AT1G36990) with GAPIT. PentaPen selected an SNP through the classification model with a gene, AT1G56650, which could be potentially related to Anthocyanin ( presence of color in plants (Chen et al. [2022])). 4.1.2 Continuous Phenotype The performance metrics used for the continuous phenotypes to perform regression are R-squared and RMSE. The model’s performance is considered better when the difference between the train and test data is minimal, and 82 the values of R-squared and RMSE are higher and lower, respectively. From Figure 4.2, the observations are discussed while the exact numbers are listed from the Tables A.1, A.2, and A.3 in the Appendix. It was observed from Figures 4.2a and 4.2b (Tables A.1 and A.2 in Appendix) that for Width, although Ridge and SGL have higher R-squared train values (0.944 & 0.948) and with low test values (0.694 & 0.081) the difference between their values (0.25 & 0.87) is large. This shows potential over-fitting, leaving Ridge and SGL out of comparison. Although Ridge shows over-fitting, the gap of SGL is larger than Ridge; hence, Ridge overfits less than SGL. Comparing the other three models, Group LASSO has a low R-squared value (0.455 for train and 0.052 for testing), also showing over-fitting. Hence, comparing rest two models, both have approximately the same R-squared values of (0.82 & 0.808 for the train and 0.801 & 0.778 for the test), with a smaller difference (0.019 & 0.03). Using the RMSE values for analysis, it is evident from Figures 4.2a and 4.2b (Tables A.1 and A.2 in Appendix) that although Ridge has smaller values of RMSE but shows potential over-fitting due to a higher difference in train and test values with an exception of DTF from the F1-hybrids data set. Both LASSO and Elastic Net have smaller values for train and testing with a smaller gap between them. Using these results it can be concluded that LASSO and Elastic Net perform similarly. Similarly, for DTF, the superiority of PentaPen is evident from Figures 4.2a and 4.2b (Tables A.1 and A.2 in Appendix) that the R-squared values (0.866 and 0.753 for train and test sets) are higher than that of LASSO and Elastic net using all SNPs with the smaller gap (0.113) as compared with the models (Ridge (0.23), Group LASSO (0.503), and SGL (0.436) using all 83 SNPs) that posses potential over-fitting. Similarly, when compared with each penalized model using all SNPs, RMSE values for PentaPen were lower and the workflow showed reduced signs of over-fitting. Hence, PentaPen tends to perform better for both phenotypes than models possessing over-fitting and inhibits the beneficial properties of all the five penalized models. It is evident from Table 4.1 that for Width and DTF, Ridge accounts 74660 and 77665 SNPs out of all the SNPs before SNP pooling. For Width, LASSO records too few SNPs (2), whereas for DTF, 80 SNPs were recorded. Elastic net records 12 & 102 unique SNPs. Since different phenotypes may be influenced by different sets of key genetic variants leading to a smaller subset of SNPs for Width when compared with DTF. Unlike binary phenotype, the SNPs associated with continuous phenotypes may exhibit weaker correlations, allowing LASSO and Elastic Net to identify a smaller subset of relevant SNPs. Finally, PentaPen produces 4 SNPs, the union of SNPs selected by Group LASSO (3) and SGL (1). For the continuous phenotype (Width), PentaPen showed neither shared SNPs nor shared genes with GAPIT from Tables 4.3 and 4.6. A similar result was noted between TASSEL and PentaPen from Tables 4.4 and 4.6. From Tables 4.3 and 4.4, there were 2 shared SNPs (Chr4 17569323, Chr3 15742125) and 4 shared genes (AT3G43890, AT4G37370, AT3G29075, AT4G37380) when TASSEL and GAPIT are compared for Width while none for DTF. Only one SNP was selected by PentaPen for the width phenotype whose corresponding gene, AT1G65300, could potentially be responsible for plant growth and development (Waqas et al. [2020]). For the remaining continuous phenotype (Table 4.7) of the F1 hybrids data set, PentaPen had one shared SNP (Chr1 1327977) with GAPIT but none with 84 TASSEL. The gene AT2G28305 could potentially promote flowering which represents the function of phenotype DTF (Xiang et al. [2022]). 4.1.3 Categorical Phenotype The accuracy was used as the performance metric for Germination Days. The accuracy values are analyzed first to study the larger value and then look for the smaller gap in train and test values to avoid over-fitting. From Figure 4.3, the observations are discussed while the exact numbers are listed from the Tables A.1, A.2, and A.3 in the Appendix. It was observed from Tables A.1 and A.2 in Appendix or Figures 4.3a and 4.3b that Elastic net and LASSO have less over-fitting than Ridge, Group LASSO, and SGL as the gap between train and test sets using these three is wider (ranging from 0.17 to 0.27). However, Ridge has a smaller gap than Group LASSO and SGL. The accuracy of PentaPen for the test and train set is higher (0.772 and 0.741 respectively). Although the values are less than Ridge, LASSO, and Elastic net using all SNPs, the smaller gap (0.031) shows reduced over-fitting. These accuracy values of the workflow are higher than Group LASSO and SGL’s values, with a smaller gap between the train and test set. Hence, PentaPen utilized the strengths of the five methods and showed high accuracy values while a smaller gap, reducing over-fitting. Ridge accounts for 74606 SNPs out of all the SNPs before SNP pooling. LASSO, Elastic net, and PentaPen record similar SNPs (2, 16, & 5) as continuous phenotype, Width. For the categorical phenotype, germination days (Table 4.8), PentaPen 85 had no shared SNPs and shared genes with GAPIT and TASSEL (Tables 4.3 and 4.4). It showed no shared SNPs or genes when TASSEL and GAPIT are compared. Although the transcription factor of gene AT1G65300 identified by PentaPen could potentially mediate seed germination. 4.1.4 Evaluation of Group LASSO and SGL For binary phenotype (Anthocyanin), similar to the analysis in Figures 4.1a and 4.1b (Tables A.1 and A.2 in the Appendix), among Group LASSO and SGL (from Figure 4.1c or Table A.3 in the Appendix), Group LASSO is preferred for the classification of predictors with higher Precision, Recall, F1score, and AUC values, and the smaller gap in their train and test set. Using the test AUC values, it was noted that Group LASSO and SGL using filtered SNPs were better classifiers than those using all SNPs. Hence, including these two models using filtered SNPs in the proposed workflow, improves the classification accuracy of PentaPen and identifies a reasonable number of SNPs due to a reduced number of false negatives. Even when Group LASSO and SGL use a lesser number of inputs (or SNPs), still Group LASSO is noted to perform similarly to LASSO and Elastic net whereas SGL performs similarly to Ridge. For continuous phenotypes (Width and DTF), similar to the analysis in Figures 4.2a and 4.2b (Tables A.1 and A.2 in Appendix) among Group LASSO and SGL (from Figure 4.2c or Table A.3 in Appendix), Group LASSO is preferred over SGL because of its higher and smaller R-squared and RMSE values, respectively, and a small gap between the train and test set for both the phenotypes. It is evident that Group LASSO and SGL using filtered 86 SNPs (0.403 & 0.867) reduces the gap between the train and test R-squared values more than those using all SNPs (0.058 & 0.155). Including these models reduces the chances of over-fitting by PentaPen. For categorical phenotype (Germination days), from Table A.3 in Appendix or Figure 4.3c Group LASSO outperforms SGL with higher values (0.88 and 0.87 for train and test respectively) of Accuracy with a smaller difference (0.011). The gap is reduced for Group LASSO and SGL using filtered SNPs than those using all SNPs. Even using lesser features (or SNPs), Group LASSO and SGL still perform similarly to Ridge, LASSO, and Elastic Net. 4.2 Deep Learning for GWAS The results of the deep learning research work are presented in the following subsections, which are organized by different types of phenotypes. The study compares the predictive performance of CNN with BayesDL and assess their test metrics presented in Figure 4.4 (refer to Table B.1 in Appendix B for exact values). In this study, both NNs are trained using a 50% train set, and a 50% test set is used for prediction to record performance metrics. Later, the role of the train and test sets is swapped and new test performance metrics are recorded. This results in two test splits- Split 1 and Split 2 - while reporting the findings. The above results aid in assessing the superiority of BayesDL over the existing model. A few more findings of the proposed workflow are noted to gain insights into the advantages of BayesDL. 87 (a) Anthocyanin (b) Width and DTF (c) Germ Figure 4.4: Comparison of BayesDL with a deep learning model, CNN, for all the phenotypes. The two-split test performance metrics are recorded for comparison. 88 Table 4.9 displays the computation time of BayesDL, for each physical trait, observed in A. Thaliana which aids in noting that in spite of the extensive computational time required by Bayesian when combined with deep learning, the workflow has advantages over existing deep learning models. Table 4.9: Computation time (in seconds) of proposed workflow based on BNN across all phenotypes Anthocyanin Width DTF Germ 2843 3097 3540 3384 The preliminary posterior samples check was aimed to make comparisons between the observed prior and the posterior draws from BayesDL. The null hypothesis is that apriori the SNPs (or features) are insignificant with their corresponding weights. The result for Anthocyanin, Width, DTF, and Germination Days is displayed in Figure 4.5. These findings aid in concluding that the SNPs are identified with more confidence as the prior resulted in an important posterior of the SNP. The top 10 SNPs output from BayesDL is validated to locate corresponding genes. Similar to the validation done in the penalized-based study, it was done by locating the chromosome base pair position of the particular SNP and then its corresponding gene using the Gene Model from the TAIR website. Tables 4.10, 4.11, 4.12, and 4.13 display top 10 SNPs for each phenotype and their Chromosome base pair positions, corresponding gene, and gene function. These are used to find shared SNPs and genes with GAPIT and TASSEL (Tables 4.3 and 4.4). 89 (a) 1st SNP for Anthocyanin (b) 2nd SNP for Anthocyanin (c) 1st SNP for Germ (d) 2nd SNP for Germ (e) 1st SNP for Width (f) 2nd SNP for Width (g) 1st SNP for DTF (h) 2nd SNP for DTF Figure 4.5: Plots for prior and posterior samples. The green line indicates the prior distribution while the purple line refers to the distribution of posterior samples. 90 4.2.1 MCMC Diagnostics To determine the proposal values, the NUTS (an adaptive form of MetropolisHastings) algorithm was run for a total of 2000 iterations where the first 1000 iterations were used as burn-in. Four different chains were generated for the 2000 iterations where values from the first 1000 iterations were thrown away. Three plots were plotted for all the selected phenotypes using the weights as functions corresponding to the top 2 selected predictors (SNPs). Figures 4.6, 4.7, and 4.8 display the output of all the selected phenotypes. Using RStan, the auto-correlation function (ACF) plots for each phenotype for all chains separately and each chain 35-lags apart can be observed in Figure 4.6. It can be observed that the generated MCMC iterations are reasonably correlated for Anthocyanin, Germination days, and Width with certain small dips few lags apart when considering each chain separately. This indicates that all four chains are highly convergent for these three phenotypes displaying the SNPs as significant features to be included for further analysis. While from Figure 4.6d it is evident that all the chains are positively convergent and all chains reached stationarity for SNP42751. However, for SNP31600, chains 1, 2, &4 reach stationarity within 10-lags apart whereas chain 3 reaches stationarity towards the end of the lags after some gradual geometrical decline. The trace plots of the generated parameters were recorded to check whether the Markov Chains reached stationarity to show the parameters’ convergence to an optimal value. The convergence of the chains for the top 2 SNPs of all phenotypes was verified using RStan. The trace plots after the 1000 iterations or after the burn-in was documented. Figure 4.7 shows that 91 (a) Anthocyanin (b) Germination Days (c) Width (d) DTF Figure 4.6: Auto-correlation functions using RStan against MCMC iterations. Functions of weights corresponding to the top 2 predictors (or SNPs) for all the phenotypes. The x-axis and y-axis represent Lags and Autocorrelation values respectively. the stationarity for MCMC iterations was achieved for both the parameters (or SNPs) in Anthocyanin, Width, DTF, and Germination days. For all the phenotypes, all the chains are mixed well and contribute to stationarity. However, for DTF (see Figure 4.7d), the chain gets stuck at the 200 and 900 iterations. It is suggested to introduce more iterations and thinning in the MCMC model to improve the convergence. 92 (a) Anthocyanin (b) Germination Days (c) Width (d) DTF Figure 4.7: Trace plots using RStan against post-warmup MCMC iterations. Here are the functions for the top 2 predictors (or SNPs) across all phenotypes and the y-axis represents the value of weights corresponding to each SNP. The trace plots assess the convergence, stability, and distribution of the weights throughout the MCMC sampling process. This study was dealing with RStan hence, histograms in Figure 4.8 display the distributions for the features (or SNPs) using hyper-parameter values. This aids in making a reasonable comparison of whether the model gives the approximate posterior distributions. The parameters for Anthocyanin, Germination days, Width, and DTF have close enough symmetrical distributions. If the SNPs have a weight value of 0 and they lie in the low-density 93 region, then those SNPs are more important. In fact, the scatter plots in Figure 4.8 show that these parameters are not correlated and were randomly chosen for all the phenotypes. (a) Anthocyanin (b) Germination Days (c) Width (d) DTF Figure 4.8: Posterior uni-variate distributions along the diagonal and bivariate distributions along the off-diagonal using RStan against MCMC iterations. These functions represent the top 2 predictors (or SNPs) for each phenotype. The x-axis for each posterior distribution is the value of weights corresponding to the input. Whereas the scatter plot helps to check the correlation between the SNPs. The potential scale reduction factor (PSRF) is computed by a popular statistic, Rhat. This compares between and within-chain estimates to con94 firm the convergence. The R-hat values less than 1.05 indicate that the chains mix well (Luo and Jiao [2018]). The R-hat values for the selected SNPs of Anthocyanin, Germination days, Width, and DTF are 1, 1, 1.01, and 1.01 respectively. Looking at these values, it can be concluded that the model convergence was reached for all the phenotypes. Finally, the MCMC diagnosis indicates that important SNPs were identified for all phenotypes. 4.2.2 Binary Phenotype Figure 4.4a or Table B.1 in Appendix shows the two test evaluation metrics for making comparisons of deep learning methodologies (NNs). The larger the value of the metric, the better the model performs for the phenotype. Reducing the difference between testing values indicates that the model is avoiding over-fitting. For Anthocyanin, it is clearly evident that BNN outperforms CNN with high values of Precision (0.87 and 0.8708), Recall (0.8656 and 0.8531), F1-score (0.8677 and 0.8607), and AUC (0.8749); BNN has a lesser gap in both splits. A high recall score indicates less number of False negatives which is an advantage of using BayesDL over CNN. It can be noted from Figures 4.5a and 4.5b that both SNPs from Anthocyanin’s BayesDL results form analogous symmetric distributions with a large shift from the standard normal distribution (the prior), depicting that both the SNPs (or features) are important. This concludes that data provided evidence against the null hypothesis and aposteriori the SNPs (or features) are important for binary phenotypes. 95 Table 4.10 shows the selected SNPs output from BayesDL for Anthocyanin. This binary phenotype showed Chr4 1503923 and AT4G03415 as a shared SNP and gene respectively with GAPIT which is evident from Tables 4.3 and 4.5. However, there were no shared SNPs or genes with TASSEL (see Tables 4.4 and 4.10). The genes AT5G57670 (Luo et al. [2017]), AT4G03415, and AT4G04920 (Iorizzo et al. [2019]) regulate the production of Anthocyanin in Arabidopsis thaliana. 4.2.3 Continuous Phenotype R-squared and RMSE are used for comparing the performance of the regression NNs; their values are recorded in Figure 4.4b or Table B.1 in Appendix. The higher and lower values of R-squared and RMSE respectively, the better the performance of the NN model. A wider gap between split values increases the risk of over-fitting. For both phenotypes, it is noticed that BayesDL outperforms CNN. For instance, looking at Width values, R-squared is higher for BayesDL (0.8313 and 0.836) than for CNN (0.5503 and 0.6305) with a lesser gap (0.0047) in both the splits. The RMSE values are lower in BNN (1.4254 and 1.5794) than those recorded for CNN (1.9195 and 2.0317) with CNN RMSE values’ gap (0.1127) larger than that for BayesDL (0.154). Figures 4.5e, 4.5f, 4.5g, and 4.5h demonstrate that BayesDL results for Width and DTF have symmetric distributions that deviate significantly from the standard normal distribution, indicating that both SNPs (features) are important. These findings provide evidence against the null hypothesis and support the significance of the SNPs (features) for all continuous phenotypes. Tables 4.3, 4.11, and 4.12 display no shared SNPs or genes for any of 96 the two continuous phenotypes. A similar result was noted from Tables 4.4, 4.11, and 4.12 when compared with TASSEL. This is a possibility as GAPIT and TASSEL use GLM whereas BayesDL uses probabilistic models having various distributions for SNP identification. The results of the BNN model for Width and DTF are displayed in Tables 4.11 and 4.12 respectively. The genes AT1G51120 and AT4G09350 are responsible for the continuous development of leaf width (Franco-Zorrilla et al. [2014]). The most relevant gene for DTF is AT1G59940 which is responsible for flowering in the species (Hwang et al. [2002]). 4.2.4 Categorical Phenotype Figure 4.4c or Table B.1 in Appendix records the two test accuracy scores to make comparisons of NN methods for germination days. The higher the accuracy, the better the performance of the model. For the categorical phenotype, the BayesDL exhibits superior performance compared to the CNN, achieving two split accuracy scores of 0.9241 and 0.9334. Figures 4.5c and 4.5d display that both SNPs from Germination days’ BayesDL results have symmetric distributions with a large shift from the standard normal distribution. This finding provides evidence against the null hypothesis, concluding that the SNPs are important features. Table 4.13 displays the results of categorical phenotype. It neither showed a shared SNP nor a shared gene with GAPIT and TASSEL (see Tables 4.3 and 4.4). However, this phenotype contributes to the germination of the plant and a few genes (AT1G03890, AT3G31950, and AT4G35950) were found responsible for it (Nagano et al. [2001], Maruta et al. [2021]). 97 Table 4.5: SNP Validation for Anthocyanin. The table displays the top 10 SNPs reported by PentaPen. Bold values of genes and SNPs have the true function of the phenotype. SNPs Chr bp Gene SNP41810 Chr1 24362213 AT1G65540 Gene Function LETM1-like protein. No function found. TAIR. SNP18139 Chr1 10547614 AT1G30070 SGS domain-containing protein. No function found. TAIR. SNP25069 Chr1 14026469 AT1G36980 transmembrane 50A-like protein. No function found. NCBI. SNP81693 Chr3 725078 AT3G03140 histone binding, protein binding. NCBI. SNP34101 Chr1 21234054 AT1G56650 Production of Anthocyanin Pigment. TAIR. SNP35159 Chr1 21692775 AT1G58390 Disease resistance protein (CC-NBS-LRR class) family. ADP binding. TAIR. SNP36875 Chr1 22492447 AT1G61070 Predicted to encode a PR (pathogenesis-related) pro- tein. TAIR. SNP26674 Chr1 16547538 No gene − SNP85074 Chr3 2738646 AT3G08970 Can compensate growth defect. for the Also shows similarity to HSP40 proteins and is induced by heat stress. NCBI. SNP45199 Chr1 26199451 No gene − 98 Table 4.6: SNP Validation for Width. The table displays the top 10 SNPs reported by PentaPen. The SNPs and genes which are bold, show the true function of Width. SNPs Chr bp Gene Gene Function SNP51671 Chr1 30250888 No gene − SNP41509 Chr1 24255390 AT1G65300 DNA-binding transcription factor activity. Gene Ontology. SNP18948 Chr1 11016927 No gene − SNP98767 Chr3 9842540 No gene − 99 Table 4.7: SNP Validation for DTF. The table displays the top 10 SNPs reported by PentaPen. The highlighted genes and SNPs show the true function of DTF. SNPs Chr bp Gene SNP47857 Chr1 25311984 AT1G67540 Gene Function Protein Associated with Lipid Droplets. Gene Ontology. SNP11327 Chr1 7300804 AT1G20950 Phosphofructokinase family protein. ATP binding. NCBI. SNP10090 Chr1 6608183 AT1G19120 mRNA. No function found. TAIR. SNP76193 Chr2 12721604 AT2G29790 Encodes a Maternally expressed gene (MEG) family protein. NCBI. SNP30293 Chr1 19149817 AT1G51640 enables phosphatidylinositol4,5-bisphosphate binding. TAIR. SNP58484 Chr2 835205 No gene SNP74899 Chr2 12082004 AT2G28305 − enables hydrolase activity. TAIR. SNP2195 Chr1 1219863 AT1G04490 hypothetical protein. No function found. TAIR. SNP102004 Chr3 9590645 AT3G26200 mRNA. iron ion binding. NCBI. SNP130361 Chr4 5198089 AT4G08250 DNA-binding transcription factor activity. TAIR. 100 Table 4.8: SNP Validation for Germination Days. The table displays the potentially important SNPs reported by PentaPen. The gene and SNPs showing the true function of the phenotype are highlighted. SNPs Chr bp Gene Gene Function SNP51671 Chr1 30250888 No gene − SNP41509 Chr1 24255390 AT1G65300 DNA-binding transcription factor activity. Gene Ontology. SNP13281 Chr1 7797023 AT1G22090 Protein of unknown function. Gene Ontology. SNP34187 Chr1 21325557 AT1G57580 F-box family protein of unknown function. NCBI. SNP62673 Chr2 7129923 AT2G16440 Protein binding and DNA replication. NCBI. 101 Table 4.10: SNP Validation for Anthocyanin. The table displays the top 10 SNPs reported by BayesDL. The highlighted genes and SNPs are found to show characteristics of Anthocyanin. SNPs Chr bp Gene Gene Function SNP89113 Chr3 5158241 No gene − SNP207626 Chr5 23362168 AT5G57670 Enables protein ser- ine/threonine kinase activity. TAIR. SNP27368 Chr1 16970522 AT1G44900 A protein essential to embryo development. Overexpres- sion results in altered root meristem function. NCBI. SNP129572 Chr4 2500993 AT4G04920 Enables transcription coregulator activity. TAIR. SNP161227 Chr5 205246 AT5G01510 Protein of unknown function. NCBI. SNP25224 Chr1 14138318 AT1G37113 Hypothetical protein. No function found. NCBI. SNP197173 Chr5 18231342 AT5G45110 NPR1-like protein 3. En- ables identical protein binding, protein binding, and salicylic acid binding. TAIR. SNP89096 Chr3 5144422 AT3G15280 No function found. NCBI. SNP27389 Chr1 16980256 AT1G44910 Enables RNA binding, RNA polymerase binding. NCBI. SNP127389 Chr4 1503923 AT4G03415 Protein phosphatase interacts with AGB1 and is localized to the plasma mem102 brane. TAIR. Table 4.11: SNP Validation for Width. The table displays the top 10 SNPs reported by BayesDL. The bold values of SNPs and genes are responsible for the observed trait. SNPs Chr bp Gene Gene Function SNP210343 Chr5 24959340 No gene − SNP56046 Chr2 2309063 AT2G05970 F-box family protein with a domain of the unknown function (DUF295). TAIR. SNP52086 Chr2 100349 No gene SNP30943 Chr1 18938157 AT1G51120 − Enables DNA-binding transcription factor activity. TAIR. SNP134906 Chr4 5931715 AT4G09350 Chaperone DnaJ-domain superfamily protein. No function found. NCBI. SNP57073 Chr2 2994278 AT2G07212 Transposable element gene. No function found. NCBI. SNP51689 Chr1 30259387 AT1G80480 Plastid transcriptionally active 17. No function found. TAIR. SNP178757 Chr5 9128487 AT5G26130 CAP (Cysteine-rich secretory proteins). No function found. NCBI. SNP144027 Chr4 9309784 No gene SNP100922 Chr3 10838525 AT3G28840 − Hypothetical (DUF1216). protein No tion found. TAIR. 103 func- Table 4.12: SNP Validation for DTF. The table displays the top 10 SNPs reported by BayesDL. There is only one gene or SNP that is associated with DTF, which is highlighted in the table. SNPs Chr bp Gene Gene Function SNP42751 Chr1 23756683 No gene − SNP31600 Chr1 19709266 AT1G52910 SNP96314 Chr3 6498906 No gene − SNP198764 Chr5 22834746 No gene − SNP51715 Chr1 27340888 No gene − SNP57031 Chr2 317144 AT2G01720 No function found. TAIR. Ribophorin I. enables protein binding. NCBI. SNP50026 Chr1 26410107 AT1G70130 Enables kinase activity, protein serine/threonine kinase activity. TAIR. SNP194264 Chr5 20105999 AT5G49540 Rab5-interacting family protein. No function found. NCBI. SNP38498 Chr1 22602399 No gene SNP36437 Chr1 22066825 AT1G59940 − phosphorelay response regulator activity, protein binding. TAIR. 104 Table 4.13: SNP Validation for Germination Days. The table displays the top 10 SNPs reported by BayesDL. The highlighted genes and SNPs are found to be associated with the phenotype. SNPs Chr bp Gene SNP80296 Chr2 19554169 AT2G47700 Gene Function Enables protein binding, ubiquitin-protein transferase activity. TAIR. SNP77096 Chr2 17432976 AT2G41790 Enables metalloendopepti- dase activity. NCBI. SNP152576 Chr4 13295619 No gene SNP6672 Chr1 3990189 AT1G03890 − Enables nutrient reservoir activity. TAIR. SNP189759 Chr5 15364726 AT5G38386 F-box/RNI-like superfamily protein. No function found. NCBI. SNP7494 Chr1 4392379 AT1G12890 Enables DNA-binding transcription factor activity, protein binding. TAIR. SNP76252 Chr2 16915416 No gene SNP105241 Chr3 12951052 AT3G31950 − Nucleic acid-binding/zinc ion-binding protein. TAIR. SNP158329 Chr4 17025614 AT4G35950 GTP binding, GTPase activity, protein binding, protein kinase binding. NCBI. SNP80122 Chr2 19469107 No gene − 105 Chapter 5 Discussion and Conclusion Two workflows are proposed to find the important SNPs. Users can identify reduced numbers of SNPs using the PentaPen (a penalized-based workflow). By using the optimal number of hidden layers and units without changing their activation functions for each run, researchers can employ BayesDL (a BNN-based workflow) to identify important SNPs. The SNPs output from both the workflows were utilized to find the shared SNPs with each other and GWAS software (GAPIT and TASSEL). Using the final top 10 SNPs from both the workflows, the SNP validation was carried out. The superiority of PentaPen and BayesDL was confirmed by comparing the performance of the proposed workflows with some of the existing models based on performance metrics. The study provides a guideline for the researchers to choose an appropriate workflow for their research with proposed workflows in terms of identified SNPs, data dimensionality, model complexity, and prior distributions (in BayesDL). 106 5.1 Identified SNPs This section covers the biological interpretations and conclusions of the identified SNPs from individual penalized models, PentaPen, and BayesDL. As reported in Table 4.1, penalized models using all SNPs report a larger number of SNPs for the binary, continuous, and categorical phenotype than PentaPen. This can be a relevant observation since Ridge and LASSO models, hence, Elastic net, are dependent on trait architecture (Seymour et al. [2016]). Ridge shrinks the features’ coefficient to 0 but does not reduce the features. Due to the sparsity in SGL, the union of SNPs selected from Group LASSO and SGL gave a reduced number of important SNPs; this is more promising because PentaPen identifies SNPs by combining the essential properties of feature selection from all the five models. PentaPen helps to reduce the impact of noise and variability in the data, making PentaPen less susceptible to over-fitting. It can be concluded that PentaPen selects reasonable numbers of SNPs compared to single penalized models as they either give too many (like in Ridge) or too few (as in SGL) SNPs which can not be useful for biological interpretation. Hence, PentaPen was able to leverage the beneficial properties of the five penalized models by choosing SNPs and showing similar prediction performance when compared with the five models. According to the results, the final set of outputs from both the proposed workflows had neither shared SNPs nor shared genes when compared with each other. This is possible due to various reasons: different statistical models, different numbers of input predictors, and model complexity. PentaPen’s output did not show shared SNPs or genes with GAPIT or TASSEL 107 for Width (continuous phenotype) and Germination days (categorical phenotype); whereas, there was a shared gene and SNP with GAPIT but not with TASSEL for Anthocyanin (binary phenotype) and DTF (continuous phenotype) respectively. BayesDL’s output for Anthocyanin (binary phenotype) showed a shared SNP and gene with GAPIT but not with TASSEL while other phenotypes did not have shared SNPs or genes with both GWAS software. However, there were SNPs and genes shared between GAPIT and TASSEL for Anthocyanin (binary phenotype) and Width (continuous phenotype) except for DTF (continuous phenotype) and Germination days (categorical phenotype). This is possible because GAPIT and TASSEL employ GLM, whereas PentaPen uses five distinct penalized approaches, and BayesDL utilizes various probabilistic models and distributions to identify the potentially important SNPs. Hence, it is advised to locate the potentially important SNPs using both the GWAS software and our study workflows because they use different statistical methods, pre-processing steps, hyper-parameter choices, and the inclusion of prior knowledge. PentaPen, BayesDL, and both GWAS software possessed some of the genes necessary for the phenotypic traits found in Arabidopsis thailana species. These techniques can be applied to a phenotype of this species and are thought to be complementary to one another. Further, it is also recommended to analyze multiple instances of the same phenotype (binary, continuous, or categorical) from the workflows to increase the reliability of results for identifying shared SNPs or genes with GWAS software. The SNP validation proved that PentaPen and BayesDL were able to find a few promising SNPs which showed the function of that phenotype; this concludes that both the workflows perform similarly in terms of identified SNPs. 108 5.1.1 Linkage Disequilibrium The non-random association of alleles at two or more loci is called Linkage disequilibrium (LD). The LD was not removed while pre-processing the data due to the removal of SNPs with high LD during the SNP calling step. A few identified SNPs from PentaPen and BayesDL resulted in LD with each other. For example, in DTF (continuous phenotype) Chr4 5198089 is highly correlated with Chr1 26410107. Since LD is measured as squared correlation, this leads to redundant information from the cluster of highly correlated SNPs. These SNPs may not contribute independently to the workflow’s performance which further impacts the interpretability of the results. Despite the mentioned limitations, LD patterns reflect the genetic architecture of the studied population and inform about the genetic regions with high LD which further aids in identifying rare variants. However, it was found that among the SNPs identified by PentaPen for Anthocyanin (binary phenotype), there were only two highly correlated SNPs (Chr1 22492447 and Chr3 2738646). Similarly, for Anthocyanin, BayesDL identify two highly correlated SNPs (Chr3 5158241 and Chr5 205246). For other phenotypes, both PentaPen and BayesDL do not identify any highly correlated SNPs. 5.2 Models’ Performance This section covers the interpretation and conclusion of each penalized model, PentaPen, and BayesDL. The rigorous comparison of penalized models using all SNPs depicts that 109 Ridge outperforms Group LASSO and SGL whereas LASSO and Elastic net should be preferred over Ridge, Group LASSO, and SGL for classifying/predicting the groups/responses accurately. LASSO and Elastic Net demonstrate the efficacy of a good classification/regression model evident from the findings across all the phenotypes. Okser et al. [2014] also showed that LASSO and Elastic Net have similar prediction behavior for two wholegenome SNP data. However, Romagnoni et al. [2019] found that Ridge, LASSO, and Elastic Net provide similar results with optimized evaluation metrics on the Immunochip data set. For all the phenotypes, the test scores of Group LASSO and SGL using all SNPs indicate that they perform worse than Ridge. Including Ridge in the union for the input set of Group LASSO and SGL using filtered SNPs provides a larger set of input variables which helps reduce the risk of over-fitting and improve the model’s performance. Between the results of Group LASSO and SGL using filtered SNPs, the former generally perform better, as shown in Table A.3 in Appendix or Figure 4.3. But SGL using filtered SNPs has group-wise and within-the-group sparsity (Simon et al. [2013]) and can potentially benefit PentaPen. SGL was discovered to provide comparatively lower testing metrics value than Group LASSO. Instead, Group LASSO using filtered SNPs offered less variance between the values of the training and testing parameters. It was also noticed that even if Group LASSO and SGL use a lesser number of inputs (or SNPs) when trained to get the final output of PentaPen than other penalized methods using all SNPs, they perform equally well as Ridge, LASSO, and Elastic net & even better than Group LASSO and SGL using all SNPs. This finding leads us to the conclusion that PentaPen uses models that reduce the bias and variance of their models, and obtain more stable estimates of the coefficients. 110 It was found that PentaPen performs similarly to Ridge, LASSO, and Elastic Net based on its evaluation metrics while producing reduced numbers of SNPs for the binary phenotype. For continuous and categorical phenotypes, PentaPen still performs similarly to LASSO and Elastic Net. However, for binary phenotype, PentaPen outperforms Group LASSO and SGL using all SNPs while reducing over-fitting evident in Group LASSO and SGL. For continuous and categorical phenotypes, PentaPen outperforms Ridge along with Group LASSO and SGL using all SNPs. This further guarantees that PentaPen could combine the strengths of five models to make better predictions and identify SNPs with confidence. Hence, based on the performance metrics calculated in this study, it can be concluded that PentaPen is superior to some of the existing methodologies with the advantages of combined strengths of the models used. A thorough comparison of deep learning approaches depicts that the BNN model should be chosen above CNN, for both classification and regression. Similar superiority of BNNs over traditional models was noted by Beam et al. [2014]. Bayesian combined with deep learning is advantageous in various scenarios because they provide a framework for probabilistic modeling that allows for uncertainty quantification. BayesDL’s averaging of samples from the posterior distribution reduces variability and over-fitting, unlike traditional deep learning models that output a single prediction. This property enhances generalization to new data, making BayesDL more reliable than traditional models in limited sample size instances (Gal and Ghahramani [2016], Blundell et al. [2015]). It can also be concluded that with fewer samples in phenotypes, the BNN model was advantageous over CNN. The possible reasons for the better performance of BayesDL are that it is flexible in handling sparse data with a small sample size; in contrast, CNN requires 111 complete data (based on the data used in this study, the rows need to be replicated to make it a full data) to handle such data (Deist et al. [2018]). Additionally, BNNs incorporate prior knowledge or constraints, which can be helpful in biological applications where some knowledge about the data is available (Liu et al. [2019]). The use of prior information about the distribution of the parameters help to regularize the model, which can help prevent over-fitting and improve the performance of the model (Ghahramani [2015]). Hence, in conclusion, based on the performance metrics calculated in this study, the BayesDL is superior to CNN with advantages of the properties of Bayesian. 5.2.1 Controlling R-squared Errors: Impact on Performance Controlling R-squared errors during evaluation helps in gaining more reliable estimates of workflows’ performance. There are various ways to control R-squared errors in PentaPen and BayesDL. This can be done through crossvalidation by assessing the average performance of the workflows and mitigating the influence of over-fitting. Although cross-validation was utilized in PentaPen, this technique can be used in BayesDL to assess the impact. Regularization methods are also effective in controlling over-fitting and Rsquared errors. In PentaPen, the penalty terms can be tuned in individual models while in BayesDL, optimized prior variance can be used. This will result in finding a balance between model performance and generalizability. Further, controlling R-squared errors might enable unbiased comparison of different models, prevents over-fitting, and generalize better to unseen data. 112 5.3 Data Dimensionality and Model Complexity Table 5.1 displays the data dimensionality and model complexity of PentaPen and BayesDL (Li et al. [2018b]). Table 5.1: Comparison of two developed workflows based on the data dimensionality and model complexity. Here, n, p, and o denote the number of samples, predictors, and output dimensions or classes of the data set respectively. The number of folds, iterations, chains, hidden layers, and nodes in each layer is represented by k, i, c, m, and N respectively. ppool are the number of SNPs from SNP Pooling. The number of SNPs preliminary selected from the hypothesis test are given by pf ilter . Lastly, P is the total number of parameters of the workflow. Workflows PentaPen Data’s Model’s Total dimen- dimen- sion sion n× p n× p pa- rameters # of itera- Time com- plexity tions 3∗p+2∗ppool 5 O(k ∗p2 ∗n∗i+ n2 ) BayesDL n× pf ilter m× N p + 100 ∗ p + 1000 O(i ∗ c ∗ P ∗ n ∗ 50 ∗ o (post pf ilter ) warmup) Based on the data dimensionality, the study findings suggest various options for researchers who are dealing with whole-genome SNP data. Table 113 5.1 displays that the workflows have different data dimensionality and interested researchers can choose a workflow based on the data they are using. Researchers using all SNPs (or features) as the input may prefer using PentaPen. By simply inputting their data and pre-processing it according to the variability in the nucleotide notations in the data, researchers can obtain the final set of SNPs using PentaPen. However, BayesDL requires significant computational resources (32GB computation RAM) even after preliminary feature selection. Therefore, if researchers have SNPs in the range of 90 to 250, the suggested approach for SNP identification would be to use BayesDL. However, if they have a smaller sample size than 175, then after selecting SNPs less than the number of samples (p