THOMPSON RIVERS UNIVERSITY

Benchmarking penalized regression methods in machine
learning for single-cell RNA sequencing data

By

Bhavithry Sen Puliparambil

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science in Data Science
KAMLOOPS, BRITISH COLUMBIA

April, 2022

SUPERVISORS
Dr Jabed Tomal
Dr Yan Yan
© Bhavithry Sen Puliparambil, 2022

ABSTRACT
Single-Cell RNA Sequencing (scRNA-seq) technology has enabled the
biological research community to explore gene expression at a single-cell resolution. By studying differences in gene expression, it is possible to differentiate cell clusters and types within tissues. One of the major challenges in
a scRNA-seq study is feature selection from high dimensional data. There
are several statistical and machine learning methods available to solve this
problem but their performances across data sets lack systematic comparison.
In this research, we benchmark different penalized regression methods which
are suitable for scRNA-seq data. Results on four different scRNA-seq data
sets show that Sparse Group Lasso (SGL) implemented by the SGL R package performs better than other methods in terms of area under the receiver
operating curve (AUC). The computation time for different methods varies
between data sets with SGL having the least average computation time.
Based on our findings, we propose a new method for scRNA-seq clustering
which applied SGL on a pre-selected subset of genes. These selected genes
are the union of top important genes from the ridge, lasso, elastic net, and
droplasso methods. The proposed method demonstrates an improvement in
AUC compared to SGL and other methods as well.
Key Words: Single-Cell RNA Sequencing; Machine Learning; LASSO;
Feature Selection; High Dimensional Data; R Programming Language.

ii

ACKNOWLEDGEMENTS
Foremost, I owe a debt of gratitude to my supervisors Dr Jabed Tomal and
Dr Yan Yan for guiding me through this research. Their combined knowledge
is the foundation of this research, and they entrusted me to build on it. Dr
Yan Yan also granted me a lab space at the TRU graduate research lab.
I would like to thank the program coordinator of MSc Data Science program,
Dr Qinglin (Roger) Yu for helping me stay to on track. My sincere thanks to
the faculty at Thompson Rivers University, especially Dr Piper Jackson, Dr
Mohamed Tawhid, Dr Becky Wei Lin, Dr Richard Taylor, Dr Mateen Shaikh,
and Dr Mila Kwiatkowska for listening and encouraging me to do better.
I acknowledge Compute Canada for hosting the 32 GB Linux remote server
which is used for computation in this research. I would also like to acknowledge the funding received for this research from:
1. TRU Internal Research Fund (IRF) awarded to Dr Jabed Tomal, Department of Mathematics and Statistics, Thompson Rivers University, and
Dr Yan Yan, Department of Mathematics and Statistics, Thompson Rivers
University, and
2. Natural Sciences and Engineering Research Council of Canada (NSERC)
fund awarded to Dr Jabed Tomal, Department of Mathematics and Statistics,
Thompson Rivers University, and Dr Yan Yan, Department of Computing
Science, Thompson Rivers University.
It is my privilege to thank my husband Mr Jose Mathews Kanjooparambil
for his friendship and counsel that help me get through everything. Finally,
I thank my parents, brother, and the god for their unconditional love.
iii

Contents

1 Introduction

1

2 Literature Review

4

2.1

High Throughput Sequencing . . . . . . . . . . . . . . . . . .

4

2.2

Single Cell RNA Sequencing . . . . . . . . . . . . . . . . . . .

5

2.2.1

Smart-seq2 . . . . . . . . . . . . . . . . . . . . . . . .

6

2.2.2

Drop-seq . . . . . . . . . . . . . . . . . . . . . . . . . .

6

2.2.3

Chromium . . . . . . . . . . . . . . . . . . . . . . . . .

6

2.2.4

STRT-seq . . . . . . . . . . . . . . . . . . . . . . . . .

7

2.3

Popular Machine Learning and Statistical Algorithms . . . . .

7

2.4

Penalized regression in published research . . . . . . . . . . .

8

2.4.1

Comparison studies involving penalized regression methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

iv

9

CONTENTS
2.4.2
2.5

New algorithms based on penalized regression . . . . .

New algorithms based on ML methods . . . . . . . . . 12

Software packages for ML methods . . . . . . . . . . . . . . . 14

3 Methodology
3.1

3.2

3.3

9

Other ML methods in published research . . . . . . . . . . . . 11
2.5.1

2.6

v

16

Penalized Regression . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1

Ridge Regression . . . . . . . . . . . . . . . . . . . . . 17

3.1.2

LASSO Regression . . . . . . . . . . . . . . . . . . . . 18

3.1.3

Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . 18

3.1.4

Group LASSO . . . . . . . . . . . . . . . . . . . . . . . 18

3.1.5

Sparse Group LASSO . . . . . . . . . . . . . . . . . . . 19

3.1.6

Drop LASSO . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.7

Big LASSO . . . . . . . . . . . . . . . . . . . . . . . . 21

Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1

Hierarchical Clustering . . . . . . . . . . . . . . . . . . 22

3.2.2

K-Means Clustering . . . . . . . . . . . . . . . . . . . . 22

K-Fold Cross-Validation . . . . . . . . . . . . . . . . . . . . . 22

CONTENTS
3.4

Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1

3.5

vi

ROC AUC . . . . . . . . . . . . . . . . . . . . . . . . . 23

Research design . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Experiments and Results

27

4.1

Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.3

Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5 Conclusion

49

5.1

Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.2

Future perspective . . . . . . . . . . . . . . . . . . . . . . . . 51

A Source Codes

64

List of Figures

3.1

Schematic of the proposed algorithm. In this algorithm, there
is a significant reduction in the number of genes prior to the
execution of SGL. Once the final set of genes are selected by
the algorithm, it is used to cluster cells in the data set. . . . . 25

4.1

The gene pool of data set GSE123818 formed by taking the
union of top important genes from ridge, lasso, elastic net, and
droplasso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2

The gene pool of data set GSE71585 formed by taking the
union of top important genes from the ridge, lasso, elastic net,
and droplasso. Note that for this data set, the top important
genes of droplasso have no intersection with the other 3 algorithms. In the proposed algorithm, gene pool is formed with
the union of the top important genes from the 4 algorithms
rather than an intersection because there may not always be
an intersection due to the difference in regularization used.

vii

. 34

LIST OF FIGURES
4.3

viii

Average cross-validated AUC for the 4 data sets. Even though
group lasso has better AUC than SGL, SGL is better in terms
of gene selection. Selecting the differentially expressed genes
is of more importance for a scRNA-seq data set compared to
the prediction performance via AUC. . . . . . . . . . . . . . . 36

4.4

Genes vs coefficients plot of data set GSE60749. Here piRNA
44441 is the top important gene. . . . . . . . . . . . . . . . . . 38

4.5

Genes vs coefficients plot of data set GSE71585. Calm2 and
Spap25 are the top important genes. . . . . . . . . . . . . . . 38

4.6

Genes vs coefficients plot of data set GSE81861. FABP1 and
SAT1 are the first 2 top important genes. . . . . . . . . . . . . 39

4.7

Genes vs coefficients plot of data set GSE123818. AT2G43610
and AT4G05320 are the top 2 genes. . . . . . . . . . . . . . . 39

4.8

Cell clustering with final selection of genes for data set GSE60479.
The top important gene (piRNA 44441) alone can differentiate two cell groups perfectly. . . . . . . . . . . . . . . . . . . . 42

4.9

Cell Clustering (K-Means) with final selection of genes for data
set GSE71585. The top important genes Snap25 and Calm2
are able to cluster two cell groups. . . . . . . . . . . . . . . . . 44

4.10 Cell Clustering (K-Means) with final selection of genes for data
set GSE81861. There is some overlap between cell groups
along the axis of top important genes SAT1 and FABP1. . . . 45

LIST OF FIGURES

ix

4.11 Cell Clustering (K-Means) with final selection of genes for data
set GSE123818. There is notable overlap between the short
root knockout and wild type cell groups of Arabidopsis Thaliana. 47

List of Tables

4.1

Experimental data sets . . . . . . . . . . . . . . . . . . . . . . 28

4.2

Comparison of algorithm using CV-AUC. . . . . . . . . . . . . 36

4.3

Variance of CV-AUC. . . . . . . . . . . . . . . . . . . . . . . . 37

4.4

Comparison of algorithm using computation time (Seconds). . 37

4.5

Comparison of performance (AUC) between SGL with all genes
and SGL using new algorithm. . . . . . . . . . . . . . . . . . . 37

4.6

Final selection of genes from GSE60749 . . . . . . . . . . . . . 41

4.7

Final selection of genes from GSE71585 . . . . . . . . . . . . . 43

4.8

Final selection of genes from GSE81861 . . . . . . . . . . . . . 46

4.9

Final selection of genes from GSE123818 . . . . . . . . . . . . 48

x

Chapter 1
Introduction
Single-cell RNA sequencing (scRNA-seq) technology is a recent trend in biological research. Researchers can now simultaneously explore thousands of
cells in a tissue and their average gene expression levels, as well as the gene
expression profile of each individual cell in that tissue with scRNA-seq technology (Slovin et al. [2021]). One of the many applications of scRNA-seq
technology is differentiating tumour cells from normal healthy cells by comparing their molecular signatures. However, the scRNA-seq data itself is not
without its challenges (Kiselev et al. [2019]). One could say it is the very
definition of the curse of dimensionality (p >> n, where p is the number of
variables and n is the number of observations) problem in machine learning
(ML). For scRNA-seq data, the number of genes (variables) far exceeds the
number of cells (observations). One way to solve the p >> n problem is to
employ feature selection.
Feature selection is the method of selecting variables (here, genes) that
are more useful for predicting the target variable. Random forests, Recursive
1

Feature Elimination (RFE) and penalized regression are some of the commonly used feature selection methods in machine learning. Multiple studies
have been published on the application of random forests for scRNA-seq data
(Kaymaz et al. [2020], Pouyan and Kostka [2018]). For very high dimensional
data such as scRNA-seq data, using RFE alone tend to be computationally
expensive (Chen and Jeong [2007]). Furthermore, some of the penalized regression algorithms are developed specifically for scRNA-seq data and showed
varied success.
Penalized regression allows feature selection for high dimensional data
such as scRNA-seq by producing sparse solutions which are predictive models
based on the expression of a limited number of genes. Penalized regression in
machine learning has several versions such as ridge regression, Least Absolute
Shrinkage and Selection Operator (LASSO) regression (Tibshirani [1996])
and a combination of ridge and lasso known as elastic net regression (Zou
and Hastie [2005]). Each of these is useful for a different problem when
dealing with scRNA-seq data. For instance, ridge regression is useful for
bringing some of the coefficients of the model features closer to 0. To reduce
the dimensions, the features close to zero are then proposed to be excluded
while others are retained in the model. This notion is also known as hard
thresholding. This issue is taken care of by lasso regression proposed by
(Tibshirani [2011]), which can make coefficients equal to absolute zero (soft
thresholding). There are many variants of lasso such as Drop lasso (Khalfaoui
and Vert [2018]), Group lasso (Yuan and Lin [2006]), Sparse Group lasso
(Simon et al. [2013]), and Big lasso (Zeng and Breheny [2017]). We are
interested to see which of these methods are suitable for different kinds of
scRNA-seq data.

2

There also exist several other variants of lasso regression such as fused
lasso (Tibshirani et al. [2005]), adaptive lasso (Zou [2006]), and prior lasso
(Jiang et al. [2016]). Fused lasso was proposed for time series or imagebased data, adaptive lasso was proposed for proportional hazards regression,
and prior lasso (Jiang et al. [2016]) was proposed for biological data but
requires prior information to be incorporated into lasso regression. Since
these algorithms are not proposed for scRNA-seq data, they are not included
in this study.
Currently, there are several studies on machine learning algorithms which
claim to be capable of processing high dimensional data, and some are developed specifically for scRNA-seq data. However, to our knowledge, there is
no comprehensive study of how these algorithms perform in comparison with
each other, specifically when dealing with scRNA-seq data. The primary
objective of this study is to fill this gap in knowledge and thereby provide
a comprehensive guideline as to the performance of penalized regression algorithms in terms of the prediction performance as well as the computation
time. The second objective of this study is to select the top-performing algorithms as per the results from the first objective, investigate their combined
performance, and propose a new algorithm that may outperform the top algorithm.
The rest of this thesis is organized as follows. Chapter 2 discusses the
literature review; Chapter 3 explains the research design and proposed algorithm; Chapter 4 discusses the experimental data, presents the results,
analyzes the findings and biological meanings; Chapter 5 proposes future
work directions.

3

Chapter 2
Literature Review
This chapter briefly explains genome sequencing techniques, some of the
popular machine learning and statistical methods for scRNA-seq, and the
existing studies about them. The published studies vary by methods and
data sets as well as the challenges of scRNA-seq data being addressed by the
algorithms. It can be seen that there are very few studies that include more
than one penalized regression method discussed in this research.

2.1

High Throughput Sequencing

The first draft of the human genome sequence was available in 2001 (Lander et al. [2001]). Sanger DNA sequencing was used for this project. This
technology was expensive and limited in throughput. The National Human
Genome Research Institute (NGHRI) funded research in genome sequencing
which led to invention of many High Throughput Sequencing (HTS) tech-

4

nologies. HTS typically involved template preparation, clonal amplification,
and repeated massive parallel sequencing (Reuter et al. [2015]). RNA sequencing (RNA-seq) is a popular HTS method. Bulk RNA-seq measures the
average gene expression of a tissue sample. RNA-seq is used for transcript
analysis. Illumina, Roche 454 and PacBio are some of the commercial platforms for RNA-seq.

2.2

Single Cell RNA Sequencing

scRNA-seq is more advanced than bulk RNA-seq because scRNA-seq can
identify the gene expression of individual cells as well as the average gene
expression of the sample. scRNA-seq is useful for identifying cellular heterogeneity. However, scRNA-seq has low capture efficiency and high dropouts
compared to bulk RNA-seq. Quality control is also necessary for removing
technical noise from scRNA-seq data (Chen et al. [2019]). Machine learning
methods such as the drop lasso regression was designed to overcome the issue of dropout noise in scRNA-seq. Batch effect correction, normalization,
imputation, dimensionality reduction, and feature selection are often used
in scRNA-seq data analysis. scRNA-seq protocols can be classified into two
categories, full-length transcript sequencing approaches and 3’-end or 5’-end
transcript sequencing (Chen et al. [2019]). Smart-seq2, Drop-seq, STRT-seq
and Chromium are some of the popular scRNA-seq methods.

5

2.2.1

Smart-seq2

Smart-seq2 is a full-length transcript sequencing technique. It is characterized by improved reverse transcription, template switching, and preamplification which increase both the yield and length of cDNA libraries generated
from individual cells. Its limitations are the lack of strand specificity and the
inability to detect nonpolyadenylated RNA (Picelli et al. [2013]).

2.2.2

Drop-seq

Drop-seq is a 3’-end transcript sequencing technique. This technology quickly
profiles thousands of individual cells by enabling highly parallel analysis of individual cells by RNA-seq. Drop-seq separates individual cells into nanolitersized aqueous droplets, associating a different barcode with each cell’s RNAs,
and sequencing them all together (Macosko et al. [2015]).

2.2.3

Chromium

Chromium is another example of the 3’-end transcript sequencing technique.
It is a droplet-based system that enables 3’ mRNA counting of tens of thousands of single cells per sample. This technology can capture approximately
50% of cells loaded into the system. Parallel processing enables it to analyze
up to eight samples per run. The droplets undergo reverse transcription, and
barcoded complementary DNAs (cDNAs) are bulk amplified. Illumina shortread sequencing is applied to the resulting libraries followed by Cell Ranger
which processes the sequencing data and enables automated cell clustering

6

(Zheng et al. [2017]).

2.2.4

STRT-seq

STRT-seq is a 5’-end transcript sequencing technique. The advantage of this
technology is that it eliminates the need for known markers to classify cell
types. In STRT-seq scRNA-seq expression profiles are clustered to form a
two-dimensional cell map onto which expression data is projected. Three levels of the organization is integrated into this cell map: the whole population
of cells, the functionally distinct subpopulations it contains, and the single
cells themselves (Islam et al. [2011]).

2.3

Popular Machine Learning and Statistical Algorithms

This section discusses some of the popular machine learning and statistical
algorithms, other than penalized regression algorithms, that have been discussed in academic literature, in conjunction with scRNA-seq data.
We have already seen that data sets such as scRNA-seq present the challenge of high dimensionality. There are several algorithms developed for
dimensionality reduction. Principal component analysis (PCA) and Linear
Discriminant Analysis (LDA) are two such dimensionality reduction methods. Abdi and Williams [2010] describe PCA as a multivariate technique
that extracts information from several inter-correlated quantitative dependent variables and create a set of new orthogonal variables called principal
7

components. LDA finds the projection hyperplane that minimizes the interclass variance and maximizes the distance between the projected means
of the classes (Xanthopoulos et al. [2013]). Recursive feature elimination is
another dimesionality reduction method. However, these methods are not
used in my thesis because of their computational complexity.
Support Vector Machines, Naive Bayes Classifier, Decision Trees, Neural
Networks, Ensemble Methods such as Bagging, Random Forests, and Boosting are some of the machine learning algorithms used for processing high
dimensional data. These algorithms have been studied widely and are not
included in my thesis.

2.4

Penalized regression in published research

Evidence for use of the penalized regression methods for scRNA-seq data in
the existing literature is discussed in this section. If a published study analyses at least one method of penalized regression, it is included for discussion
here.
Lasso regression has been used by a study (Huynh et al. [2019]) for the
identification of a small number of highly influential genes and classification
of cell types in scRNA-seq data with high accuracy. Huynh et al. [2019] used
lasso regression to show that multiple subpopulations of cells existed at all
stages of bone marrow-derived mesenchymal stem cells (MSC) chondrogenesis. However, this study did not include any other variants of the lasso
algorithm such as elasticnet or drop lasso. The next section discusses studies
conducted as a comparative analysis of machine learning methods.

8

2.4.1

Comparison studies involving penalized regression methods

Cao et al. [2021] conducted a systematic evaluation of methods for cell phenotype classification using single-cell RNA sequencing data. This study benchmarked methods such as Elastic net, Naive Bayes (NB) Classifier, and XGBoost (XGB). The result of this study concluded that Elastic Net with interactions performed best in small and medium data sets, NB worked well
with medium data sets, and XGB worked well with large data sets. However,
their study did not shed light on the applicability of penalized regression algorithms except Elastic net on scRNA-seq data.
Scialdone et al. [2015] evaluated five computational methods (RF, logistic regression, lasso regression, and SVM), and proposed a new classification
method called Pairs using the relative expression of “marker pairs” for predicting the cell-cycle stage of single cells from their transcriptome. Their
study showed that sophisticated approaches such as RF or SVM suffered
from overfitting while PCA-based predictor and Pairs method achieved a
strong enough regularization which captured a generalizable cell-cycle signature in the transcriptome.

2.4.2

New algorithms based on penalized regression

Apart from the major machine learning algorithms discussed earlier in this
thesis, several other methods have been developed by combining those algorithms. A few of those methods are discussed below.

9

Hua et al. [2020a] proposed a novel method LAK for clustering singlecell RNA-seq data. LAK is a computational method that uses Lasso and
K-means based feature selection methods. Their study demonstrated that
LAK obtains a better performance in reliability, stability, convenience and
accuracy compared with other computational approaches. Evidently, the
research was about improving lasso but failed to investigate its variants.
Climente-González et al. [2019] proposed block HSIC (the Hilbert–Schmidt
Independence Criterion) Lasso, a non-linear feature selector that overcomes
drawbacks such as lack of parsimony, non-convexity and computational overhead related to scRNA-seq data. The results of their study showed that
features selected by block HSIC Lasso retained more information about the
underlying biology than those selected by other techniques. The drawback
of their study is that block HSIC Lasso was only compared to HSIC Lasso
and least-angle regression (LARS).
Li et al. [2011] proposed a new RNA-Seq based transcriptome assembly
tool called IsoLasso. IsoLasso is a lasso algorithm with some additional
constraints enforced. The study claims that IsoLasso has higher sensitivity
and precision than the state-of-art transcript assembly tools.
Jiang et al. [2020] proposed a Bayesian Robit regression method with
Hyper-LASSO priors (BayesHL) for feature selection in high dimensional
genomic data with grouping structure. BayesHL discards more aggressively
unrelated features than LASSO, and it makes feature selection within groups
automatically without a pre-specified grouping structure. Results show that
BayesHL outperforms alternative methods (including LASSO, group LASSO,
supervised group LASSO, penalized logistic regression, random forest, neural
network, XGBoost and knockoff) in terms of predictive power, sparsity and
10

the ability to uncover grouping structure. However BayesHL study used only
Endometrial Cancer RNA-seq Data. In contrast, my thesis use 4 data sets
from 3 different species to ensure the methods are being tested on a variety
of data sets.
Evidently, none of the above studies discuss the applicability of many
of the penalized regression methods selected for this research. More importantly, no study has investigated the combined performance of lasso algorithms.

2.5

Other ML methods in published research

Huang et al. [2021] discussed computational methods for distinguishing cell
subtypes from the different pathological regions of non-small cell lung cancer
on the basis of transcriptomic profiles. In their study, the random forest
classifier reached a Matthew’s correlation coefficient (MCC) of 0.922 by using
720 features, and the decision tree classifier reached an MCC of 0.786 by using
1880 features.
The popularity of deep learning methods for scRNA-seq data was studied
by Zheng and Wang [2019]. They concluded that while Autoencoders are
the dominant approach, methods based on deep generative models such as
generative adversarial networks (GANs) are emerging in scRNA-seq data
analysis.
Oller-Moreno et al. [2021] studied the emergence of Deep Learning algorithms and the incorporation of prior biological knowledge into Machine
Learning models. They argued that prior knowledge can be seen as model
11

regularization that increases model generalization capabilities and stability
while it supports the biological interpretation of results.
Along with many popular machine learning algorithms, neural networks
also has been employed for analysing scRNA-seq data. Lin et al. [2017]
tested different neural network architectures and used these to reduce the
dimensionality of scRNA-seq data.
Studies were also conducted to reinforce the importance of machine learning in medical research and discussed its future potential. Asada et al.
[2021] investigated the applicability of machine learning techniques such as a
deep convolutional neural network, Latent semantic indexing cluster analysis
which uses dimensional compression to normalize and cluster the data, and
graphical lasso model for single-cell analysis. This study demonstrated the
implementation and usefulness of all three methods.

2.5.1

New algorithms based on ML methods

Alquicira-Hernandez et al. [2019] proposed a new generalized method called
scPred that claim to provide highly accurate classification of single cells,
using a combination of unbiased feature selection from a reduced-dimension
space, and machine-learning probability-based prediction method. However,
this method has not been verified in comparison with existing algorithms.
Similarly, in another study, Tian et al. [2019] developed scDeepCluster
which is a single-cell model-based deep embedded clustering method. This
method simultaneously learns feature representation and clustering via explicit modelling of scRNA-seq data generation.

12

Yan et al. [2020] developed a machine learning method called Single cell
Predictive markers (SPmarker) to identify novel cell-type marker genes in the
Arabidopsis root. As per their sudy, SPmarker method identified hundreds of
new marker genes that were not identified before. However, the performance
of SPmarker has not been compared with other machine learning algorithms.
Aevermann et al. [2021] described a machine learning-based marker gene
selection algorithm, NS-Forest version 2.0, which leverages the nonlinear attributes of random forest feature selection with a binary expression scoring
approach to discover the minimal marker gene expression combinations that
optimally capture the cell-type identity represented in complete scRNA-seq
transcriptional profiles.
scRNA-seq data is often corrupted by dropout noise. As discussed earlier
under penalized regression, droplasso method was developed to overcome this
problem. Some studies have also been conducted to investigate the use of
machine learning to impute missing data due to such dropout noise. Yang
et al. [2018] proposed missing imputation for single-cell RNA (MISC) which
is a robust missing data imputation model using hybrid machine learning for
single-cell RNA-seq data. Their results showed that MISC improved the cell
type classification.
Ensemble methods were used by Xiong et al. [2021] to address the problem of doublets in scRNA-seq data. There are multiple methods to detect
doublets which happens when two cells are captured as one in scrNA-seq
data. Xiong et al. [2021] proposed Chord which integrated multiple doublet
detection methods and demonstrated higher accuracy.
In the study conducted by Hu et al. [2016], the Support Vector Machine

13

based recursive feature elimination (SVM-RFE) method of feature selection
was used for the identification of key markers involved in brain development
from scRNA-seq data.

2.6

Software packages for ML methods

Some studies analysed different software packages for machine learning algorithms for dealing with the challenges of scRNA-seq data. Petegrosso et al.
[2020] conducted a study of machine learning and statistical methods for
clustering single-cell RNA-sequencing data. Their study included clustering
methods such as k-Means, BackSPIN, cellTree, CIDR, DendroSplit, ICGS,
Monocle, pcaReduce, SC3, SCRAT, Seurat 1.0, and SNN-Cliq. Even though
their study included many of the libraries in R, it did not consider Lasso
algorithms for analysis.
Vrahatis et al. [2020] studied recent machine learning approaches for
Single-Cell RNA-seq data analysis. This extensive study included 23 dimensionality reduction methods such as PCA, and LDA, 4 classification methods
such as KNN, Random Forest, and bagging, 11 clustering methods such as
Seurat, and Monocle, and 7 other popular analytical tools such as salmon.
But even this comprehensive study conducted in 2020 also did not include
penalized regression methods.
In conclusion, even though an abundance of literature has been published
on machine learning and statistical methods addressing different challenges
of scRNA-seq data, currently there is a definite lack of a comparative study
of the penalized regression methods for feature selection of the scRNA-seq

14

data.

15

Chapter 3
Methodology
This chapter discusses the penalized regression algorithms that are selected
for my study, explains how research design is developed to answer the research
objectives and describes the proposed algorithm.

3.1

Penalized Regression

Simply speaking, the optimization function of penalized regression has two
parts, a loss function, and a penalty. Let the regression equation be,

Y = Xβ + ϵ,

(3.1)

where Y is a n×1 vector for response variable, X is n×p matrix for predictor
variables, β is p×1 coefficient vector and ϵ ∼ N (0, σ 2 I) is the error term. We
consider that both X and Y are scaled in Eq. 3.1. The estimated penalized
regression coefficients is given by,
16

1
β̂ = argmin( ||Y − Xβ||2 + λ||β||),
n
β

(3.2)

where λ ≥ 0 is the tuning parameter to be estimated using cross-validation.
||β|| is the norm of coefficient vector β. The first term n1 ||Y − Xβ||2 is the
loss function and the second term λ||β|| is the penalty. The main difference between different penalized regressions algorithms is how they apply
the penalty. L1 norm and L2 norm are popular choices for penalty Khalfaoui
and Vert [2018].

L1 norm = ||β||1 =

p
X

|β i |,

(3.3)

i=1

L2 norm = ||β||22 =

p
X

β 2i .

(3.4)

i=1

3.1.1

Ridge Regression

Ridge regression is a penalized regression where the penalty term is the sum
of squared coefficients. It is especially useful when predictor variables are
highly correlated. The estimator of ridge regression is,

1
β̂ = argmin( ||Y − Xβ||2 + λ||β||22 ),
n
β

(3.5)

where ||β||22 is the L2 norm of the coefficients.

17

3.1.2

LASSO Regression

The Least Absolute Shrinkage and Selection Operator (LASSO) proposed by
Tibshirani [1996] minimizes the residual sum of squares subject to the sum
of the absolute coefficients being less than the tuning parameter. Compared
to the ridge regression, which can only shrink coefficients towards 0, the lasso
can make some coefficients exactly equal to zero thereby producing a more
interpretable model. The estimator of lasso regression is,
1
β̂ = argmin( ||Y − Xβ||2 + λ||β||1 ),
n
β

(3.6)

where ||β||1 is the L1 norm of the coefficients.

3.1.3

Elastic Net

Elastic net regression proposed by Zou and Hastie [2005] is a combination of
L1 and L2 penalties of lasso and ridge regression. The estimator of elastic
net is,
1
β̂ = argmin( ||Y − Xβ||2 + λ1 ||β||1 + λ2 ||β||22 ).
(3.7)
n
β
The elastic net regression enjoys the properties of both ridge and lasso regression. The L1 norm part of the penalty generates a sparse model and the
L2 norm part of the penalty encourages greater shrinkage to large coefficients
(Zou and Hastie [2003]).

3.1.4

Group LASSO

Yuan and Lin [2006] proposed Group lasso algorithm for selecting a subset
of important variables which are the main effects with interactions with each
18

other in a regression model. Compared to the lasso model which select individual variables, group lasso select groups of variables. This is particularly
useful in processing scRNA-Seq data because we would like to include or exclude the group of genes that lie in a pathway related to the outcome rather
than individual genes. Assume there are j = 1, 2, · · · , J groups of variables
and n observations. For each group, let Xj be n × pj submatrix of X with
columns corresponding to predictor variables in group j and β j be the corresponding coefficient vector of length pj . Then the regression equation for
group lasso regression can be written as,

Y =

J
X

Xj β j + ϵ.

(3.8)

j=1

Note that for β = (β ′1 , β ′2 , · · · , β ′j )′ , X = (X1 , X2 , · · · .Xj ), and X′j Xj = Ipj ,
the above regression equation simplifies to Eq. (3.1). For a symmetric and
positive definite kernel matrix Kj = pj Ipj , the group lasso estimate is

J
J
X
X
1
1
′
β̂ = argmin( ||Y −
Xj β j ||2 + λ
||β j Kj β j || 2 ),
n
β
j=1
j=1

(3.9)

where the tuning parameter is λ ≥ 0.

3.1.5

Sparse Group LASSO

The shortcoming of group lasso is that while it gives a sparse set of groups,
all the coefficients in a group will be nonzero if that group is included in
the model. But sometimes both sparsity of groups and variables within each
group are desired in a model simultaneously. In the case of scRNA-seq data,
identifying some important genes in the biological pathways is of interest.
19

Simon et al. [2013] proposed sparse group lasso as a solution to this specific
problem. The estimator of sparse group lasso is
J
J
X
X
1
1
′
2
β̂ = argmin( ||Y −
Xj β j || + λ
||β j Kj β j || 2 + αλ||β||1 )),
n
β
j=1
j=1

(3.10)

where the tuning parameter is λ ≥ 0, and α ∈ [0, 1] is a convex combination
of the lasso and group lasso penalties. When α = 0, the model reduces to
group lasso and when α = 1 it reduces to lasso which makes this approach
remarkably similar to elastic net regression.

3.1.6

Drop LASSO

scRNA-seq data is often corrupted by dropout noise. Corruption of data here
means incorrect values. For instance, dropout noise occurs when scRNA-seq
fails to detect some genes even though they are expressed in the cell (Khalfaoui and Vert [2018]). Consequently, those genes (columns) will have zeros
in the data set which is incorrect. Khalfaoui and Vert [2018] proposed Drop
Lasso as a better-adapted solution to data corrupted by dropout noise. It is
a combination of the dropout regularisation technique proposed by Srivastava et al. [2014] and lasso proposed by Tibshirani [1996]. It creates a sparse
linear model robust to the noise by artificially augmenting the training set
with new examples corrupted by dropout. First, a random permutation of
rows is chosen from matrix X with n observations and p predictor variables.
Then each of the chosen rows in X undergo an element-wise multiplication
with a random dropout mask (vector of 1s and 0s) of length p to create a new
dropout corrupted matrix Xdrop . The Drop Lasso estimator is calculated as,

20

1
β̂ = argmin( ||Y − Xdrop β||2 + λ||β||1 ),
n
β

(3.11)

where the tuning parameter λ ≥ 0.

3.1.7

Big LASSO

Zeng and Breheny [2017] proposed Big lasso algorithm for handling ultra-high
dimensional and large scale data in R. Their approach handles out-of-core
computation seamlessly by loading data into memory only when necessary
while model fitting. This is done with the help of memory-mapped files
which stores massive data on the disk. The Big lasso algorithm also possesses
efficient feature screening rules which can accelerate the computation (Zeng
and Breheny [2017]). The estimator of Big lasso regression is,
1
β̂ = argmin( ||Y − Xβ||2 + λ||β||1 ).
n
β

(3.12)

The major differences between Big lasso and lasso are in out-of-core computation and parallel processing.

3.2

Clustering

Clustering is the process of grouping data into clusters so that objects of
the same cluster has high similarity (Rani1 and Rohil [2013]). Two popular
clustering methods are hierarchical clustering and K-Means clustering which
are described below.

21

3.2.1

Hierarchical Clustering

In hierarchical clustering, objects are grouped into a hierarchy of clusters by
nested sequential partitioning (Rani1 and Rohil [2013]). These clusters are
then graphically represented as a tree-like diagram known as dendrogram.
The dendrogram is very useful in deciding the optimal number of clusters.
In this study, we use hierarchical clustering to group genes into clusters prior
to applying group lasso and SGL. This eliminates the need to have deep
knowledge of genes to group them for processing in the algorithms.

3.2.2

K-Means Clustering

K-Means clustering algorithm starts by randomly placing k centroids in p
data points scattered in the n-dimensional space (Hartigan and Wong [1979]).
Clusters are formed by assigning data points to the nearest centroid. The
algorithm progress iteratively by moving the centroids at each step such
that the clustering error is minimized. K-Means clustering when applied in
conjunction with Lasso improved prediction accuracy (Hua et al. [2020b]).
In my study, K-means clustering is employed to cluster cells at the final step
of the proposed methods. Note that the hierarchical clustering is used to
group the genes while K-means clustering is used to cluster cells.

3.3

K-Fold Cross-Validation

Cross-validation is a data partitioning method used to estimate the prediction
error of models and to tune model parameters (Bates et al. [2021]). We will
22

use K-Fold Cross-Validation (CV) for tuning the hyperparameters for all
the algorithms in this study. In K-Fold CV, the data are first divided into k
subsets of cells. One of the k subsets is used as the test set and the remaining
k − 1 subsets are used as the training set. Then prediction error is calculated
for k repeats by selecting a different test set each time. The average error for
k repeats is used as the prediction performance of the model. In this study,
we use 10-fold CV for measuring the performance of the algorithms.

3.4

Metrics

This section explains the metric used to measure performance of my algorithms.

3.4.1

ROC AUC

A receiver operating characteristic curve (ROC) is a graph that plots the true
positive rate (TPR) on the vertical axis and the false positive rate (FPR)
on the horizontal axis (Park et al. [2004]). ROC is used to evaluate the diagnostic ability of a binary classification model. The area under the ROC
curve is known as ROC AUC. The AUC value reflects the overall ranking
performance of a classifier. The AUC is theoretically and empirically better
than the accuracy metric for evaluating the classifier performance and discriminating an optimal solution during the training of a classifier. However,
the computational cost of AUC is high relative to other evaluation metrics
(Hossin and Sulaiman [2015]).

23

3.5

Research design

As the first step of the analysis, each data set is pre-processed to be compatible for the use in different R packages. With the processed data, we
verify how each algorithm performs in terms of AUC and computation time
when dealing with scRNA-seq data. We use the same performance measures
(AUC and computation time) and CV for all data sets to ensure a fair comparison. In this step, a 10-fold Stratified CV is conducted to fine-tune the
hyperparameters for each algorithm, and then the performance metrics for
all the algorithms are calculated. We used hierarchical clustering for grouping variables before Group Lasso and Sparse Group Lasso. After comparing the performance metrics, better-performing algorithms are selected and
combined to form a new algorithm. Cell clustering (using K-Means) with
the final selected genes from the proposed algorithm is used to identify how
well the cell groups in each data are differentiated by those genes. Finally,
the performance metrics of the new algorithm are compared with that of the
top-performing algorithm. Fig. 3.1 illustrates the proposed algorithm.

24

Figure 3.1: Schematic of the proposed algorithm. In this algorithm, there is
a significant reduction in the number of genes prior to the execution of SGL.
Once the final set of genes are selected by the algorithm, it is used to cluster
cells in the data set.

25

Algorithm 1 Steps implementing the proposed algorithm
1. Load data to R and assign classes 1 and 0 to the two selected group of
cells to form a binary classification problem.
2. Remove genes with no variability in expression across all cells.
3. Shuffle cells within each class to randomize data points.
4. Split 90% of the data set into training and 10% into testing data for
10-fold CV.
5. Repeat the steps below for the 10-fold CV
(a) Fit ridge, lasso, elastic net, and drop lasso.
(b) Select the top important genes for each algorithm. The top important genes are the genes which have coefficients above a cut off
(e.g. mean of absolute value of the coefficients).
(c) Form a gene pool by taking the union of the top important genes
from all the 4 models. For example, Fig. 4.1 and Fig. 4.2 represent
gene pool of data set GSE123818 and GSE71585, respectively.
(d) Fit SGL with new gene pool grouped by hierarchical clustering
algorithm.
(e) Save the coefficients of SGL.
6. Find the average of coefficients for each gene across 10 folds and sort
the genes.
7. Visualize the gene vs coefficients plot and select the final set of genes
which are above the elbow of the curve.
8. Cluster all the cells by applying K-means clustering on the top important genes.
26

Chapter 4
Experiments and Results
In this chapter, we discuss the experimental data, results of the analysis, final
subset of important genes selected with the new algorithm, their biological
functions, and clustering of cells with this final subset of genes for each of
the data set separately.

4.1

Experimental Data

In this study, 4 scRNA-seq data sets from 3 different species (Human, mouse
and plant) are used. The first two data sets (GSE60749 and GSE71585)
are selected from a collection of 40 curated scRNA-seq data sets from conquer website (http://imlspenticton.uzh.ch:3838/conquer/) created by Soneson and Robinson [2017]. Conquer website is a collection of consistently
processed, analysis-ready and well documented publicly available scRNA-seq
data sets. There are currently 40 data sets on conquer website, each having

27

Table 4.1: Experimental data sets
Data set

Genes

Cells

Cell Labels

Organism

Source

GSE60749

224444

183:84

mESCs : Dgcr8

Mus mus-

conquer

-/- mESCs

culus

Ntsr1 tdTpos-

Mus mus-

itive

culus

GSE71585

24058

79:57

cells

:

conquer

Ntsr1 tdTnegative cells
GSE81861

57241

272:160

Colorectal

Homo

tumor cells :

sapiens

GEO

Normal mucosa
cells
GSE123818 27629

1099:1099 Shortroot

-

knockout cells :

Arabidopsis GEO
Thaliana

Wild-type cells
counts and transcripts per million (TPM) estimates for genes and transcripts,
as well as quality control and exploratory analysis reports. The other two
data sets (GSE81861 and GSE123818) for this study are downloaded from
Gene Expression Omnibus (GEO). Tab. 4.1 shows the data sets with GEO
accession numbers, number of genes, number of cells in each class, species
and the source from where data was accessed.
The first data set GSE60749, downloaded from Gene Expression Omnibus, is of species Mus musculus. This data set was generated in the study
of gene expression variability in pluripotent stem cells (PSCs) by single-cell
expression profiling of PSCs under different chemical and genetic perturbations conducted by Kumar et al. [2014]. Gene expression levels are quantified
28

as transcripts per million reads (TPM).
For our research, we selected 183 individual v6.5 mouse embryonic stem
cells (mESCs) and 84 Dgcr8 -/- mESCs that lack mature miRNAs (knockout
of a miRNA processing factor). The 183 individual mESCs are assigned to
class 1 and 84 Dgcr8 -/- mESCs are assigned to class 0 to create the binary
classification problem. The data set included 22443 genes initially which
was reduced to 15508 after data preprocessing in which all the genes with
no variance in expression. Some of the genes in this data set are noncoding
piRNAs with numbers as names. Such numbers are converted to text by
prefixing them with ‘RNA ’, before loading data to R to make sure that the
names are not converted as dates.
The second data set GSE71585 is also of the species Mus musculus
(mouse). Our research included mouse species data because numerous clinical trials are conducted on mice prior to human trials. GSE71585 data set
was generated by scRNA-Seq of adult mouse primary visual cortex in a study
conducted by Tasic et al. [2016] where the objective was to understand celltype diversity in the nervous system. There are 1809 cells and 24057 genes in
the data set. Gene expression levels are quantified as transcripts per million
reads (TPM). Out of all the cells, 79 Ntsr1 tdTpositive cell are assigned to
class 0 and 57 Ntsr1 tdTnegative cell are assigned to class 1. After removing
the genes which did not vary in expression across all cells, we left only 17870
genes.
The third data set GSE81861 is of species homo sapiens and is from
the analysis of transcriptional heterogeneity in colorectal tumours (Li et al.
[2017]). Intratumoral heterogeneity is a major obstacle to cancer treatment
and a significant confounding factor in bulk-tumor profiling. Therefore, Li
29

et al. [2017] conducted an analysis of transcriptional heterogeneity in colorectal tumors and their microenvironments using scRNA–seq. There are
272 primary colorectal tumor cells and 160 matched normal mucosa cells.
Gene expression levels are quantified as fragments per kilo-base per million
reads (FPKM). A binary classification problem was created by assigning 272
primary colorectal tumor cells to class 1 and 160 matched normal mucosa
cells to class 0. The data was then transposed to form a matrix of 432 rows
(cells) and 57242 columns (genes). Standardization and normalization were
not carried out for this data set because it negatively affected the performance of Lasso algorithms in preliminary analysis. All the genes which were
not expressed (0 values) or equally expressed among all cells were removed,
thereby reducing the number of genes to 38090.
Our last data set GSE123818 belongs to the plant species Arabidopsis
Thaliana (thale cress). This data is obtained from the study of Spatiotemporal Developmental Trajectories in the Arabidopsis Root by Denyer et al.
[2019]. The objective of this study was to distinguish Arabidopsis root cells
by developmental fate and time. The study generated mRNA profiles of
6-day-old wild-type (wt) and shortroot-knockout (shr) Arabidopsis thaliana
roots by deep sequencing of single cell and bulk RNA libraries (wild type
only), in duplicate (bulk & wild-type single cell) and singlicate (shr-3), using
Illumina NextSeq. From these 4727 wt cells and 1099 shr cells are selected
at random for our research. Class 0 was assigned to wt cells and class 1 was
assigned to shr cells to create the binary classification problem. Class 0 has 0
and class 1 has 1 assigned to the value of response variable. There are 27629
genes in the data set which were reduced to 24075 after removing the genes
which did not vary in expression.

30

4.2

Results

The first objective of this study is to compare the performance of the selected
methods. The results of this comparison in terms of average cross-validated
AUC (CV-AUC) and computation time are shown in Tab. 4.2 and Tab. 4.4
respectively. Fig. 4.3 shows the average CV-AUC by algorithm across all
4 data sets. From Tab. 4.2 and Fig. 4.3 we observe that the top 5 algorithms in the order of importance are SGL, grplasso, droplasso, biglasso, and
Lasso. As evident from Tab. 4.3, the variance of CV-AUC is close to 0 for all
methods when rounded to 2 decimal points. Notice that SGL and grplasso
outperform all other methods in terms of average CV-AUC, whereas ridge
regression algorithm has the least average CV-AUC. This could be because
grplasso and SGL incorporate grouping of genes information into model selection, whereas ridge regression treats all the genes equally. On performing
Friedman test (a non parametric statistical test) with the CV-AUC results
of the 7 methods, we found statistically significant difference between their
performance at p − value = 15%. A post-hoc Nemenyi test revealed that
the difference in performance is between SGL and ridge regression. However,
more data sets need to be analysed in future to verify these results.
The average computation time is the least for SGL and biglasso, while
the most time-consuming algorithm is ridge regression. Ridge regression
use all genes in its final model making the computation complex. On the
other hand, SGL can make an entire group of genes, as well as some of the
genes within selected groups, zero (0) resulting in a sparse matrix and lesser
computation time. It is notable that computation time for GSE81861 data
set is higher compared to that of GSE60749 for most of the algorithms due

31

to larger number of non zero coefficients in the former data set.
The second objective of this study is to combine the top performing
lasso algorithms in order to improve the prediction AUC and gene selection.
From the discussion of the results of the first objective, we see that SGL and
grplasso are good candidates for forming a new algorithm. In terms of gene
selection, SGL performs better than grplasso. SGL could identify the top
important genes in one fold of 10-Fold CV, whereas grplasso takes multiple
folds to get the same result. In other words, the results of top important
genes changed for each fold of 10-fold CV for group lasso compared to SGL.
SGL is therefore chosen over grplasso for the new computational algorithm.
SGL achieve better AUC than biglasso in comparable time for data sets of
size 63 MB to 252 MB when tested on a computer with 32 GB processor.
Since biglasso did not show significant improvement in computation time
compared to SGL, biglasso is not included in the new method.
In the new algorithm, we select the ridge, lasso, elastic net, and droplasso
to form a filter which creates a gene pool with the number of genes in it significantly reduced when compared to the whole. The gene pool is formed by
taking a union of top important genes from 4 algorithms because we observed
that the top important genes have some variations between algorithms. A
union of top important genes is, therefore, more likely to capture the important differentially expressed genes. The gene pool thus formed is grouped
and used as input to SGL and then the AUC of SGL fit is calculated.
We execute the hierarchical clustering to group the genes in the gene
pool and then run SGL. The use of hierarchical clustering avoids the need
to have extensive knowledge about the genes in order to group them. We
noted that each data set had a different number of gene groups and each gene
32

Figure 4.1: The gene pool of data set GSE123818 formed by taking the union
of top important genes from ridge, lasso, elastic net, and droplasso.

33

Figure 4.2: The gene pool of data set GSE71585 formed by taking the union
of top important genes from the ridge, lasso, elastic net, and droplasso. Note
that for this data set, the top important genes of droplasso have no intersection with the other 3 algorithms. In the proposed algorithm, gene pool
is formed with the union of the top important genes from the 4 algorithms
rather than an intersection because there may not always be an intersection
due to the difference in regularization used.

34

group had different number of genes in them. This is acceptable given the
differences between cells and species. There is a reduction of 61% to 92% in
the number of genes in the gene pool compared to the whole for the 4 data set
used in this study. Since hierarchical clustering and SGL are executed with a
significantly reduced number of genes in the new algorithm, it enables us to
execute these steps with an 8 GB processor for two data sets (GSEGSE60749,
GSE71585) and get the same AUC as obtained with the 32 GB processor.
It needs to be mentioned here that this new computational algorithm can
be used for other high dimensional data sets as well for feature selection.
Furthermore, the grouping of genes in the proposed algorithm can be done
without expert knowledge of genes due to the use of hierarchical clustering.
We note that the AUC of the proposed algorithm shown in Tab. 4.5 is equal
to or better than that of SGL. The final selection of genes is found using a
genes vs coefficients plot of SGL fit. Fig. 4.4, 4.5, 4.6, and 4.7 show the
top important genes of 4 data sets. Note that these plots include the first 20
genes of the top important genes only for improved readability, and do not
show all the genes above the cut off. In the validation step, we use K-Means
clustering for cell clustering with the final selection of genes for each data
set. The cell clustering of 4 data sets are shown in Fig. 4.8, 4.9, 4.10, and
4.11.

4.3

Application

As shown in Fig. 4.8, the data set GSE60749 is clustered with two classes
well separated. Tab. 4.6 lists the final selection of genes for this data set.
The top important genes of this data set identified by the new algorithm are

35

Figure 4.3: Average cross-validated AUC for the 4 data sets. Even though
group lasso has better AUC than SGL, SGL is better in terms of gene selection. Selecting the differentially expressed genes is of more importance for a
scRNA-seq data set compared to the prediction performance via AUC.

Table 4.2: Comparison of algorithm using CV-AUC.
Algorithm

R Package

GSE60749

GSE71585

GSE81861

GSE123818

Sparse Group Lasso

SGL

1

0.98

0.92

0.83

Group Lasso

grplasso

1

0.98

0.87

0.99

Drop Lasso

droplasso

0.99

0.94

0.87

0.97

Big Lasso

biglasso

1

1

0.80

0.95

Lasso

glmnet

1

0.96

0.85

0.94

Elastic Net

glmnet

1

0.63

0.86

0.93

Ridge

glmnet

0.99

0.84

0.71

0.90

36

Algorithm

Table 4.3: Variance of CV-AUC.
R Package GSE60749 GSE71585

GSE81861

GSE123818

Sparse Group Lasso

SGL

0

0.0018

0.0018

0.0012

Group Lasso

grplasso

0

0.0018

0.0017

0.0000

Drop Lasso

droplasso

0.0004

0.0047

0.0006

0.0003

Big Lasso

biglasso

0

0

0.0061

0.0061

Lasso

glmnet

0

0.0029

0.0017

0.0003

Elastic Net

glmnet

0

0.0406

0.0021

0.0007

Ridge

glmnet

0.0004

0.0042

0.0089

0.0003

Table 4.4: Comparison of algorithm using computation time (Seconds).
Algorithm

R Package

GSE60749

GSE71585

GSE81861

GSE123818

Sparse Group Lasso

SGL

6.53

1.73

2.97

5.66

Group Lasso

grplasso

1.12

2.51

29.66

3.78

Drop Lasso

droplasso

13.57

7.18

59.33

3.36

Big Lasso

biglasso

3.11

4.77

7.23

20.30

Lasso

glmnet

3.18

2.76

13.39

48.54

Elastic Net

glmnet

3.57

2.66

13.59

51.51

Ridge

glmnet

58.07

26.71

3.77

17.58

Table 4.5: Comparison of performance (AUC) between SGL with all genes
and SGL using new algorithm.
data set

All Genes

Gene Pool

SGL

New algorithm

GSE60749

224444

5965

1

1

GSE71585

24058

5448

0.98

1

GSE81861

57241

5823

0.92

0.94

GSE123818

27629

10857

0.83

0.85

37

Figure 4.4: Genes vs coefficients plot of data set GSE60749. Here piRNA
44441 is the top important gene.

Figure 4.5: Genes vs coefficients plot of data set GSE71585. Calm2 and
Spap25 are the top important genes.

38

Figure 4.6: Genes vs coefficients plot of data set GSE81861. FABP1 and
SAT1 are the first 2 top important genes.

Figure 4.7: Genes vs coefficients plot of data set GSE123818. AT2G43610
and AT4G05320 are the top 2 genes.

39

44441, 44260, 44454, 44446, 44450, 44440, Pbld, Lifr, Hist2h4 and AK203176
as given in Tab. 4.6. One interesting finding is that 44441, 44260, 44454, and
44440 genes are non-coding RNAs called piRNA. This indicates a possibility
of an association between knockout of miRNA processing factor and piRNAs
which has not been studied before. The exact function of this RNA is still
unknown. piRNAs are found in humans and rats, with major clusters occurring in syntenic locations. Although their function must still be resolved,
the abundance of piRNAs in germline cells and the male sterility of Miwi
mutants suggest a role in gametogenesis Girard et al. [2006].
The new algorithm is able to cluster primary visual cortex cell groups in
the data set GSE71585 well (Fig. 4.9). Tab. 4.7 lists the final selection of
genes. The top 2 differentially expressed genes in Ntsr1 (neurotensin receptor
1) tdT (tdTomato - an exceptionally bright red fluorescent protein) positive
cells and Ntsr1 tdT negative cells are Calm2 and Snap25. Calm2 gene is
active in cortex, frontal lobe and a few other organs. It enables N-terminal
myristoylation domain binding, calcium ion and protein binding. It is active
in the pathways of Alzheimer’s disease and Glycogen Metabolism. As per
NCBI, several infants with severe forms of long-QT syndrome (LQTS) who
displayed life-threatening ventricular arrhythmias together with delayed neurodevelopment and epilepsy were found to have mutations in either this gene
or another member of the calmodulin gene family (cal [2022]). Snap25 gene
enables syntaxin-1 binding activity. It is present in 10 different biological
pathways. NCBI records show that it is used to study attention deficit hyperactivity disorder, obesity, schizophrenia, and type 2 diabetes mellitus (sna
[2022]). The human ortholog of this gene is implicated in Down syndrome
and congenital myasthenic syndrome 18. The strong association between the
lack of tdT protein in Ntsr1 cells and these genes identified by our computa40

Table 4.6: Final selection of genes from GSE60749
Gene

Function

Source

44441

piRNA. Function unknown.

ENA

44260

piRNA. Function unknown.

NCBI

44454

piRNA. Function unknown.

RNA Central

44446

Predicted gene. Function unknown.

RGD

44450

Predicted gene. Function unknown.

RGD

44440

piRNA. Function unknown.

piRNAdb

Pbld

Predicted to enable identical protein binding activity

NCBI

and isomerase activity. Predicted to be involved in
maintenance of gastrointestinal epithelium; negative
regulation of SMAD protein signal transduction; and
negative regulation of transforming growth factor beta
receptor signaling pathway.
Lifr

Predicted to enable several functions, including ciliary

NCBI

neurotrophic factor receptor binding activity; growth
factor binding activity; and leukemia inhibitory factor
receptor activity. This gene has also been discussed in
9 pathways including ESC pluripotency pathways.
Hist2h4

It encodes a replication-dependent histone that is a

NCBI

member of the histone H4 family (basic nuclear proteins responsible for the nucleosome structure of the
chromosomal fiber in eukaryotes). This gene is found
in Type II interferon signaling (IFNG) pathway.
AK203176 Predicted to enable GTP binding activity; double-

NCBI

stranded RNA binding activity; and ubiquitin protein
ligase binding activity. Acts upstream of or within cellular response to interleukin-4.

41

Figure 4.8: Cell clustering with final selection of genes for data set GSE60479.
The top important gene (piRNA 44441) alone can differentiate two cell groups
perfectly.
tional algorithm merits further study.
GSE81861 data set cell groups are clustered with some overlap in classes
(Fig. 4.10). Tab. 4.8 lists the final selection of genes for this data set. The
top important genes in this data set are FABP1, SAT1, PHGR1, LGALS4,
FRYL, MT1E, HSP90AA1, and HNRNPH1. FABP1 gene enables long-chain
fatty acids binding activity. It is involved in 13 biological pathways including
metabolism, and Peroxisome proliferator-activated receptor (PPAR) signaling pathway fab [2022]. The second gene SAT1 is also involved in 13 biological
pathways including metabolism. It catalyzes the acetylation of spermidine
and spermine. Defects in this gene are associated with keratosis follicularis
spinulosa decalvans (KFSD) sat [2022]. LGALS4 gene codes galectins which
are a family of beta-galactoside-binding proteins implicated in modulating
42

Table 4.7: Final selection of genes from GSE71585
Gene

Function

Source

Calm2

This gene enables calcium-dependent pro-

NCBI

tein binding activity.

It is involved in

Alzheimer’s disease pathway and Glycogen
metabolism pathway. It is also involved in
several important processes including regulation of response to tumor cell. Human
ortholog of this gene is implicated in long
QT syndrome 15.
Snap25

This gene enables syntaxin-1 binding activ-

NCBI

ity. It is used to study attention deficit hyperactivity disorder; obesity; schizophrenia;
and type 2 diabetes mellitus. Human ortholog of this gene is implicated in Down
syndrome and congenital myasthenic syndrome 18. It is found in 10 different pathways.
0610005C13Rik

This gene is expressed in several structures,

NCBI

including heart; intestine; liver; lung; and
metanephros.
0610007C21Rik

This gene is replaced with name Atraid. It

NCBI

is predicted to be involved in several processes, including negative regulation of osteoblast proliferation; positive regulation of
bone mineralization; and positive regulation
of osteoblast differentiation.
43

Figure 4.9: Cell Clustering (K-Means) with final selection of genes for data
set GSE71585. The top important genes Snap25 and Calm2 are able to
cluster two cell groups.
cell-cell and cell-matrix interactions. This gene is underexpressed in colorectal cancer lga [2022]. Similarly, HSP90AA1 gene is an important gene found
in 115 pathways such as signaling by EGFR, EGFRvIII, and ERBB2in Cancer hsp [2022]. HNRNPH1 gene found in 12 pathways may be associated
with hereditary lymphedema type I. Knockdown of heterogeneous nuclear ribonucleoprotein H1 (HNRNPH1) by siRNA inhibits the early stages of HIV1 replication in 293T cells infected with VSV-G pseudotyped HIV-1 (hnr
[2022]). Looking at Tab. 4.8, we also find a remarkable connection between
the top important genes and HIV-1. Four genes (SAT1, MT1E, HSP90AA1,
and HNRNPH1) out of 8 important genes from a colorectal tumor also have
strong interaction with HIV-1 proteins. The association between cancer and
HIV-1 has been widely studied by medical researchers (König et al. [2008],
Nunnari et al. [2008], Corbeil et al. [1995], Alfano et al. [2013]). Evidently,
44

Figure 4.10: Cell Clustering (K-Means) with final selection of genes for data
set GSE81861. There is some overlap between cell groups along the axis of
top important genes SAT1 and FABP1.
the new algorithm is able to select a highly relevant subset of genes from the
given samples of human colorectal cancer cells.
The cell clusters in GSE123818 have more overlap than the rest of the
data sets. The clustering result is shown in Fig. 4.11. Tab. 4.9 lists the final
selection of genes. The top important genes of this data set as found with the
new algorithm are AT2G43610, AT4G05320, AT2G07698, and AT3G51750.
One of five polyubiquitin genes in Arabidopsis thaliana, AT2G43610 gene is
found in growth and developmental stages such as root development (AT2
[2022a]). AT4G05320 gene encodes the highly conserved 76-amino acid protein ubiquitin which is attached to proteins targeting degradation (AT4
[2022]). AT2G07698 gene is expressed during the seed development stage
(AT2 [2022b]). AT3G51750 codes a hypothetical protein that is involved

45

Table 4.8: Final selection of genes from GSE81861
Gene

Function

Source

FABP1

This gene encodes the fatty acid binding protein found

NCBI

in liver. Biological pathways - 13, such as Peroxisome
proliferator-activated receptor (PPAR) signaling pathway.
SAT1

The protein encoded by this gene is a rate-limiting en-

NCBI

zyme in the catabolic pathway of polyamine metabolism.
Biological pathways - 13. It has HIV-1 interaction and
KFSD.
PHGR1

It is a protien coding gene with biased expression in

NCBI

colon and small intestine.
LGALS4

The galectins are implicated in modulating cell-cell and

NCBI

cell-matrix interactions. The expression of this gene is
restricted to small intestine, colon, and rectum. It is
underexpressed in colorectal cancer.
FRYL

This gene is predicted to be involved in cell morphogen-

NCBI

esis and neuron projection development. It is predicted
to be active in the site of polarized growth.
MT1E

Biological pathways - 5, such as Zinc homeostasis, Cop-

NCBI

per homeostasis. HIV-1 Tat upregulates the interferonresponsive gene expression of Metallothionein, an effect
that likely facilitates the expansion of HIV-1 infection.
HSP90AA1 The protein encoded by this gene aids in the proper

NCBI

folding of specific target proteins. Biological pathways
- 115, such as programmed cell death, innate immune
system. It has strong interactions with HIV-1 proteins.
HNRNPH1 This gene may be associated with hereditary lymphedema type I. Biological pathways - 12, such as
mRNA processing. Knockdown of HNRNPH1 inhibits
the early stages of HIV-1 replication in 293T cells.

NCBI
46

Figure 4.11: Cell Clustering (K-Means) with final selection of genes for data
set GSE123818. There is notable overlap between the short root knockout
and wild type cell groups of Arabidopsis Thaliana.
in root and seed development (AT3 [2022]). All the top important genes
selected by the new computational algorithm are related to growth and developmental stages in Arabidopsis thaliana. We recommend further study of
these genes in relation to root development and degradation.

47

Table 4.9: Final selection of genes from GSE123818
Gene

Function

Source

AT2G43610

This gene is found in growth and devlopmen-

TAIR

tal stages such as root development. It enables
chitinase activity and protein binding.
AT4G05320

One of five polyubiquitin genes in Arabidopsis

TAIR

thaliana. These genes encode the highly conserved 76-amino acid protein ubiquitin that is
covalently attached to substrate proteins targeting most for degradation. This gene enables
mRNA binding, protein tag, ubiquitin protein
ligase binding. The mRNA is cell-to-cell mobile.
AT2G07698

This gene is expressed in growth and develop-

TAIR

mental stages such as seed and seedling development. It enables ADP binding, ATP binding,
poly(U) RNA binding, and zinc ion binding.
AT3G51750

This gene expressed during initial leaves visi-

TAIR

ble stages and flowering stages. The biological
processes associated with this gene are cellular
lipid metabolic process, response to inorganic
substance, response to light stimulus, root development, and seed development.

48

Chapter 5
Conclusion
This research is conducted to address the need for a comparative analysis of
penalized regression algorithms for scRNA-seq data. The algorithms chosen
to study were ridge, Lasso, Elastic Net, Drop Lasso, Group Lasso, Sparse
Group Lasso and Big Lasso regression. The research used a varied set of
scRNA-seq data from 3 different species (Mus Musculus, Homo Sapiens, and
Arabidopsis Thaliana) for analysis. The size of the data set varied from 63
MB to 252 MB. At the end of the analysis, it was found that the Sparse Group
Lasso performed better compared to other methods in terms of average CVAUC, computation time and selection of differentially expressed genes.
This study explored the possibility of improving the performance of the
top algorithm by combining it with others. Upon analysing top important
genes from all algorithms, it was found that the selection of genes may vary
between algorithms to the point that there may not be any intersection in
top important genes from different algorithms. Based on this result, a union
of the top important genes from 4 algorithms (ridge, lasso, elastic net and
49

droplasso) was used to form a gene pool that had a significantly reduced
number of genes which was then used as the input to Sparse Group Lasso.
As evident from the results and discussion, the proposed algorithm that
uses the sparse group lasso with a reduced set of genes can select a highly
relevant subset of genes that are strongly associated with the cell clusters
in scRNA-seq samples. The CV-AUC of the proposed algorithm was found
better than that of the Sparse Group Lasso algorithm. Here we recognize
that lasso algorithms have many hyperparameters which can be customized
to arrive at different results. The output of grplasso and SGL packages can
also change depending on the number of groups in the input data and the
type of grouping used. This research has used scRNA-seq data from 3 species
that are most frequently used in biomedical research. However, the proposed
algorithm may need to be tested on data from more species.

5.1

Contributions

The major contributions of this thesis can be summarized as follows:
• The thesis produced a reliable guideline, about the comparative performance of penalized regression algorithms, which might be useful for
the bioinformatics research community.
• The thesis has proposed a new algorithm that is a combination of lasso
algorithms that showed better AUC that the top-performing lasso algorithm (SGL).
• The proposed algorithm does not require deep knowledge in grouping
of genes in scRNA-seq data and yet identified a highly relevant set of
50

differentially expressed genes.

5.2

Future perspective

This research can take different directions in future. Some of them are listed
below.

1. Current research can expand to include more algorithms and related R
packages, such as Seagull developed by Klosa et al. [2020] which also
implement lasso, group lasso and sparse group lasso algorithms.
2. Another direction worth exploring is verifying the R packages such as
msgl (Vincent and Hansen [2014]) which can implement multinomial
classification.
3. In the proposed algorithm, all of the training data is used as input
to each of the 4 lasso algorithms. Instead, training data itself can be
divided into 4 parts and one part each fed to the 4 lasso algorithms.
Theoretically, this should reduce computation further since each algorithm processes only one-fourth of the training data. How this additional step would affect relevant gene selection need to be studied.
4. A scRNA-seq data set with 1.3 million cells was published by 10x Genomics [2017]. According to 10x Genomics, this is the largest scRNAseq data set available as of date. Even this colossal data set is of size
3.93 GB only. This thesis research can be extended to such large data
sets when the scRNA-seq data set of Gigabyte (GB) size becomes more
commonly available in the future.
51

5. The performance of the proposed algorithm for scRNA-seq data with
highly imbalanced classes may be verified against other methods created for rare class identification such as the ensemble of phalaxes method
(Tomal et al. [2015, 2016]).

52

Bibliography
Shaked Slovin, Annamaria Carissimo, Francesco Panariello, Antonio
Grimaldi, Valentina Bouché, Gennaro Gambardella, and Davide Cacchiarelli. Single-cell rna sequencing analysis: a step-by-step overview. RNA
Bioinformatics, pages 343–365, 2021.
Vladimir Yu Kiselev, Tallulah S Andrews, and Martin Hemberg. Challenges
in unsupervised clustering of single-cell rna-seq data. Nature Reviews Genetics, 20(5):273–282, 2019.
Yasin Kaymaz, Florian Ganglberger, Ming Tang, Francesc Fernandez-Albert,
Nathan Lawless, and Timothy Sackton. Hierfit: Hierarchical random forest
for information transfer. bioRxiv, 2020.
Maziyar Baran Pouyan and Dennis Kostka. Random forest based similarity
learning for single cell rna sequencing data. Bioinformatics, 34(13):i79–i88,
2018.
Xue-wen Chen and Jong Cheol Jeong. Enhanced recursive feature elimination. In Sixth International Conference on Machine Learning and Applications (ICMLA 2007), pages 429–435. IEEE, 2007.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal

53

of the Royal Statistical Society: Series B (Methodological), 58(1):267–288,
1996.
Hui Zou and Trevor Hastie. Regularization and variable selection via the
elastic net. Journal of the royal statistical society: series B (statistical
methodology), 67(2):301–320, 2005.
Robert Tibshirani. Regression shrinkage and selection via the lasso: a retrospective series b statistical methodology. Journal of the Royal Statistical
Society, 73(3):273–282, 2011.
Beyrem Khalfaoui and Jean-Philippe Vert. Droplasso: A robust variant of
lasso for single cell rna-seq data. arXiv preprint arXiv:1802.09381, 2018.
Ming Yuan and Yi Lin. Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A
sparse-group lasso. Journal of computational and graphical statistics, 22
(2):231–245, 2013.
Yaohui Zeng and Patrick Breheny. The biglasso package: A memory-and
computation-efficient solver for lasso model fitting with big data in r. arXiv
preprint arXiv:1701.05936, 2017.
Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith
Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.
54

Yuan Jiang, Yunxiao He, and Heping Zhang. Variable selection with prior
information for generalized linear models via the prior lasso method. Journal of the American Statistical Association, 111(513):355–376, 2016.
Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C
Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William
FitzHugh, et al. Initial sequencing and analysis of the human genome.
2001.
Jason A. Reuter, Damek V. Spacek, and Michael P. Snyder. High-throughput
sequencing technologies. Molecular Cell, 58(4):586–597, 2015. ISSN 10972765. doi: https://doi.org/10.1016/j.molcel.2015.05.004. URL https://
www.sciencedirect.com/science/article/pii/S1097276515003408.
Geng Chen, Baitang Ning, and Tieliu Shi. Single-cell rna-seq technologies and related computational data analysis. Frontiers in Genetics, 10,
2019. ISSN 1664-8021. doi: 10.3389/fgene.2019.00317. URL https:
//www.frontiersin.org/article/10.3389/fgene.2019.00317.
Simone Picelli, Åsa K Björklund, Omid R Faridani, Sven Sagasser, Gösta
Winberg, and Rickard Sandberg. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nature methods, 10(11):1096–1098, 2013.
Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik
Shekhar, Melissa Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genome-wide expression
profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–
1214, 2015.
Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin,
Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler,
55

Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):1–12, 2017.
Saiful Islam, Una Kjällquist, Annalena Moliner, Pawel Zajac, Jian-Bing Fan,
Peter Lönnerberg, and Sten Linnarsson. Characterization of the single-cell
transcriptional landscape by highly multiplex rna-seq. Genome research,
21(7):1160–1167, 2011.
Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley
interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.
Petros Xanthopoulos, Panos M Pardalos, and Theodore B Trafalis. Linear
discriminant analysis. In Robust data mining, pages 27–33. Springer, 2013.
Nguyen PT Huynh, Natalie H Kelly, Dakota B Katz, Minh Pham, and
Farshid Guilak. Single cell rna sequencing reveals heterogeneity of human msc chondrogenesis: Lasso regularized logistic regression to identify
gene and regulatory signatures. bioRxiv, page 854406, 2019.
Xiaowen Cao, Li Xing, Elham Majd, Hua He, Junhua Gu, and Xuekui Zhang.
A systematic evaluation of methods for cell phenotype classification using
single-cell rna sequencing data. arXiv preprint arXiv:2110.00681, 2021.
Antonio Scialdone, Kedar N Natarajan, Luis R Saraiva, Valentina Proserpio,
Sarah A Teichmann, Oliver Stegle, John C Marioni, and Florian Buettner.
Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods, 85:54–61, 2015.
Jiao Hua, Hongkun Liu, Boyang Zhang, and Shuilin Jin. Lak: lasso and
k-means based single-cell rna-seq data clustering analysis. IEEE Access,
8:129679–129688, 2020a.
56

Héctor Climente-González, Chloé Agathe Azencott, Samuel Kaski, Makoto
Yamada, et al. Block HSIC Lasso: model-free biomarker detection for
ultra-high dimensional data. Bioinformatics, 35(14):i427–i435, 07 2019.
ISSN 1367-4803. doi: 10.1093/bioinformatics/btz333. URL https://doi.
org/10.1093/bioinformatics/btz333.
Wei Li, Jianxing Feng, and Tao Jiang. Isolasso: a lasso regression approach to
rna-seq based transcriptome assembly. Journal of Computational Biology,
18(11):1693–1707, 2011.
Lai Jiang, Celia MT Greenwood, Weixin Yao, and Longhai Li. Bayesian
hyper-lasso classification for feature selection with application to endometrial cancer rna-seq data. Scientific reports, 10(1):1–16, 2020.
Guo-Hua Huang, Yu-Hang Zhang, Lei Chen, You Li, Tao Huang, and YuDong Cai. Identifying lung cancer cell markers with machine learning
methods and single-cell rna-seq data. Life, 11(9):940, 2021.
Jie Zheng and Ke Wang. Emerging deep learning methods for single-cell
rna-seq data analysis. Quantitative Biology, 7(4):247–254, 2019.
Sergio Oller-Moreno, Karin Kloiber, Pierre Machart, and Stefan Bonn. Algorithmic advances in machine learning for single-cell expression analysis.
Current Opinion in Systems Biology, 25:27–33, 2021.
Chieh Lin, Siddhartha Jain, Hannah Kim, and Ziv Bar-Joseph. Using neural
networks for reducing the dimensions of single-cell rna-seq data. Nucleic
acids research, 45(17):e156–e156, 2017.
K Asada, K Takasawa, H Machino, S Takahashi, N Shinkai, A Bolatkan,
K Kobayashi, M Komatsu, S Kaneko, K Okamoto, et al. Single-cell analysis
57

using machine learning techniques and its application to medical research.
biomedicines 2021, 9, 1513, 2021.
Jose Alquicira-Hernandez, Anuja Sathe, Hanlee P Ji, Quan Nguyen, and
Joseph E Powell. scpred: accurate supervised method for cell-type classification from single-cell rna-seq data. Genome biology, 20(1):1–17, 2019.
Tian Tian, Ji Wan, Qi Song, and Zhi Wei. Clustering single-cell rna-seq data
with a model-based deep learning approach. Nature Machine Intelligence,
1(4):191–198, 2019.
Haidong Yan, Qi Song, Jiyoung Lee, John Schiefelbein, and Song Li. Identification of cell-type marker genes from plant single-cell rna-seq data using
machine learning. bioRxiv, 2020.
Brian Aevermann, Yun Zhang, Mark Novotny, Mohamed Keshk, Trygve
Bakken, Jeremy Miller, Rebecca Hodge, Boudewijn Lelieveldt, Ed Lein,
and Richard H Scheuermann. A machine learning method for the discovery of minimum marker gene combinations for cell type identification from
single-cell rna sequencing. Genome research, 31(10):1767–1780, 2021.
Mary Qu Yang, Sherman M Weissman, William Yang, Jialing Zhang, Allon
Canaann, and Renchu Guan. Misc: missing imputation for single-cell rna
sequencing data. BMC systems biology, 12(7):55–63, 2018.
Ke-Xu Xiong, Han-Lin Zhou, Jian-Hua Yin, Karsten Kristiansen, HuanMing Yang, and Gui-Bo Li. Chord: Identifying doublets in single-cell
rna sequencing data by an ensemble machine learning algorithm. bioRxiv,
2021.
Yongli Hu, Takeshi Hase, Hui Peng Li, Shyam Prabhakar, Hiroaki Kitano,
See Kiong Ng, Samik Ghosh, and Lawrence Jin Kiat Wee. A machine
58

learning approach for the identification of key markers involved in brain
development from single-cell transcriptomic data. BMC genomics, 17(13):
19–29, 2016.
Raphael Petegrosso, Zhuliu Li, and Rui Kuang. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in
bioinformatics, 21(4):1209–1223, 2020.
Aristidis G Vrahatis, Sotiris K Tasoulis, Ilias Maglogiannis, and Vassilis P
Plagianakos. Recent machine learning approaches for single-cell rna-seq
data analysis. In Advanced Computational Intelligence in Healthcare-7,
pages 65–79. Springer, 2020.
Hui Zou and Trevor Hastie. Regression shrinkage and selection via the elastic
net, with applications to microarrays. JR Stat Soc Ser B, 67:301–20, 2003.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks
from overfitting. The journal of machine learning research, 15(1):1929–
1958, 2014.
Yogita Rani1 and Harish Rohil. A study of hierarchical clustering algorithm.
ter S & on Te SIT, 2:113, 2013.
John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the royal statistical society. series c (applied
statistics), 28(1):100–108, 1979.
Jiao Hua, Hongkun Liu, Boyang Zhang, and Shuilin Jin. Lak: lasso and
k-means based single-cell rna-seq data clustering analysis. IEEE Access,
8:129679–129688, 2020b.
59

Stephen Bates, Trevor Hastie, and Robert Tibshirani.

Cross-validation:

what does it estimate and how well does it do it?

arXiv preprint

arXiv:2104.00673, 2021.
Seong Ho Park, Jin Mo Goo, and Chan-Hee Jo. Receiver operating characteristic (roc) curve: practical review for radiologists. Korean journal of
radiology, 5(1):11–18, 2004.
Mohammad Hossin and Md Nasir Sulaiman. A review on evaluation metrics
for data classification evaluations. International journal of data mining &
knowledge management process, 5(2):1, 2015.
Charlotte Soneson and Mark D Robinson. Bias, robustness and scalability
in differential expression analysis of single-cell rna-seq data. bioRxiv, page
143289, 2017.
Roshan M Kumar, Patrick Cahan, Alex K Shalek, Rahul Satija, A Jay DaleyKeyser, Hu Li, Jin Zhang, Keith Pardee, David Gennert, John J Trombetta, et al. Deconstructing transcriptional heterogeneity in pluripotent
stem cells. Nature, 516(7529):56–61, 2014.
Bosiljka Tasic, Vilas Menon, Thuc Nghi Nguyen, Tae Kyung Kim, Tim
Jarsky, Zizhen Yao, Boaz Levi, Lucas T Gray, Staci A Sorensen, Tim
Dolbeare, et al. Adult mouse cortical cell taxonomy revealed by single cell
transcriptomics. Nature neuroscience, 19(2):335–346, 2016.
Huipeng Li, Elise T Courtois, Debarka Sengupta, Yuliana Tan, Kok Hao
Chen, Jolene Jie Lin Goh, Say Li Kong, Clarinda Chua, Lim Kiat Hon,
Wah Siew Tan, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors.
Nature genetics, 49(5):708–718, 2017.
60

Tom Denyer, Xiaoli Ma, Simon Klesen, Emanuele Scacchi, Kay Nieselt, and
Marja CP Timmermans. Spatiotemporal developmental trajectories in the
arabidopsis root revealed using high-throughput single-cell rna sequencing.
Developmental cell, 48(6):840–852, 2019.
Angélique Girard, Ravi Sachidanandam, Gregory J Hannon, and Michelle A
Carmell. A germline-specific class of small rnas binds mammalian piwi
proteins. Nature, 442(7099):199–202, 2006.
Calm2 calmodulin 2 [mus musculus (house mouse)] - gene - ncbi, 2022. URL
https://www.ncbi.nlm.nih.gov/gene/12314.
Snap25 synaptosomal-associated protein 25 [Mus musculus (house mouse)] Gene - NCBI, 2022. URL https://www.ncbi.nlm.nih.gov/gene/20614.
Fabp1 fatty acid binding protein 1, liver [Mus musculus (house mouse)] Gene - NCBI, 2022. URL https://www.ncbi.nlm.nih.gov/gene/14080.
SAT1 spermidine/spermine N1-acetyltransferase 1 [Homo sapiens (human)]
- Gene - NCBI, 2022. URL https://www.ncbi.nlm.nih.gov/gene/6303.
LGALS4 galectin 4 [Homo sapiens (human)] - Gene - NCBI, 2022. URL
https://www.ncbi.nlm.nih.gov/gene/3960.
HSP90AA1 heat shock protein 90 alpha family class A member 1 [Homo
sapiens (human)] - Gene - NCBI, 2022. URL https://www.ncbi.nlm.
nih.gov/gene/3320.
HNRNPH1 heterogeneous nuclear ribonucleoprotein H1 [Homo sapiens (human)] - Gene - NCBI, 2022. URL https://www.ncbi.nlm.nih.gov/gene/
3187.

61

Renate König, Yingyao Zhou, Daniel Elleder, Tracy L Diamond, Ghislain MC
Bonamy, Jeffrey T Irelan, Chih-yuan Chiang, Buu P Tu, Paul D De Jesus,
Caroline E Lilley, et al. Global analysis of host-pathogen interactions that
regulate early-stage hiv-1 replication. Cell, 135(1):49–60, 2008.
Giuseppe Nunnari, Johanna A Smith, and René Daniel. Hiv-1 tat and aidsassociated cancer: targeting the cellular anti-cancer barrier? Journal of
Experimental & Clinical Cancer Research, 27(1):1–8, 2008.
Jacques Corbeil, Louise A Evans, Paul W McQueen, Eva Vasak, Paul D
Edward, Douglas D Richman, Ronald Penny, and David A Cooper. Productive in vitro infection of human umbilical vein endothelial cells and
three colon carcinoma cell lines with hiv-1. Immunology and cell biology,
73(2):140–145, 1995.
Massimo Alfano, Francesca Graziano, Luca Genovese, and Guido Poli.
Macrophage polarization at the crossroad between hiv-1 infection and cancer development. Arteriosclerosis, thrombosis, and vascular biology, 33(6):
1145–1152, 2013.
The arabidopsis information resource (tair), 2022a.

URL https:

//www.arabidopsis.org/servlets/TairObject?type=locus&name=
At2g43610. on: www.arabidopsis.org.
The arabidopsis information resource (tair), 2022.

URL https://www.

arabidopsis.org/servlets/TairObject?type=locus&id=126703.

on:

www.arabidopsis.org.
The arabidopsis information resource (tair), 2022b.

URL https:

//www.arabidopsis.org/servlets/TairObject?type=locus&name=
At2g07698. on: www.arabidopsis.org.
62

The arabidopsis information resource (tair),

2022.

URL https:

//www.arabidopsis.org/servlets/TairObject?type=locus&name=
At3g51750. on: www.arabidopsis.org.
Jan Klosa, Noah Simon, Pål Olof Westermark, Volkmar Liebscher, and Dörte
Wittenburg. Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent. BMC
bioinformatics, 21(1):1–8, 2020.
Martin Vincent and Niels Richard Hansen. Sparse group lasso and high
dimensional multinomial classification. Computational Statistics & Data
Analysis, 71:771–786, 2014.
10x Genomics.

1.3 million brain cells from e18 mice. single

cell gene expression dataset by cell ranger 1.3.0., 2017.

URL

https://support.10xgenomics.com/single-cell-gene-expression/
datasets/1.3.0/1M_neurons?
Jabed H Tomal, William J Welch, and Ruben H Zamar. Ensembling classification models based on phalanxes of variables with applications in drug
discovery. The Annals of Applied Statistics, 9(1):69–93, 2015.
Jabed H Tomal, William J Welch, and Ruben H Zamar. Exploiting multiple descriptor sets in qsar studies. Journal of Chemical Information and
Modeling, 56(3):501–509, 2016.

63

Appendix A
Source Codes
The R codes for the proposed method is in the GitHub Repository below,
https://github.com/bhavithry/Benchmarking-LASSO-R

64