THOMPSON RIVERS UNIVERSITY

Machine Learning and Patient Partner Engagement to Predict the Usage of Home Care Services

By Robin Teotia

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in Data Science

KAMLOOPS, BRITISH COLUMBIA

April 2022

Supervisors: Dr. Piper Jackson and Dr. Jabed Tomal

ABSTRACT

This research is a comparative analysis of the application of different machine learning methods to health care data to predict home care usage, carried out in consultation with patient partners. The data are from the interRAI Home Care (RAI-HC) assessment instrument, collected in central British Columbia, Canada. The original data set contains 837,536 records with 423 attributes, gathered from 2010 to 2019. Models were developed to predict the average hours per day of home care service use in the three weeks following an assessment, using different regression and classification methods. For regression, I used multiple linear regression, lasso, ridge, decision tree, and ensemble methods, of which the last appeared the most promising. For classification, I used KNN, logistic regression, decision tree, and ensemble methods. Apart from the machine learning algorithms, both patient partners and health care experts participated and provided feedback regarding home care practices and issues. This feedback formed an essential element in designing the research questions, selecting variables, and improving the models. The ensemble methods, namely Random Forests and Bagged Trees, were found promising for both regression and classification problems. Random Forests achieved the largest R2 (0.53) in predicting the average hours per day. For classification, the largest accuracy and ROC AUC scores are 0.96 and 0.97, respectively, obtained from the Random Forests and Bagging algorithms.

Key Words: machine learning; healthcare; ensemble methods; Random Forests; k-nearest neighbors; patient-oriented research; feature selection.

ACKNOWLEDGEMENTS

I am very grateful to Dr. Piper Jackson, Dr. Shannon Freeman, Dr. Jabed Tomal, and Dr. Yan Yan for their invaluable guidance towards the development of this thesis. I would like to thank the BC Support Unit for funding this research, and for investing in advancing data science using a patient-oriented research approach in British Columbia. I would also like to thank Holly Buhler at Interior Health for her support and guidance in working with this data. Finally, I am very grateful for the many contributions and generous feedback about the predictive model from our patient partners: Brent Baker, Ivy Muturi, S. Carl Zanon, Susan Prior, and Grace D. Kramer.

Contents

1 Introduction
1.1 Health Care Data Management and Analysis
1.2 Project Goal and Patient Partners Contribution
1.3 Machine Learning with the Health Data
1.3.1 Research Questions

2 Background
2.1 Home Care
2.2 Health Care and Machine Learning
2.3 K-Nearest-Neighbors
2.4 Decision Tree and Ensemble Learning
2.5 Multiple Linear Regression
2.6 Ridge and Lasso
2.7 Cross-Validation (CV)
2.8 Quantiles and Percentiles
2.9 Random Sampling
2.10 Evaluation Metrics
2.10.1 Confusion Matrix
2.10.2 ROC AUC Curve
2.10.3 R2 and MSE

3 Data
3.1 Basis for Splitting the Target
3.2 Data Cleaning and Preparation

4 Methodology
4.1 Environment Setup for Acquiring and Processing Data
4.2 Data Wrangling, Visualization, and Exploratory Analysis
4.3 Data Preparation and Cleaning
4.4 Dichotomizing the Response Variable using Quantiles and Percentiles
4.5 Encoding of the Data
4.6 The Process of Feature Selection
4.6.1 Finding Independent Variables
4.6.2 Recursive Feature Elimination with Random Forests
4.6.3 Feature Selection with Regularization Method of Lasso
4.7 Machine Learning Cross-Validation Methods
4.8 Machine Learning for Regression Problems
4.9 Machine Learning for Classification Problems
4.10 Computational Resources

5 Results
5.1 Selected Features
5.2 Results for Regression Problem
5.3 Results for Classification Problems
5.3.1 Classification Results for Mean
5.3.2 Classification Results for Median
5.3.3 Classification Results for 75th Percentile
5.3.4 Classification Results for 90th Percentile
5.3.5 Classification Results for 95th Percentile

6 Discussion

7 Conclusion

A Program Code

List of Figures

1.1 Flow chart of the proposed methodologies.
2.1 K-fold cross-validation (k = 4).
2.2 Stratified k-fold cross-validation.
2.3 Over-sampling and under-sampling.
2.4 Confusion matrix for the binary classification problem.
2.5 ROC AUC curve. The x-axis represents the false positive rate (FPR) while the y-axis denotes the true positive rate (TPR). The blue area under the red curve (ROC) is the AUC.
3.1 Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the mean ("Low" is below the mean; "High" is above the mean).
3.2 Boxplot of the response variable (Average Hours Per Day). Note the outliers which are more than 24 hours (per day).
3.3 Histogram of the response variable (Average Hours Per Day).
3.4 Histogram of the response variable after zooming in.
3.5 Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 90th percentile ("Low" is below the 90th percentile; "High" is above the 90th percentile).
3.6 Flow chart of the data cleaning and preparation.
4.1 Connection of Python with MS-SQL.
4.2 Distribution of the response variable after removing outliers.
4.3 Boxplot of the response variable after removing outliers.
5.1 R2 for the regression problem, which was used to predict the target (Average Hours Per Day).
5.2 MSE for the regression problem, which was used to predict the target (Average Hours Per Day).
5.3 Accuracy of classification algorithms when the target is dichotomized using the mean.
5.4 KNN with shuffle-split CV gives its highest accuracy of 0.787 when the K value is 5 for the given range of K.
5.5 KNN with 10-fold CV gives its highest accuracy of 0.783 when the K value is 5 or 9 for the given range of K.
5.6 ROC AUC scores of classification when the target is dichotomized using the mean.
5.7 ROC AUC scores of classification when the target is dichotomized using the median.
5.8 ROC AUC scores of classification when the target is dichotomized using the 75th percentile.
5.9 ROC AUC scores of classification when the target is dichotomized using the 90th percentile.
5.10 ROC AUC scores of classification when the target is dichotomized using the 95th percentile.

List of Tables

3.1 Summary of the response variable.
4.1 Values before and after removing outliers from the response variable "Average Hours Per Day" when 0 ≤ Average Hours Per Day ≤ 24.
5.1 Prediction of the first 10 values of the response variable "Average Hours Per Day" using Random Forests. The bold predicted values are close to the actual values.
5.2 Classification evaluation using accuracy when the target is dichotomized using the median. The highest accuracies are highlighted in bold.
5.3 Confusion matrix for median classification for one fold within stratified 10-fold cross-validation.
5.4 Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 75th percentile.
5.5 75th percentile classification evaluation using accuracy. The highest accuracies are highlighted in bold.
5.6 Confusion matrix for 75th percentile classification for one fold within stratified 10-fold cross-validation.
5.7 Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 90th percentile.
5.8 90th percentile classification evaluation using accuracy. The highest accuracies are highlighted in bold.
5.9 Confusion matrix for 90th percentile classification for one fold within stratified 10-fold cross-validation.
5.10 Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 95th percentile.
5.11 95th percentile classification evaluation using accuracy. The highest accuracies are highlighted in bold.
5.12 Confusion matrix for 95th percentile classification for one fold within stratified 10-fold cross-validation.

Chapter 1

Introduction

Health care is a vital area of society, and many of the health problems that people experience require immediate care. With the advancement of technology and research, healthcare institutions are producing massive amounts of data, which require proper data management and analysis.

1.1 Health Care Data Management and Analysis

Data collection has become a vital part of every private or public organization. The health care industry records data in the form of patients' medical reports, medical histories, and medical results [Dash et al., 2019]. In order to address the health crisis, a proper analysis of these data is needed. This huge amount of data is an untapped wealth of information in health science that can potentially be harnessed by using machine learning (ML) algorithms to detect patterns and produce forecasts. It is essential to draw accurate inferences and information from the analysis of the available data through machine learning to address critical health issues. It is also important to be able to present findings to health authorities and policy makers so that they have a basis on which to formulate new policies and deploy new systems and devices to ensure the good health of society.

As the Canadian population ages, the health care system will be expected to meet increased demands and expectations, including a higher prevalence of persons living with chronic conditions and a common expectation to support Canadians to live at home as long as possible, as reported by the Canadian Institute for Health Information [2017]. In 2015-2016, 6.4% of Canadians (881,800 Canadian households) reported that one or more persons in their home had received home care services in the previous year, most often nursing services and personal/home supports [Gilmour, 2019]. With the growing number of individuals requiring care and support from community-based home care services, it has become paramount to ensure that the allocated resources are able to support Canadians to live and age in the right place at the right time. To develop evidence-based solutions to challenges in the health system and to inform more accurate forecasting of health service demands and utilization, proper organization and uptake of data are needed.

1.2 Project Goal and Patient Partners Contribution

The research forms a part of the goal of Canada's Strategy for Patient-Oriented Research (SPOR) [Canadian Institutes of Health Research, 2019]. Under this strategy, patient partners, researchers, domain experts, healthcare providers, and decision-makers actively participate and work together to develop a sustainable and accessible healthcare system and bring positive health changes into society. As mentioned in Canadian Institutes of Health Research [2019], the goals of patient-oriented research are:

1. To improve health.
2. To enhance access to the health care system.
3. To ensure the right treatment at the right time.
4. To be an active informer of health care.
5. To make pragmatic efforts and contributions to make the Canadian health system effective.
Active patient participation in studies will strengthen the significance of the research and its implementation into policy and practice. This results in improved and efficient health services and products, eventually boosting Canadians' quality of life and enhancing the Canadian healthcare system. Researchers should work in direct consultation with patient partners, domain experts, and health care experts in order to ensure they are going in the right direction, which is imperative to accomplish the goals of their project.

1.3 Machine Learning with the Health Data

This research is an exploration of how advanced data science methods can be used to improve health care decision making in our home region, central British Columbia. In particular, I am focusing on predicting transitions in the healthcare needs of older adults in our communities so that healthcare resources will be ready for them when they are needed. Machine learning is a group of statistical methods that use computation to build models from large amounts of data to predict a response variable of interest. Because machine learning methods are capable of providing results that are highly tailored to the characteristics of individual examples (in this case, health records), I expect that they will be useful for predicting the needs of this highly diverse populace.

In this research, Home and Community Care Resident Assessment Instrument (RAI-HC) health care data are used in combination with Resident Assessment Instrument (RAI) home care service use data. Patients receiving home health care are authorized a specified number of hours by the respective health institution, but patients often receive fewer or more hours according to the services or help they need. The conundrum, then, is finding the right allocation of hours for each patient, according to the help and services required by their health conditions. In order to answer and resolve the above-mentioned issues, I have formulated the following research questions.

1.3.1 Research Questions

1. Which method is most promising for predicting the number of hours per day, on average, a client needs in home care?
2. Which classification methods are good for classifying the dichotomized number of hours per day, on average, a client needs in home care?
3. Which features are significant for predicting a client's usage of home care services in the near future?

These research questions were improved and refined after a thorough consultation with patient partners and domain experts. Keeping these questions in mind, and with patient partners' and domain experts' advice, three weeks was identified as the critical time period following an assessment for any issues captured in that assessment to impact home care service use. Thus, the experiments described in this research predict home care service use in the three-week period following a home care assessment. The target is formulated as the average hours per day of service use from the start date of the assessment.

Figure 1.1 shows the flow chart of the process followed throughout the experiment. It shows how the data were prepared initially, and how regression and classification algorithms were then applied with different cross-validation techniques under different classification scenarios.

Figure 1.1: Flow chart of the proposed methodologies.
Further, it illustrates the sequence of the different machine learning algorithms and the comparison of results. The response variable was predicted using different machine learning regression algorithms: multiple linear regression, Ridge, Lasso, and ensemble techniques. The ensemble methods gave comparatively good results; in particular, Random Forests and Bagged Trees provided very promising results. For classification, the target was dichotomized on the basis of the mean, median, 75th, 90th, and 95th percentile values, and then different machine learning classification algorithms were deployed to calculate the classification accuracy and the area under the ROC curve. Among all the applied algorithms, the ensemble methods again showed good classification results. When the mean was used to dichotomize the target and form classes, Random Forests gave an accuracy of 0.84 using 10-fold cross-validation. When higher quantile values, the 90th and 95th percentiles, were used to dichotomize the target, the accuracy improved considerably. Bagged Trees and Random Forests produced promising results, with accuracies of 0.92 and 0.95 for the 90th and 95th percentiles, respectively. For the 95th percentile classification, both Random Forests and Bagged Trees achieved the largest ROC AUC score of 0.97. From this, it was concluded that higher quantile values generated comparatively good results with the interRAI home care assessment (RAI-HC) data set when compared to central values such as the mean and median.

This thesis is structured as follows: the next chapter provides some background information on the methods applied in this research and other relevant concepts. Next, I share information about the data and how it was organized. The adopted approaches for data preparation and experiments are then discussed in a chapter on methodology. Following that, I present my research findings and then discuss some of my observations, discoveries, challenges, and limitations. The thesis is completed with a conclusion; references and program code are also included at the end.

Chapter 2

Background

This chapter explains important terms such as home care and patient partner engagement in patient-oriented research. It also covers past research in the field of machine learning concerning health care. The different machine learning methods that form the basis of this research are also explained in this chapter.

2.1 Home Care

Home care incorporates a range of services, such as home assistance, that assist clients in remaining as self-sufficient as possible in their own homes. Community health workers provide home support services to clients who need personal assistance with the activities of daily living (ADL) [Ministry of Health, 2017]. These ADLs include:

• mobilization,
• nutrition,
• baths,
• lifts, and
• grooming and toileting.

Ministry of Health [2017] states that, when needed, home support services may include safety maintenance chores in addition to personal assistance. Clean-up, laundry or clothing care, and food preparation are examples of these activities. Furthermore, health care workers may be assigned by health care experts to perform specialized nursing and rehabilitative activities.

2.2 Health Care and Machine Learning

Home care is a core component of health service provision, and there is growing interest in it as an application field for machine learning, as can be seen in recent scientific literature.
Zhu et al. [2014] present three examples (using KNN, lasso regression, and Random Forests) to demonstrate how machine learning can fulfill multiple roles in home care research and clinical decision making, producing both predictions and explanatory insights. Cheng et al. [2015] applied lasso regression and Random Forests to find characteristics represented in RAI-HC data that have the potential to predict the need for rehabilitation services. Jones et al. [2018] worked on predicting emergency department and hospital use by applying ensemble methods and neural networks to multiple health databases; in their case, gradient boosting was the most effective method. Finally, Veyron et al. [2019] used several machine learning methods on functional status data collected by home care aides to predict emergency department visits, finding that Random Forests was the most effective method. Notably, these last two studies considered both regression and classification methods for predicting the outcomes of interest. As a group, these projects demonstrate that many different machine learning methods have the potential for effective use in home care, and hence the need to consider and compare multiple methods when developing a new study.

Mohammed et al. [2020] showed how to deal with imbalanced classes in a data set. They applied resampling algorithms to publicly available imbalanced data from Kaggle. Their main finding was that oversampling outperforms undersampling across different classifiers, giving better scores on distinct evaluation metrics; the reason is that undersampling leads to information loss [Mohammed et al., 2020]. In Jeatrakul and Wong [2009], a comparison of different neural network approaches is presented for binary classification. Back propagation neural networks (BPNN), radial basis function neural networks (RBFNN), general regression neural networks (GRNN), probabilistic neural networks (PNN), and complementary neural networks (CMTNN) were the five techniques compared on classification performance. The comparison relies on three benchmark data sets taken from the University of California at Irvine's machine learning repository. The results demonstrate that, among the compared strategies for binary classification problems, CMTNN generally produces the best classification results.

This study is a comparison of various standard machine learning methods and related techniques, examining how they could be applied to a real-world problem using complex health data. This included both classification and regression algorithms. I provide a short description of the methods here.

2.3 K-Nearest-Neighbors

K-nearest neighbors (KNN) is a machine learning algorithm for regression and classification. The rationale behind KNN is to use distance to locate the most likely value of the target feature. To this end, the k closest neighbours to the example under consideration are found. In classification, the most popular class among those neighbours is deemed the predicted class. KNN is a non-parametric strategy and does not assume any parametric distribution. It relies entirely upon instances, meaning it only takes the available data into consideration, with no generalization occurring. It is also known as a lazy learning algorithm, since (in principle) all of the steps and processes of the algorithm are performed at query time; the algorithm does not mandate any preprocessing. KNN usually works well with data of low dimensionality, but its efficacy decreases when the dimensionality of the data is large. In such cases, principal component analysis (PCA) may resolve the issue [Jain, 2021]. KNN can use different metrics to measure distance, such as the Euclidean, Manhattan, Chebyshev, Minkowski, and Mahalanobis distances. For this research, the Euclidean distance metric was used in all of the experiments. The Euclidean distance is given by the formula

$D(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$.

The value of k is optimized using the different cross-validation methods.
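As a concrete illustration, the following is a minimal sketch of KNN classification with the Euclidean metric in scikit-learn, the library used in this research; the synthetic data and the range of k values are placeholders, not the settings used in this study.

# A minimal KNN classification sketch; the synthetic data are placeholders
# for the encoded RAI-HC features and the dichotomized target.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# metric="euclidean" makes the distance choice explicit (the default
# Minkowski metric with p=2 is equivalent).
for k in [3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    scores = cross_val_score(knn, X, y, cv=10)  # 10-fold cross-validation
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")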
2.4 Decision Tree and Ensemble Learning

Decision trees are dominant and influential tools for classification and regression. The decision tree is a popular machine learning algorithm that is simple and easy to implement, and it is well known under the name Classification and Regression Trees (CART). The algorithm can deal with high-dimensional data and can work with numerical as well as categorical variables. Decision tree models are very easy to comprehend and interpret, hence they are extremely useful and well suited for exploratory data analysis. In the process of decision tree formation, recursive division of the data takes place until the target class variable in each division is as homogeneous as possible. The performance of decision trees can be enhanced with suitable attribute selection, so variable selection plays an important role in decision tree construction.

A simple decision tree may be unable to classify or predict the target at a depth of 3, 4, or 5. Increasing this depth, or using a combination of trees altogether, might give better results. This is where ensemble methods come into play. In the paradigm of ensemble learning, numerous estimators are trained whose individual accuracies are not necessarily good enough on their own; these are classified as weak estimators. However, when these estimators are used together as a group, they generate a robust model with outstanding performance. So, with ensemble methods, multiple estimators are aggregated to establish a single predictor with better performance. There are two essential aspects to be considered for promising prediction: low variance and low bias. In addition, the degrees of freedom of the model must be chosen by weighing two concerns: first, deterring high variance, and second, maintaining the robustness of the model [Rocca, 2021]. The bias-variance trade-off of a model is managed with the help of ensemble methods.

Under ensemble methods, there exist two major types of meta-algorithms that aim at combining weak learners: averaging and boosting. To serve our purpose and achieve better accuracy, both averaging and boosting strategies were used. In averaging, multiple models or classifiers are constructed and aggregated; Bagged Trees and Random Forests fall into this category. The methods of AdaBoost and Gradient Boosting come under boosting, in which multiple weak models are used collectively and iteratively to generate an improved model. The accuracy scores are recorded to measure the performance.
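To make the averaging idea concrete, here is a hedged sketch comparing a single decision tree with Bagged Trees and Random Forests on synthetic regression data; the data set and hyperparameter values are illustrative only.

# A sketch comparing a single decision tree with the two averaging
# ensembles used in this research; data and settings are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=30, noise=10.0, random_state=0)

models = {
    # A single tree: high variance, prone to over-fitting.
    "Decision tree": DecisionTreeRegressor(random_state=0),
    # Bagging: trees fit on bootstrap resamples, predictions averaged.
    "Bagged trees": BaggingRegressor(n_estimators=100, random_state=0),
    # Random Forests: bagging plus random feature sampling at each split.
    "Random Forests": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {r2:.3f}")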
2.5 Multiple Linear Regression

Multiple linear regression was the first machine learning regression algorithm applied to predict the response variable, the average hours per day of services used by home care clients. The matrix form of the multiple linear regression model is:

$Y = X\beta + \epsilon$.

Here, $Y$ represents the response vector, $X$ represents the design matrix (containing vectors of predictors along with a column vector of 1s to accommodate the intercept term), and $\epsilon$ represents the error vector. The assumption on the error vector is that it is normally distributed with mean 0 and constant variance $\sigma^2$ [Abraham and Ledolter, 2006]. Errors are independent and identically distributed, meaning that $\mathrm{cor}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$. This assumption implies that $Y$ has a normal distribution with mean $X\beta$ and constant variance $\sigma^2$; $Y$ is independent and identically distributed, that is, for any $y_i$ and $y_j$ in $Y$, $\mathrm{cor}(y_i, y_j) = 0$, where $i \neq j$ [Abraham and Ledolter, 2006]. The above model can also be written as:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_p x_p + \epsilon$.

Here, $\beta_0$ is the intercept and the other $\beta_j$ are the slopes. The main task is to find these coefficients and put them into this equation to predict the response. The independence of the predictor variables is imperative, meaning that they should not be correlated: one predictor variable cannot be written as a linear combination of the other variables.

To check whether the model is well fitted, these steps are taken: first, examining the residuals vs. fitted plot, and second, checking the normal QQ-plot.

Residuals vs. Fitted Plot: In multiple linear regression, the residuals are plotted against the predicted or fitted values $X\hat{\beta}$. For the model to be a good fit, the residuals should not show any discernible patterns against the fitted values [James et al., 2013]. In other words, the errors should have the same unknown variance, which is also called the homoscedasticity assumption. The characteristics of a good residuals vs. fitted plot, as explained by Cran.R [2021], are:

1. The residuals are randomly distributed around the 0 line, showing a linear relationship.
2. The residuals constitute a roughly horizontal band around the 0 line, suggesting homogeneity of the error variance.
3. There are no outliers in the residuals.

If there exists a pattern in the residuals, it implies a problem with the linear model. If the residuals vs. fitted plot indicates a visible pattern or a non-linear association in the data, then a simple approach is to apply a non-linear transformation of the predictors, such as $\log x$, $\sqrt{x}$, or $x^2$ [James et al., 2013].

Normality of the residuals: The residuals should be normally distributed. The QQ-plot of the residuals is a good way to check for any violation of the normality assumption. In this plot, the residuals should fall close to the expected normal quantiles; if the residuals lie away from the diagonal line, there is a violation of the normality assumption. Histograms are also used to check the normality of the residuals: if the curve drawn from the histogram follows a normal distribution, the residuals satisfy the normality assumption.

The R2 and the mean squared error are calculated to check the performance of the regression model. The R2 is given by the formula:

$R^2 = 1 - \frac{RSS}{TSS}$,

where RSS represents the residual sum of squares and TSS represents the total sum of squares. With the addition of variables to the model, the R2 value tends to increase; to overcome this, the adjusted R2 is calculated. In addition, the mean squared error (MSE) is also calculated to evaluate a model.
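The residual diagnostics described above can be produced in a few lines. The sketch below uses the statsmodels and Matplotlib packages on synthetic placeholder data; statsmodels is an assumption here, as it is not among the libraries listed for this project.

# A sketch of fitting multiple linear regression and checking the residual
# assumptions; the synthetic X and y are placeholders.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 1.0 + X @ np.array([0.5, -0.2, 0.8]) + rng.normal(scale=0.3, size=500)

model = sm.OLS(y, sm.add_constant(X)).fit()  # add_constant adds the intercept

# Residuals vs. fitted: look for a patternless horizontal band around 0.
plt.scatter(model.fittedvalues, model.resid, s=5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normal QQ-plot: residuals should fall close to the reference line.
sm.qqplot(model.resid, line="s")
plt.show()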
2.6 Ridge and Lasso

Ridge and Lasso are popular regression methods that shrink the coefficients of the variables. Ridge regression imposes a penalty on the magnitude of the regression coefficients; as a result, they are forced to decrease. The task is to minimize a penalized residual sum of squares,

$\hat{\beta}^{\mathrm{ridge}} = \underset{\beta}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$.

Lasso regression is analogous to ridge regression, but there exists a crucial difference: in ridge, the penalty term is $\sum_{j=1}^{p} \beta_j^2$, whereas the penalty term in lasso is $\sum_{j=1}^{p} |\beta_j|$. The lasso equation is given as

$\hat{\beta}^{\mathrm{lasso}} = \underset{\beta}{\mathrm{argmin}} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$.

Here, $n$ is the number of data values, and $p$ is the number of predictors. In Lasso, some of the predictors may be penalized to zero; therefore, Lasso reduces the size of the set of variables [Hastie et al., 2001].

2.7 Cross-Validation (CV)

In modern statistics, data partitioning is indispensable for evaluating a model. The process involves repeatedly partitioning the data into training and test samples, fitting a model on the training sample, and evaluating it on the test sample. The main goal is to estimate the test error rate, which is only possible if there are test data. A tempting solution is to approximate the test error rate by the available training error rate, but in general the training error tends to underestimate the test error. This problem is addressed by using a validation set. Under this approach, the observations in a data set are randomly divided into a training set and a validation set. The model is fitted on the training set and used to predict the response in the validation set, from which the error is calculated. The resulting validation error provides a reasonable estimate of the test error rate. The validation set approach has two drawbacks:

1. The observations used in the training set are different from those in the validation set. As a result, the validation error rate can vary greatly.
2. The model is trained only on a subset of observations, i.e., fewer observations, and may perform poorly. This may result in overestimating the test error rate.

Figure 2.1: K-fold cross-validation (k = 4).

These issues are resolved by adopting the methodology of cross-validation. In this work, three methods of cross-validation were used: k-fold cross-validation, shuffle-split cross-validation, and stratified k-fold cross-validation. In k-fold cross-validation, the entire set of observations is randomly divided into k parts (or folds) of equal size. Figure 2.1 shows how the process of 4-fold cross-validation is executed. When the first fold is held out as the validation set, the model is fitted on the remaining k-1 training folds. The trained model is then used to calculate the MSE on the held-out fold (i.e., the first fold). In the second iteration, the second fold is used for validation while the remaining folds are used for training. The process continues for a total of k iterations. A key aspect to note is that in each iteration, a $1/k$ portion of the data is used for validation, and a $(k-1)/k$ proportion is used for training. The process provides k estimates of the test error: $MSE_1, MSE_2, \ldots, MSE_k$. The cross-validation estimate [James et al., 2013] is given by:

$CV_{k\text{-fold}} = \frac{1}{k} \sum_{i=1}^{k} MSE_i$.
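A minimal sketch of this procedure with scikit-learn follows; it also includes the shuffle-split and stratified variants described next. The model and data are placeholders.

# A sketch of the three cross-validation strategies used in this work.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, ShuffleSplit, StratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

splitters = {
    "10-fold": KFold(n_splits=10, shuffle=True, random_state=0),
    # Stratified folds preserve the class proportions in each fold.
    "Stratified 10-fold": StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    # Shuffle-split draws independent random train/test partitions.
    "Shuffle-split": ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
}
for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")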
The process of cross-validation involves randomness in splitting the data values, which is essential for evaluating model performance and testing. James et al. [2013] show that CV is one of the best techniques to prevent over-fitting, especially when the data size is small. CV also offers several approaches to handle the issue of imbalanced classes.

The shuffle-split cross-validation method draws training and test sets randomly instead of forming folds. The technique is very useful for large data sets and is often computationally feasible. It is also known as Monte Carlo cross-validation [James et al., 2013]. The number of iterations performed is decided by the experimenter, and the results are averaged over the number of iterations. The percentage of data in the training and test splits is independent of the number of iterations. This CV is not particularly helpful when working with an imbalanced data set.

The stratified k-fold cross-validation technique is useful for rare-class classification. Stratified CV maintains the target class proportions, ensuring that the proportion of classes is balanced in each fold. In each fold, it maintains the distribution (mean, variance, among others) of the original data and is advantageous over plain k-fold cross-validation [Singh, 2022]. Stratified CV is very helpful for imbalanced-class classification. Figure 2.2 illustrates stratified k-fold cross-validation.

Figure 2.2: Stratified k-fold cross-validation.

2.8 Quantiles and Percentiles

We used different quantiles to split the data for classification. For the kth percentile, the lower portion contains k% of the data while the upper portion contains the rest of the data, (100-k)% [Triangles, 2015]. Tackling the imbalanced class problem is important when training a machine learning algorithm. A classification problem is considered imbalanced when the distribution of the training data is skewed. In such conditions, a classifier is usually biased towards the majority class, and the machine learning algorithms can fail to detect the minority class.

2.9 Random Sampling

Random sampling is another technique to tackle imbalanced class problems. Two strategies are undersampling and oversampling; Figure 2.3 illustrates both. Undersampling is the process in which the samples of the majority class are reduced. Oversampling is the process in which the samples of the minority class are duplicated.

Figure 2.3: Over-sampling and under-sampling.

These methods also have some weaknesses: oversampling can cause overfitting, while undersampling can result in the loss of information [Kumar, 2020]. Random sampling is also known as the naïve technique because it makes no assumptions about the data [Kumar, 2020].
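As a hedged illustration of both strategies, the sketch below uses sklearn.utils.resample on a toy DataFrame; dedicated packages such as imbalanced-learn provide the same operations.

# A sketch of random over- and under-sampling on a toy imbalanced data set.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(10),
    "label": ["Low"] * 8 + ["High"] * 2,  # imbalanced toy data
})
majority = df[df["label"] == "Low"]
minority = df[df["label"] == "High"]

# Oversampling: duplicate minority examples until the classes match.
oversampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=0),
])

# Undersampling: discard majority examples (risking information loss).
undersampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=0),
    minority,
])
print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())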
Its layout helps in depicting the performance of the applied machine learning algorithm. It has two dimensions: the actual value of an example and the predicted value by the algorithm. For binary classification, this results in four possibilities: an example is either positive or negative, and it can either be correctly or incorrectly classified. The positive cases are represented as P, whereas the negative cases are given by N. The total population is defined as P+N. Here, the columns define the actual condition, whereas the rows explain the predicted condition. These are the metrics that are calculated from the confusion matrix. True Positive (TP): an algorithm predicts a positive result, which is actually positive. It means that the test correctly detected the presence of a trait or condition. True Negative (TN): an algorithm predicts a negative result, which is actually negative. It means that the test correctly detected the absence of a trait or condition. False Positive (FP): an algorithm predicts a positive result that is 23 actually negative. It means that the test falsely detected the presence of a trait or condition. False Negative (FN): an algorithm predicts a negative result that is actually positive. It means that the test falsely detected the absence of a trait or condition. Figure 2.4: Confusion matrix for the binary classification problem. A false positive is also known as type I error, which is defined as rejecting true null hypothesis. A false negative is also known as type II error, which is defined as accepting a false null hypothesis. There are also some further statistical metrics which are used to evaluate the performance of a classifier [Fawcett, 2006]: Sensitivity/recall/true positive rate (TPR) is the proportion of the positive class that was correctly classified. TP TP = . P TP + FN 24 Specificity/ selectivity/ true negative rate (TNR) is the proportion of the negative class that was correctly classified. TN TN = . N TN + FP Precision or positive predictive value (PPV) is the ratio of correctly predicted positive classes to the total predicted positive classes. TP . TP + FP Accuracy(ACC) is the ratio of correct predictions to the total number of predictions. TP + TN . P +N F1-Score is the harmonic mean of precision and sensitivity. 2P recision ∗ Sensitivity . P recision + Sensitivity 2.10.2 ROC AUC Curve The ROC AUC curve is an assessment metric for classification at various discrimination thresholds [Lars et al., 2011]. Receiver Operator Characteristic (ROC) is the probability curve, and AUC is the area under the ROC curve that estimates the degree of separability. It indicates how well the model can distinguish the two classes. The higher the AUC, the better the model [Narkhede, 2022]. 25 Figure 2.5: ROC AUC curve. The x-axis represents the true positive rate (TPR) while the y-axis denotes the false positive rate (FPR). The blue area under the red curve (ROC) is AUC. If the AUC is 1, it means the model is excellent and able to classify classes perfectly. A poor model has an AUC score close to 0. A 0 AUC score reciprocates the results, it predicts class 1 as class 2 and vice-versa. A model that has a 0.5 AUC score is not capable of classifying the classes. To calculate AUC, the True Positive Rate (TPR)/Sensitivity is plotted against the False Positive Rate (FPR)/(1-Specificity), where TPR is plotted on the y-axis and the FPR is plotted on the x-axis (Figure 2.5). 2.10.3 R2 and MSE The coefficient of determination is denoted by R2 . 
2.10.3 R2 and MSE

The coefficient of determination is denoted by R2. The R2 value is the statistical measure that computes the percentage of variance in the dependent variable around its mean that is explained by the model:

$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$.

Here, $y_i$ is the response value and $\hat{y}_i$ is the predicted value; $\bar{y}$ is the mean of the response variable, and $n$ is the number of data points [Lars et al., 2011]. The mean squared error (MSE) provides a measurement of how far the predictions are from the actual values on average:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

Both R2 and MSE are used to check the performance of the regression model.

Chapter 3

Data

I used home care data collected from 2010 to 2019 in the Interior Health Region of British Columbia, Canada. This includes two data sets. The first is interRAI Home Care assessment data (RAI-HC), which is used by health professionals to record the current status of a home care client. The portion of the data shared consists of 837,536 records, each with 423 variables, which are mostly categorical in nature. The variables cover many aspects of a home care client's characteristics and statuses, such as health conditions, activities of daily living (ADL), independent activities of daily living (IADL), availability of both formal and informal caregiving, mood, and socialization. The second data set is service data, which covers home care usage by clients and has 8 attributes. This project was a collaboration with researchers at UNBC. The UNBC Research Ethics Board (REB) determined that REB approval was not necessary for this project because it used entirely secondary data that had been previously linked and de-identified by the health authorities before being stored on a secure UNBC server.

The primary values of interest to my work are (1) the hours of service allocated to a client after assessment, and (2) the actual hours of service used by a client. While a perfect assessment would result in these two numbers always being the same for a given client, in practice they can differ greatly depending on whether a client's actual needs over time were greater or lesser than when assessed. For this particular phase of the project, I focused on (2), the actual hours of service used. Based on input from healthcare experts and patient partners, I identified the three weeks following an assessment as the critical time period of interest for prediction. For each assessment, I calculated the average hours per day of home care service use following the assessment. If a course of home care service began before the assessment and/or was completed after the three-week time limit, only the portion that fell within the three-week time period was included. The mean value for all assessments was 0.79 hours per day. For regression methods, predicting this value of the target feature was the goal. For classification, the task was to split the service use into two categories: high service users and low service users.

3.1 Basis for Splitting the Target

Multiple bases were used to split the target into categories: the mean, median, 75th, 90th, and 95th percentiles.

High: assessments of home care clients above the threshold, i.e., their average hours of home care service use per day was greater than or equal to the threshold.

Low: assessments of home care clients below the threshold, i.e., the remainder of the records that did not fall into the High category.
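As an illustration, here is a minimal sketch of dichotomizing a target at each candidate threshold; the column name and synthetic values are placeholders for the actual data.

# A sketch of dichotomizing the target at each candidate threshold;
# the column name and synthetic values are placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"AverageHoursPerDay": rng.exponential(scale=0.8, size=1000)})

thresholds = {
    "mean": df["AverageHoursPerDay"].mean(),
    "median": df["AverageHoursPerDay"].median(),
    "75th": df["AverageHoursPerDay"].quantile(0.75),
    "90th": df["AverageHoursPerDay"].quantile(0.90),
    "95th": df["AverageHoursPerDay"].quantile(0.95),
}
for name, t in thresholds.items():
    labels = np.where(df["AverageHoursPerDay"] >= t, "High", "Low")
    print(name, pd.Series(labels).value_counts().to_dict())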
When the mean was used to dichotomize the target, the classes did not show a balanced distribution (Figure 3.1). Among the two classes, the ratio of Low to High was approximately 5:3, so cross-validation and balancing of the classes were employed before training the model.

Figure 3.1: Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the mean ("Low" is below the mean; "High" is above the mean).

The boxplot in Figure 3.2 gives information about the target and its outlier values. After plotting the histogram (Figure 3.3), it was noticed that the data are right-skewed and most of the values of the response variable are very small; thus, almost all of the records lie close to the left corner of the histogram.

Figure 3.2: Boxplot of the response variable (Average Hours Per Day). Note the outliers which are more than 24 hours (per day).

After reformulating the histogram to focus on the small counts of hours on the x-axis, as shown in Figure 3.4, I observed that the response variable had values overwhelmingly below 5 hours.

Figure 3.3: Histogram of the response variable (Average Hours Per Day).

Figure 3.4: Histogram of the response variable after zooming in.

Table 3.1 summarizes the target. The mean value of the response variable is 0.79 and the median value is 0.55, and it is evident from the skewness value of 14.01 that the distribution is right-skewed. The largest value is 84.92 average hours per day, which is clearly an error (because there are only 24 hours in a day). I deleted the 5 rows with average hours of home care service use above 24 hours.

Table 3.1: Summary of the response variable

Mean: 0.79
Standard Error: 0.00
Median: 0.55
Mode: 0.50
Standard Deviation: 1.17
Sample Variance: 1.37
Kurtosis: 485.36
Skewness: 14.01
Range: 84.92
Minimum: 0.00
Maximum: 84.92
Sum: 125635.87
Count: 159693
Largest(1): 84.92
Smallest(1): 0.00
Confidence Level (95.0%): 0.01

When the target was divided into classes using the 90th percentile value (1.59 average hours per day), the class variable did not have a balanced distribution. The formation of the target bins visualized in Figure 3.5 is based on the 90th percentile (1.59 Avg. Hrs./Day). Among the two classes, the ratio of Low to High was approximately 7:1, so cross-validation and balancing of the classes were used before training the model. Similarly, the 95th percentile (2.0 Avg. Hrs./Day) was used to split the classes, following the same process.

Figure 3.5: Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 90th percentile ("Low" is below the 90th percentile; "High" is above the 90th percentile).

3.2 Data Cleaning and Preparation

The data set has 132 description variables among the 423 given variables; 291 features are left after the removal of these description variables. Finally, 61 variables were selected through the Recursive Feature Elimination method with Random Forests, the correlation technique, and the regularisation method of Lasso. The flow chart in Figure 3.6 details the data cleaning and preparation. The initial number of rows in the data is 837,536. After removing rows with null values, 794,804 rows remain. There were 5 outliers in the response variable "Average Hours Per Day"; after removing these outliers, the final number of rows is 794,799.

Figure 3.6: Flow chart of the data cleaning and preparation.

Chapter 4

Methodology

In this chapter, the connection to the database, exploratory data analysis, and the relevant machine learning methodologies are explained.
4.1 Environment Setup for Acquiring and Processing Data

To accommodate the large data and heavy computation, the main programming language chosen was Python. Python is a free and open-source language; its simplicity does not limit its functional abilities. It is a high-level programming language with an enormous existing community. In this project, Python libraries such as NumPy, Pandas, Matplotlib, Keras, and scikit-learn were used.

In the project, the database was accessed through Microsoft Structured Query Language (MS-SQL). MS-SQL was used to maintain, modify, and manipulate the relational database: querying the various database tables, manipulating the data, and joining tables. After this, the main task was to connect the Python code with MS-SQL to access the data for computation and analysis. The task was completed by using the pyodbc module. Pyodbc is a free Python package that makes it easy to connect to ODBC databases (Figure 4.1); it provides a quick way to link Python programs to data sources using an ODBC driver. The target (Average Hours Per Day) in the research was calculated by following consecutive querying steps: joining the Home Care table with the service data table, and then applying column calculations over the joined table.

Figure 4.1: Connection of Python with MS-SQL.

4.2 Data Wrangling, Visualization, and Exploratory Analysis

To understand the data, viewing the patterns and doing statistical analysis are very helpful. The process of sorting a massive data set and making it easily accessible for analysis falls under the topic of data wrangling. The provided data set was very large, so data wrangling was adopted to make the data readily available for use. This was done by converting the data into a data frame and making interpretation easily accessible through the joining of relevant tables. The data were visualized using the Python libraries Matplotlib and Seaborn. Here, the plot of the response variable clearly showed that it was heavily right-skewed, with a skewness of 14.01. The box plot provided a clear picture that there were some outliers. Moreover, I also noticed that some entries were erroneous, as they did not follow the standard pattern; I suspect that these entries were entered into the database incorrectly through human or system error. Then, the rules of exploratory data analysis (EDA) were employed to see what the data revealed statistically. Here, attention was paid to the types of the variables: some variables are quantitative in nature, whereas others are categorical. After producing a statistical summary of the data, the measurements of central tendency and the outliers were noted for quantitative variables, whereas the class counts were noted for categorical variables. This provides a clear view of the distribution of the data and how balanced (or imbalanced) the classes are. The EDA reconfirmed the distribution of the response variable and its skewness, which were noticed earlier through the data visualization.

4.3 Data Preparation and Cleaning

After noticing that there are outliers in the data, the vital task is to remove them. Also, some of the entries in the data had null values, so the rows with null entries were removed. In the process of data wrangling and visualization, it was noticed that the data had some missing values. These were not actually missing but simply a single space character (' ') used to represent a 0. This small but prevalent issue was a challenge to identify. However, I overcame it by replacing these space entries with zeroes; I confirmed with the domain expert and the patient partners that this was an appropriate solution.
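A hedged sketch of these cleaning steps in pandas follows; the toy DataFrame and column names are illustrative stand-ins for the joined RAI-HC and service tables.

# A sketch of the cleaning steps: space-as-zero entries, null rows, and
# impossible outliers; the toy DataFrame stands in for the joined tables.
import pandas as pd

df = pd.DataFrame({
    "HoursOfInformalHelp5WeekDays": [" ", "2", "0", None],
    "AverageHoursPerDay": [0.5, 84.92, 1.2, 0.7],
})

df = df.replace(" ", 0)   # single-space placeholders actually mean 0
df = df.dropna()          # drop rows containing null values

# A day has at most 24 hours, so larger values are recording errors.
df = df[(df["AverageHoursPerDay"] >= 0) & (df["AverageHoursPerDay"] <= 24)]
print(df)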
4.4 Dichotomizing the Response Variable using Quantiles and Percentiles

To understand the distribution of the target, I used boxplots and histograms. The boxplots gave complete information about the median, different quantile values, and outliers. As shown in Figure 3.2, the largest outlier value is 84.92, which is erroneous. With the help of the histogram (Figure 3.3), it became evident that the data were highly right-skewed, with a skewness value of 14.01, where mode < median < mean. After noticing that five values in the target were outliers, it became important to remove them. With expert advice, I kept only target values of up to 24 hours. Figures 4.2 and 4.3 show the histogram and the boxplot of the target with no outliers.

Figure 4.2: Distribution of the response variable after removing outliers.

Figure 4.3: Boxplot of the response variable after removing outliers.

Table 4.1: Values before and after removing outliers from the response variable "Average Hours Per Day" when 0 ≤ Average Hours Per Day ≤ 24.

Statistical Measure | After Removing Outliers | Original Value
Mean | 0.79 | 0.79
Median | 0.55 | 0.55
75th percentile | 1.04 | 1.04
90th percentile | 1.59 | 2.05
95th percentile | 2.00 | 2.50

Some of the statistical measures changed after removing the outliers from the target, whereas the mean, median, and 75th percentile values remained unchanged. Table 4.1 shows the values of the statistical measures before and after removing the outliers. All values are in units of average hours per day.

4.5 Encoding of the Data

The majority of the variables in the data set are categorical in nature. Some categorical variables have six levels, whereas some have four. For example, the variables AllOtherRespiratoryTreatments, MedicationsbyInjection, ExerciseTherapy, OccupationalTherapy, MedicalAlertBraceletorElectronicSecurityAlert, DayCentre, SkinTreatment, etc., are categorical with four levels defined as:

0 = Not applicable
1 = Scheduled, full adherence as prescribed
2 = Scheduled, partial adherence
3 = Scheduled, not received

Using a qualitative variable directly in a machine learning algorithm may mislead the training. For instance, if linear regression is applied, there is a chance that the relation is linear for one level but quadratic or cubic for other levels [James et al., 2013]. This would result in fallacious training of the algorithm, which may affect the prediction. To overcome this issue of erroneous learning, the concept of one-hot encoding is introduced. In this approach, if a categorical variable has 2 levels, then a binary variable is created for each possible value; if there are 3 levels, then 3 binary variables are formed. To perform this task, the "OneHotEncoder" module of the Python sklearn library was used [Lars et al., 2011]. The process generates a dummy variable for each categorical value and stores the results in a sparse matrix. By default, the encoder automatically generates dummy variables based on the unique values of each of the categorical variables.
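As an illustration, a minimal sketch of encoding one of the four-level variables listed above with scikit-learn's OneHotEncoder; the data values are illustrative.

# A sketch of one-hot encoding a categorical RAI-HC variable; the values
# are illustrative levels 0-3 as defined above.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"ExerciseTherapy": [0, 1, 2, 3, 1, 0]})

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(df[["ExerciseTherapy"]])

print(encoder.categories_)  # one binary column is created per level
print(encoded.toarray())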
4.6 The Process of Feature Selection

The original data had 423 variables. It was imperative to perform variable selection to find the predictors that were good at predicting the response variable. There are several methods available in machine learning which help in the process of selecting the important features. To select the features, I used the correlation technique, recursive feature elimination, and the regularisation method of lasso [Lars et al., 2011].

4.6.1 Finding Independent Variables

The work on this research began with multiple linear regression. With the provided 423 variables, it was necessary to identify which variables were linearly independent. Hence, the correlation was calculated for the whole design matrix, and only a single variable was selected from each group of highly correlated variables. This process provided a set of independent variables. From this set, the variables that possessed a high correlation with the response were selected. This ultimately brought down the size of the provided feature set.

4.6.2 Recursive Feature Elimination with Random Forests

Recursive feature elimination (RFE) minimizes the model complexity by deleting variables one by one until the number of features remaining is ideal. The technique is provided by the scikit-learn library of Python [Lars et al., 2011]. I used RFE with Random Forests to select features; the reason for using Random Forests was that its R2 was large compared to other methods. Many regression algorithms possess feature weights or coefficients which multiply with their respective feature values in order to predict a response; Random Forests analogously provide importance scores for the features. The weights obtained after training the algorithm are sorted in order. To reduce the model complexity, the features whose weights were close to zero were eliminated by introducing a threshold within the recursive feature elimination method, since their extremely low values contributed little to the model [Tuychiev, 2021].

4.6.3 Feature Selection with Regularization Method of Lasso

Lasso is a regularisation method. The penalty term used in the cost function is $\sum_{j=1}^{p} |\beta_j|$, where $\beta_j$ represents the coefficient of the jth of the p features [Hastie et al., 2001]. There exists a tuning parameter $\lambda$, which controls the strength of the regularisation. If the value of $\lambda$ is small, the cost function behaves analogously to the MSE, reducing the effect of regularization. However, if the value of $\lambda$ is large, the regularisation effect dominates, shrinking some of the coefficients to zero. After training on the data, the lasso algorithm assigns coefficients ($\beta$s) to the regression equation. The values of these coefficients directly affect the prediction of the response. Hence, a threshold value for the $\beta$s was chosen, and only the variables with coefficient values larger than the threshold were kept, in order to reduce the model complexity. In this way, the Lasso method helped in selecting the variables.

Further, to cross-verify the results, the domain expert's and patient partners' advice was taken in deciding whether the selected variables were important from an application perspective. These steps helped in selecting the 61 features out of the total provided features.
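The two model-based selection steps can be sketched as follows; the synthetic data, feature counts, and thresholds are illustrative, not the values used in this study.

# A sketch of RFE with Random Forests and of lasso-based selection;
# the data, feature counts, and thresholds are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# Recursive feature elimination: drop the weakest feature(s) each round.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=0),
          n_features_to_select=15, step=1)
rfe.fit(X, y)
print("RFE kept features:", np.flatnonzero(rfe.support_))

# Lasso: keep only features whose coefficients exceed a threshold.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso kept features:", np.flatnonzero(np.abs(lasso.coef_) > 1e-3))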
4.7 Machine Learning Cross-Validation Methods

Cross-validation provides several ways to mitigate over-fitting, which matters particularly when the classes are imbalanced. K-fold, stratified K-fold, and shuffle-split cross-validation techniques were used for this work. When dividing the data into training and test sets, a ratio of 80:20 or 90:10 was used. The k-fold cross-validation was used with k = 10 in order to balance the bias-variance trade-off [James et al., 2013].

4.8 Machine Learning for Regression Problems

To predict the target variable, multiple linear regression was used first. After that, the shrinkage methods of ridge and lasso were employed, followed by the decision tree and ensemble methods. A decision tree uses only one variable to split the whole data at the root node, which means it gives importance mainly to the two or three variables at the top nodes. Thus, it undermines the importance of other variables when the number of variables is large. Moreover, visualizing a decision tree of large depth is complex when the features are numerous. To improve on these issues, ensemble methods were applied. To execute all the mentioned algorithms, the scikit-learn library of Python was used.
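A minimal sketch of this regression comparison on synthetic data follows; the hyperparameters are illustrative, and the research itself used the encoded RAI-HC predictors as X and Average Hours Per Day as y.

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)

models = {"Multiple linear": LinearRegression(),
          "Ridge": Ridge(alpha=1.0),
          "Lasso": Lasso(alpha=0.1),
          "Decision tree": DecisionTreeRegressor(random_state=0),
          "Bagged trees": BaggingRegressor(n_estimators=100, random_state=0),
          "Random forests": RandomForestRegressor(n_estimators=100,
                                                  random_state=0)}

for name, model in models.items():
    # Mean R^2 over 10 folds, the same performance measure used here.
    r2 = cross_val_score(model, X, y, cv=10, scoring="r2").mean()
    print(f"{name}: mean 10-fold R^2 = {r2:.3f}")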
4.9 Machine Learning for Classification Problems

The machine learning methods KNN, decision trees, logistic regression, and ensemble methods were used for classification. I observed that the KNN algorithm suffered from the curse of dimensionality and was computationally inefficient. Euclidean distance was used to run the algorithm. As the KNN algorithm suffered from the curse of dimensionality, it was dropped from the analysis.

4.10 Computational Resources

These are the specifications of the server used for running the program code:

• CPU: four Xeon E7-4850 with 40 cores running at 2.00 GHz
• RAM: 256 GB
• Operating System: Windows Server 2019

It is a powerful server system that allows many programs to run in parallel efficiently. To reduce the processing time, the programs were broken into chunks and executed in parallel: the individual cores run at only 2 GHz, but 40 of them are available along with a large amount of RAM. Specifically, 24 programs were executed in parallel, and the results were tested by evaluating model performance.

Chapter 5

Results

In this chapter, I report the results of the various machine learning algorithms. The selected features are presented first, followed by the findings for regression and classification.

5.1 Selected Features

The given data set has 423 attributes. There are 132 description variables, which were removed. On the remaining 291 variables, the algorithm of Recursive Feature Elimination (RFE) was applied with Random Forests; RFE helped in selecting 53 of the 291 variables. Twenty-seven variables showing high correlation with the response were also selected from the 291 variables, and the regularisation technique of LASSO helped in selecting three important variables. Further, these 27, 53, and 3 variables were presented to the domain experts and patient partners. The selected attributes were cross-verified and assessed based on realistic health scenarios. The whole process allowed me to select 61 features out of the 423 original attributes. The list of the 61 selected features is given below.

1. Hearing
2. MakingSelfUnderstood
3. AbilityToUnderstandOthers
4. SadMoodRecurrentCryingTearfulness
5. LengthOfTimeAloneDuringDay
6. LiveWithClientSecondary
7. RelationshipToClientPrimary
8. HoursOfInformalHelp5WeekDays
9. HoursOfInformalHelp2WeekendDays
10. MealPreparationSelfPerformance
11. MealPreparationDifficulty
12. OrdinaryHouseworkDifficulty
13. ManagingFinancesSelfPerformance
14. ManagingFinancesDifficulty
15. ManagingMedicationsSelfPerformance
16. ManagaingMedicationsDifficulty
17. PhoneUseSelfPerformance
18. PhoneUseDifficulty
19. ShoppingSelfPerformance
20. ShoppingDifficulty
21. TransportationSelfPerfromance
22. TransportationDifficulty
23. MobilityInBed
24. Transfer
25. LocoMotionInHome
26. LocoMotionOutsideHome
27. DressingUpperBody
28. DressingLowerBody
29. Eating
30. ToiletUse
31. PersonalHygiene
32. Bathing
33. ModeOfLocoMotionIndoors
34. ModeOfLocoMotionOutdoors
35. StaminaDays
36. BladderContinence
37. BowelInContinence
38. RenalFailure
39. CoronaryHeartDisesase
40. Hypertension
41. IrregularlyIrregularPluse
42. DementiaOtherThanAlzheimers
43. Parkinsonsism
44. Arthritis
45. HipFracture
46. OtherFractures
47. Osteoporosis
48. Glaucoma
49. AnyPsychiatricDiagnosis
50. PainFrequency
51. FallsFrequency
52. Swallowing
53. BetterOffInOtherLivingArrangment
54. HomeHealthAidesDays
55. HomeHealthAidesHours
56. HomemakingServiceHours
57. MealsHours
58. PhysicalTherapyHours
59. MedicalAlertBraceletorElectronicSecurityAlert
60. NumberofMedications
61. Anxiolytic

5.2 Results for Regression Problem

For regression, the R2 value and the mean squared error (MSE) are the two key measures used to check the performance of the algorithms. The R2 value is the statistical measure that computes the percentage of variance in the dependent variable that is explained by the independent variable or variables in the regression model. The MSE measures how far, on average, the predictions were from the actual values. Figures 5.1 and 5.2 present the R2 and MSE values, respectively, of the regression methods used to predict the target.

Figure 5.1: R2 for the regression problem, which was used to predict the target (Average Hours Per Day).

Figure 5.2: MSE for the regression problem, which was used to predict the target (Average Hours Per Day).

Table 5.1: Prediction of the first 10 values of the response variable "Average Hours Per Day" using Random Forests. The bold predicted values are close to the actual values.

Index   Actual Values   Predicted Values
0       0.245           0.198
1       0.540           0.359
2       2.258           2.265
3       1.496           1.487
4       0.975           0.501
5       0.690           0.638
6       1.596           1.581
7       0.597           0.439
8       2.258           2.265
9       2.258           2.208

Table 5.1 shows the actual and predicted values of the response variable, predicted using the Random Forests method. The standard deviation and the 95% confidence interval of the response variable are 1.103 and 0.002, respectively. Taking these facts into account, the predicted values that were within one decimal place of the actual values are presented in bold (Table 5.1).

The ensemble methods Bagged Trees and Random Forests had comparatively high R2 values, and their MSE values were also low; hence, these were the promising performers among the regression methods. It is evident from Figure 5.1 that the Random Forests method has the largest R2 (53%). The ensemble methods give comparatively improved results in regression because they combine the results of numerous base models to enhance predictive power. To check the performance of the regression models, the mean squared errors were also calculated; the MSE values are shown in Figure 5.2. It is clearly evident from the plot that the MSE values of the ensemble methods are low compared to every other regression algorithm.
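A minimal sketch of how Table 5.1-style predictions and the R2/MSE figures can be produced with scikit-learn, on synthetic stand-in data:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)

print("first 10 actual:   ", y_te[:10].round(3))
print("first 10 predicted:", pred[:10].round(3))
print("R^2:", round(r2_score(y_te, pred), 3),
      "MSE:", round(mean_squared_error(y_te, pred), 3))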
5.3 Results for Classification Problems

The response variable, Average Hours Per Day, was dichotomized using the mean, median, 75th percentile, 90th percentile, and 95th percentile of the response variable. The classification results are given in the following sections.

5.3.1 Classification Results for Mean

The mean of the response variable is 0.79 average hours per day. It was used to dichotomize the target, which resulted in an imbalanced classification with a class ratio of 5:3. Gradient Boosting, Bagged Trees, and Random Forests were the most promising predictors, with the highest accuracy. Comparing the cross-validation techniques, 10-fold cross-validation worked better than stratified 10-fold cross-validation and shuffle-split cross-validation. With 10-fold cross-validation, Bagged Trees, Gradient Boosting, and Random Forests showed accuracies of 0.838, 0.840, and 0.843, respectively. With stratified cross-validation, the highest accuracy was 0.796, for Gradient Boosting, whereas with shuffle-split cross-validation Gradient Boosting reached an accuracy of 0.810.

Figure 5.3 shows that classifiers such as the Decision Tree and Logistic Regression were not the best fit for classifying the given data set based on accuracy. It also illustrates that 10-fold cross-validation performs well compared to the other cross-validation techniques. Random Forests, Gradient Boosting, and Bagged Trees have comparatively better accuracy, approximately 84%, with 10-fold cross-validation.

Figure 5.3: Accuracy of classification algorithms when the target is dichotomized using the mean.

Figure 5.4: KNN with shuffle-split CV gives its highest accuracy of 0.787 when the K value is 5 for the given range of K.

Figure 5.5: KNN with 10-fold CV gives its highest accuracy of 0.783 when the K value is 5 or 9 for the given range of K.

KNN, when run with a limited amount of data (only 1000 rows), showed an accuracy of 0.787 with shuffle-split cross-validation when the value of K was 5 (Figure 5.4). It is also evident from Figure 5.5 that, at K values of 5 and 9 with the same initial 1000 rows, KNN gives almost the same accuracy of 0.783 when executed with 10-fold cross-validation. But as I increased the size of the data, the efficacy of the KNN algorithm decreased and its running time increased.

Cross-validation played an important role in overcoming the problem of over-fitting in imbalanced-class problems. Among all the methods, 10-fold cross-validation improved the accuracy the most; hence it gave the highest accuracy when the target was dichotomized using the mean. 10-fold cross-validation generated improved results with the ensemble techniques, as it helped in developing a less biased model in comparison to the other methods. The reason behind its better performance is that it gives every observation in the data set a chance to appear in both the training and test sets.

Since the classes were imbalanced, ROC AUC scores were calculated to check the performance of the models. Figure 5.6 shows the ROC AUC scores for the different machine learning algorithms. Random Forests and Bagged Trees have comparatively good ROC AUC scores of 0.931 and 0.930, respectively.

Figure 5.6: ROC AUC scores of classification when the target is dichotomized using the mean.
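A minimal sketch of the two steps reported in this section, dichotomizing the skewed target at its mean and extracting an importance ranking from a fitted Random Forests classifier, on synthetic stand-in data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                 # stand-in predictors
hours = rng.exponential(scale=0.79, size=1000)  # right-skewed stand-in target

y = (hours > hours.mean()).astype(int)          # 1 = "High" user, 0 = "Low" user

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("top 5 feature columns by importance:", ranking[:5])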
The top 5 important variables selected using Random Forests when the target was dichotomized using the mean are:

1. HomeHealthAidesHours
2. HoursOfInformalHelp5Weekdays
3. HoursOfInformalHelp2WeekendDays
4. NumberOfMedications
5. ModeOfLocomotionOutdoors

These variables were selected on the basis of their importance weights from Random Forests.

5.3.2 Classification Results for Median

The median of the response variable is 0.55 average hours per day. It was used to dichotomize the target. The accuracies for the different machine learning methods are provided in Table 5.2.

Table 5.2: Classification evaluation using accuracy when the target is dichotomized using the median. The highest accuracies are highlighted in bold.

ML Classifiers/CV      Stratified 10-Fold CV   10-Fold CV   Shuffle-Split CV
Decision Tree          0.600                   0.598        0.599
Logistic Regression    0.603                   0.602        0.605
Adaboost               0.620                   0.619        0.623
Gradient Boosting      0.768                   0.652        0.654
Bagged Trees           0.789                   0.788        0.787
Random Forests         0.789                   0.788        0.788

The Decision Tree with 10-fold and shuffle-split cross-validation had the lowest accuracies, approximately 0.60. Classifiers such as Adaboost, Logistic Regression, and Gradient Boosting had comparatively low classification accuracies. The Bagged Trees and the Random Forests had the highest accuracy, approximately 79%.

Table 5.3 shows the confusion matrices for the Bagged Trees and the Random Forests for one fold out of the stratified 10-fold cross-validation. The accuracies for the Bagged Trees and the Random Forests are (31467 + 31250)/(31467 + 31250 + 8261 + 8501) = 0.789 and (31463 + 31253)/(31463 + 31253 + 8265 + 8498) = 0.789, respectively.

Table 5.3: Confusion matrices for median classification for one fold within stratified 10-fold cross-validation.

(a) Bagged Trees

Predicted \ Actual    High     Low
High                 31467    8261
Low                   8501   31250

(b) Random Forests

Predicted \ Actual    High     Low
High                 31463    8265
Low                   8498   31253

The top 5 important variables selected using Random Forests when the target was dichotomized using the median are:

1. HomeHealthAidesHours
2. HoursOfInformalHelp5Weekdays
3. HoursOfInformalHelp2WeekendDays
4. HomeHealthAidesDays
5. Bathing

These variables were selected on the basis of their importance weights from Random Forests. ROC AUC scores were used to assess the models' performance. The ROC AUC scores for the various machine learning methods are shown in Figure 5.7. The Random Forests and the Bagged Trees had comparatively good ROC AUC scores of 0.926 and 0.927, respectively.

Figure 5.7: ROC AUC scores of classification when the target is dichotomized using the median.

5.3.3 Classification Results for 75th Percentile

The 75th percentile of the response variable is 1.04 average hours per day. When this value was used to dichotomize the target, it resulted in the following class sizes (Table 5.4).

Table 5.4: Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 75th percentile.

High    198609
Low     596190

For the 75th percentile classification of the target, the accuracies of all the algorithms were above 0.70, and the lowest accuracy recorded was 0.74. The Bagged Trees and the Random Forests had the highest accuracy, approximately 84% (Table 5.5).

Table 5.5: 75th percentile classification evaluation using accuracy. The highest accuracies are highlighted in bold.

ML Classifiers/CV      Stratified 10-Fold CV   10-Fold CV   Shuffle-Split CV
Logistic Regression    0.751                   0.751        0.740
Decision Tree          0.752                   0.752        0.752
Adaboost               0.756                   0.756        0.755
Gradient Boosting      0.768                   0.767        0.768
Bagged Trees           0.844                   0.841        0.840
Random Forests         0.844                   0.843        0.842
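The confusion matrices in this chapter report one fold's raw counts, with accuracy recomputed as (true positives + true negatives)/total. A minimal sketch of that computation with scikit-learn, using tiny made-up label vectors (note that scikit-learn places actual classes on the rows, the transpose of the tables shown here):

import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = "High", 0 = "Low"
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())   # diagonal over total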
Table 5.6 shows the confusion matrices for the Boosting classifier and the Random Forests for one fold out of the stratified 10-fold cross-validation. The accuracies for the Random Forests and the Boosting classifier are (12666 + 54451)/(12666 + 54451 + 7194 + 5168) = 0.844 and (2985 + 58058)/(2985 + 58058 + 16875 + 1561) = 0.768, respectively.

Table 5.6: Confusion matrices for 75th percentile classification for one fold within stratified 10-fold cross-validation.

(a) Boosting classifier

Predicted \ Actual    High     Low
High                  2985    16875
Low                   1561    58058

(b) Random Forests

Predicted \ Actual    High     Low
High                 12666     7194
Low                   5168    54451

As the classes for the 75th percentile classification were unbalanced, ROC AUC scores were used to assess the models' performance. The ROC AUC scores for the various machine learning methods are shown in Figure 5.8. The Random Forests and the Bagged Trees had comparatively good ROC AUC scores of 0.940 and 0.939, respectively.

Figure 5.8: ROC AUC scores of classification when the target is dichotomized using the 75th percentile.

The top 5 important variables selected using Random Forests when the target was dichotomized using the 75th percentile are:

1. HomeHealthAidesHours
2. HoursOfInformalHelp5Weekdays
3. HoursOfInformalHelp2WeekendDays
4. NumberOfMedications
5. BladderContinence

These variables were selected on the basis of their importance weights from Random Forests.

5.3.4 Classification Results for 90th Percentile

The 90th percentile of the response variable is 1.59 average hours per day. When this value was used to dichotomize the target, it resulted in the following class sizes (Table 5.7).

Table 5.7: Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 90th percentile.

High    78825
Low     709421

For classification on the basis of the 90th percentile value of the target, the Bagged Trees and the Random Forests were the most promising, with the highest accuracy of approximately 92% (Table 5.8). Classifiers such as the Decision Tree, Logistic Regression, and Boosting had comparatively low classification accuracies. Table 5.9 shows the confusion matrices for the Bagged Trees and the Random Forests for one fold out of the stratified 10-fold cross-validation. The accuracies for the Bagged Trees and the Random Forests are (4152 + 69294)/(4152 + 69294 + 3795 + 2238) = 0.924 and (4131 + 69309)/(4131 + 69309 + 3816 + 2223) = 0.924, respectively.

Table 5.8: 90th percentile classification evaluation using accuracy. The highest accuracies are highlighted in bold.

ML Algorithms/CV       Stratified 10-Fold CV   10-Fold CV   Shuffle-Split CV
Adaboost               0.896                   0.903        0.904
Logistic Regression    0.901                   0.891        0.902
Gradient Boosting      0.902                   0.900        0.901
Decision Tree          0.913                   0.915        0.914
Bagged Trees           0.924                   0.921        0.924
Random Forests         0.924                   0.920        0.923

Table 5.9: Confusion matrices for 90th percentile classification for one fold within stratified 10-fold cross-validation.

(a) Bagged Trees

Predicted \ Actual    High     Low
High                  4152     3795
Low                   2238    69294

(b) Random Forests

Predicted \ Actual    High     Low
High                  4131     3816
Low                   2223    69309

Since the classes for the 90th percentile classification were unbalanced, ROC AUC scores were used to assess the models' performance. The ROC AUC scores are shown in Figure 5.9. Both Random Forests and Bagged Trees have a comparatively good ROC AUC score of 0.969, while the Decision Tree has the lowest ROC AUC score (0.700).

Figure 5.9: ROC AUC scores of classification when the target is dichotomized using the 90th percentile.
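A minimal sketch of the ROC AUC evaluation used for these imbalanced splits, following the same predict_proba pattern as the program code in Appendix A (synthetic data; the roughly 9:1 class weights are illustrative, similar in spirit to the 90th percentile split):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # P(class 1)
print("ROC AUC:", round(auc, 3))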
The top 5 important variables selected using Random Forests when the target was dichotomized using the 90th percentile are:

1. HomeHealthAidesHours
2. HoursOfInformalHelp5Weekdays
3. HoursOfInformalHelp2WeekendDays
4. NumberOfMedications
5. Bathing

These variables were selected on the basis of their importance weights from Random Forests.

5.3.5 Classification Results for 95th Percentile

The 95th percentile of the response variable is 1.93 average hours per day. When this value was used to dichotomize the target, it resulted in the following class sizes (Table 5.10).

Table 5.10: Formation of classes when the response variable "Average Hours Per Day" is dichotomized using the 95th percentile.

High    39356
Low     748890

Table 5.11: 95th percentile classification evaluation using accuracy. The highest accuracies are highlighted in bold.

ML Algorithms/CV       Stratified 10-Fold CV   10-Fold CV   Shuffle-Split CV
Logistic Regression    0.950                   0.953        0.942
Decision Tree          0.951                   0.951        0.953
Adaboost               0.951                   0.950        0.951
Gradient Boosting      0.953                   0.952        0.951
Bagged Trees           0.960                   0.962        0.960
Random Forests         0.960                   0.960        0.963

For the 95th percentile classification of the target, the accuracies of all the algorithms are good, and the lowest accuracy recorded is 0.942 (Table 5.11). The Bagged Trees and the Random Forests have the highest accuracy, approximately 0.96. It was also noticed that when the higher percentile values (the 90th and 95th percentiles) were chosen to dichotomize the target, cross-validation did not contribute greatly to the variation in accuracy: all of the machine learning algorithms produced analogous results for any of the chosen cross-validation techniques.

Table 5.12 shows the confusion matrices for the Bagged Trees and the Random Forests for one fold out of the stratified 10-fold cross-validation. The accuracies for the Bagged Trees and the Random Forests are (1873 + 74425)/(1873 + 74425 + 2094 + 1087) = 0.959 and (1868 + 74421)/(1868 + 74421 + 2099 + 1091) = 0.959, respectively.

Table 5.12: Confusion matrices for 95th percentile classification for one fold within stratified 10-fold cross-validation.

(a) Bagged Trees

Predicted \ Actual    High     Low
High                  1873     2094
Low                   1087    74425

(b) Random Forests

Predicted \ Actual    High     Low
High                  1868     2099
Low                   1091    74421

The classes for the 95th percentile classification were unbalanced, so ROC AUC scores were used to assess the models' performance. The ROC AUC scores for the various machine learning classification techniques are shown in Figure 5.10. Both the Random Forests and the Bagged Trees have a comparatively good ROC AUC score of 0.979.

Figure 5.10: ROC AUC scores of classification when the target is dichotomized using the 95th percentile.

The top 5 important variables selected using Random Forests when the target was dichotomized using the 95th percentile are:

1. HomeHealthAidesHours
2. HoursOfInformalHelp5Weekdays
3. HoursOfInformalHelp2WeekendDays
4. NumberOfMedications
5. ToiletUse

These variables were selected on the basis of their importance weights from Random Forests.

Chapter 6

Discussion

The original data set was large, with 837,536 records and 423 attributes. It provided me with the chance to learn how to manage and analyze large data sets. The processing time for a 1-fold cross-validation program with the classification algorithms was 24 hours, and the processing time for 10-fold cross-validation was 10 days. To keep the runs time-efficient, parallel processing was used: the individual cores were only 2 GHz, but there were 40 of them with a maximum available RAM of 256 GB, which made it possible to get the results on time.
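The research ran separate program chunks in parallel across the 40 cores. As a minimal sketch of the same idea within a single script, fold-level parallelism can be expressed with joblib; this is an illustration of the approach, not necessarily the mechanism used here.

from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=5000, random_state=0)

def run_fold(train_idx, test_idx):
    # Fit and score one cross-validation fold independently.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    return accuracy_score(y[test_idx], clf.predict(X[test_idx]))

folds = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X))
scores = Parallel(n_jobs=-1)(delayed(run_fold)(tr, te) for tr, te in folds)
print("mean 10-fold accuracy:", sum(scores) / len(scores))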
The involvement of patient partners set a standard for the research. It was challenging to present machine learning results in a way that people without a statistical background could understand, but their participation made it easier to access and address real-life home care challenges.

The original data set does not include the response variable "Average Hours Per Day". It was calculated from the "Assessment Start Date", "Care End Date", and "Hours" columns of the Service data table. The total number of care days was calculated by subtracting the assessment start date from the care end date, and from this the average number of hours per day for the 21 days following the assessment start date was calculated. On the health care experts' recommendation, the three weeks following an assessment were identified as the crucial time period for any issues reported in the assessment to influence home care service use.
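A rough sketch of this derivation in pandas follows, under the assumption that hours are averaged over the care days falling in the 21-day window; the column names and the exact averaging rule are illustrative reconstructions, not the verbatim pipeline.

import pandas as pd

# Toy stand-in for the Service table fields described above.
service = pd.DataFrame({
    "AssessmentStartDate": pd.to_datetime(["2015-01-01", "2015-02-10"]),
    "CareEndDate":         pd.to_datetime(["2015-01-15", "2015-03-03"]),
    "Hours":               [28.0, 10.5],
})

care_days = (service["CareEndDate"]
             - service["AssessmentStartDate"]).dt.days
care_days = care_days.clip(upper=21)               # only the 3-week window
service["AverageHoursPerDay"] = service["Hours"] / care_days
print(service["AverageHoursPerDay"])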
Calculating the response "Average Hours Per Day" made it possible to identify the outliers. The majority of the "Average Hours Per Day" values fall within the range of the mean plus three standard deviations, i.e., µ + 3σ = 0.79 + 3 × 1.103 ≈ 4.09 hours. Hence, it was considered pragmatic to label the clients who, in the majority, were using few hours of home care services as "Low" users. However, the domain experts and the patient partners were more interested in classifying the clients who were using the home care services for long hours. The count of such clients was very small, and they were labelled as "High" users. This is why only two classes were formed. Furthermore, this was the initial stage of the research on this home care data; in the future, multi-class classification of the target can be considered, given adequate reasoning.

The R2 value of the multiple linear regression was 0.03, which is very low. This small R2 indicated that the data were not fitting the regression model well: the data points may have been far from the fitted line and not linear. The ridge regression algorithm also had a very small R2 (0.04); since ridge is an extension of the linear model, the data still did not fit well even after penalizing the coefficients. The decision tree's variable splitting was also ineffective, yielding a small R2 of 0.05. However, the Random Forests and the Bagged Trees performed well, with R2 values of 53 and 52 percent, respectively. This implies that a single decision tree was not able to explain the total variance of the regression model. Random Forests and Bagged Trees, on the other hand, rely on the performance of many small decision tree models and thus provided comparatively good R2 values. Therefore, more than 50% of the variance in the response variable was explained by the regression model when Random Forests and Bagged Trees were used.

Initially, the KNN algorithm was applied for classification. But KNN suffers from the curse of dimensionality and is not computationally efficient when the data are large. KNN with Euclidean distance, for K values from 1 to 9 and 500 data rows, took around 20 minutes to predict; that is, KNN took 0.04 minutes per row. Scaling up to all 837,536 rows would take 33,501.44 minutes, i.e., 23.25 days, and that is just for one fold of the cross-validation. For this research, 10-fold cross-validation was used; to run a 10-fold CV, KNN would take 232.64 days, i.e., 7.75 months, which is not reasonably time-efficient. The KNN algorithm does not scale effectively to massive or high-dimensional data sets because it is a distance-based algorithm: when the data size is large, the effort of estimating the distance between a new point and every pre-existing point is very large, which lowers the algorithm's efficiency [Jain, 2021].

This research project is an example of patient-oriented research, so the involvement of the patient partners is of paramount importance. They were vital contributors in the selection of appropriate research questions, project design, and analysis of results. Patient partners can also help uncover common threads and relevant topics by examining narratives. This research took care to encourage and record the outcomes that patients desired. As the project continued, there were monthly meetings with researchers and patient partners where the results of the applied approaches were presented, discussed, and reviewed. The involvement of the patient partners in informing the research was indispensable: it helped in checking the results of the applied algorithms and in judging whether the results were promising. Their participation gave the research a decisive and experienced direction.

At this point, I would like to review and discuss the core research questions. My first research question is about finding a promising regression method that can predict the average number of hours per day of home care. After setting the target to the average number of hours per day of service use for the 21 days after a home care assessment, different regression algorithms were applied. The process started with the multiple linear regression model, followed by the regularisation methods of ridge and lasso; the R2 values for these models were very low. A decision tree for regression was also used, but analysing a regression tree of large depth was very complex: with 61 variables, it consisted of a large number of decision and leaf nodes, and inspecting these nodes one by one was impractical. The R2 value for the decision tree was also small. Finally, the ensemble methods were employed to predict the target. The Bagged Trees and the Random Forests gave comparatively good R2 values. A regression model with Bagged Trees or Random Forests can be used to predict the average number of hours a patient needs in home care.

My second research question is about finding good classification methods to classify and dichotomize the number of hours per day on average a client needs in home care. The target "Average Hours Per Day" was dichotomized into classes using the mean, the median, and the higher percentile values, i.e., the 75th, 90th, and 95th percentiles. To perform this binary classification, the KNN algorithm was initially used with a small amount of data (only 1000 rows); when the data size was increased, KNN was not computationally efficient, so it was dropped from the analysis. Therefore, other classification algorithms were used, namely logistic regression and decision trees. Since the data were imbalanced, three different cross-validation techniques were used: 10-fold cross-validation, shuffle-split cross-validation, and stratified 10-fold cross-validation.
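A minimal sketch comparing these three cross-validation schemes on an imbalanced synthetic problem (the class weights are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (KFold, ShuffleSplit, StratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=2000, weights=[0.75, 0.25], random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

for cv in (KFold(n_splits=10, shuffle=True, random_state=0),
           ShuffleSplit(n_splits=10, test_size=0.1, random_state=0),
           StratifiedKFold(n_splits=10, shuffle=True, random_state=0)):
    acc = cross_val_score(clf, X, y, cv=cv).mean()
    print(type(cv).__name__, "mean accuracy:", round(acc, 3))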
Visualising a decision tree of large depth with 61 variables was also very complex for classification, since it consisted of a large number of decision and leaf nodes. Ensemble methods were therefore adopted, and they performed better. The Random Forests and Bagged Trees had good accuracy and ROC AUC scores for all classification models, and their confusion matrix results were also good, with low type I and type II errors. Therefore, it is possible to classify the average number of hours per day of home care services the patients are using on the basis of pragmatic divisions like the mean, median, and 75th, 90th, and 95th percentiles. I was able to find promising classification procedures with good accuracy and ROC AUC scores based on these divisions.

Finding the features that are significant out of the 423 given variables is my third and last research question. The given data set was provided with 423 variables, meaning the data had moderate dimensionality, with a large number of features. It was of utmost importance to find the features that affect the response variable (Average Hours Per Day) greatly, and different machine learning techniques were used to select them. Since 132 of the 423 variables were just descriptions of other variables, these were dropped, leaving 291 features to work on. The first method was to find features with low correlation among themselves by selecting a single variable from each group of highly correlated variables; this provided independent features, and those independent features that possessed a high correlation with the response were kept. It helped me to select 27 features. The second method for feature selection was the recursive feature elimination technique, employed with Random Forests; the reason for pairing RFE with Random Forests is that the regression and classification results were comparatively better with ensemble methods. RFE uses the weights of the trained algorithm (coefficients, or feature importances in the case of Random Forests) to sort the variables, and it further brought the number of variables down from 291: through this process, 53 variables were selected. The third technique used to select features out of the 291 variables was the LASSO regularisation method, which penalizes the regression coefficients according to the tuning parameter; this method helped to select three variables. So, with the help of these three techniques, the variable selection was done. Further, these 27, 53, and 3 variables (the majority of the entries selected through high correlation and lasso were common with the RFE-selected variables) were presented to the domain experts and patient partners. On the basis of realistic health situations, the selected features were cross-verified and reviewed by the experts. The whole process allowed me to select 61 features out of the 423 initial attributes. These variables can be plugged into the trained regression and classification models to predict the number of hours of home care a client will require, on average, in the near future.
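A minimal sketch of the correlation screening described above: drop one member of each highly correlated predictor pair, then keep the remaining predictors that correlate with the response. The 0.9 and 0.3 thresholds are illustrative assumptions, not the values used in this research.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 6)), columns=list("abcdef"))
X["b"] = X["a"] * 0.95 + rng.normal(scale=0.1, size=500)  # redundant pair (a, b)
y = X["a"] + X["c"] + rng.normal(size=500)

# Step 1: drop one variable from every highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
independent = X.drop(columns=drop)

# Step 2: keep independent variables that correlate with the response.
keep = [c for c in independent.columns if abs(independent[c].corr(y)) > 0.3]
print("dropped:", drop, "| selected:", keep)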
While working on this research, the following limitations were noticed:

• The best R2 value for the regression model was 53%. This implies that 53% of the variance of the response variable (Average Hours Per Day) was explained by the predictor variables, which is not extremely high. This could possibly be improved by tuning the regression methodology, but it could also mean the provided variables were not sufficient to explain all of the variance of the target. Including more variables, such as ethnicity, hereditary conditions, and the medical history of the clients, could possibly improve the R2 value. Unfortunately, many possible variables like these were not part of our data set or, in some cases, are not available at all in Canadian health data.

• The classes of the response variable were very imbalanced. Using accuracy to evaluate the performance of the classification models was unwise, since the accuracy became saturated irrespective of the applied cross-validation technique. The ROC AUC score was the answer to this issue: it measures the area under the ROC curve across different probability thresholds, guarding against over-fitting and model saturation.

Chapter 7

Conclusion

In this research, ensemble methods were found to be the most promising for both regression and classification of home care service usage. When the mean was used to split the target, the Random Forests performed well with 10-fold cross-validation, giving an accuracy of 84.30%. Furthermore, when the 90th and 95th percentiles were used as the basis for classification, the accuracy of both the Bagged Trees and the Random Forests, irrespective of the cross-validation technique, reached up to 92% and 96%, respectively, and the ROC AUC scores of these classifiers were recorded as 0.96 and 0.97. Therefore, when the higher quantile/percentile values were used to classify the target of the interRAI home care assessment (RAI-HC) data, the performance of the machine learning classifiers increased. For regression, R2 and MSE were used to check the performance of the models; the R2 values for Random Forests and Bagged Trees were estimated as 53% and 52%, respectively. Random Forests and Bagged Trees were found to be promising for both classification and regression.

The thesis consists of three research questions. The first question is: What is a promising method to predict the number of hours per day on average a client needs in home care? The Random Forests and the Bagged Trees were found to be promising machine learning methods for this prediction, with R2 values of 0.530 and 0.526, respectively. The R2 for boosting was 0.187, which was comparatively low.

The second research question is: What are good classification methods to classify and dichotomize the number of hours per day on average a patient needs in home care? For classification, the Bagged Trees and Random Forests were found to be good choices to classify the target, Average Hours Per Day, when it was dichotomized using the mean, the median, the 75th percentile, the 90th percentile, and the 95th percentile. The highest ROC AUC and accuracy were 0.97 and 0.96, respectively.

The third and last research question is: What are the significant features for predicting a client's usage of home care services in the near future? The answer is that 61 significant features were selected through the application of three methods, namely Recursive Feature Elimination with Random Forests, high correlation with the response, and the regularisation method of LASSO.
The eleven most important features are:

1. HomeHealthAidesHours
2. HoursOfInformalHelp5Weekdays
3. HoursOfInformalHelp2WeekendDays
4. NumberOfMedications
5. HomeHealthAidesDays
6. DressingLoweBody
7. PersonalHygiene
8. BladderContinence
9. Bathing
10. ModeOfLocomotionOutdoors
11. FallsFrequency

The 95th percentile of the response variable is approximately two hours per day. In other words, a large majority of clients in home care receive at most two hours of services on average per day. This information can be highly useful for policymakers in allocating the available resources for the sustainable planning of home care, and it can also help patient partners and clients in home care to schedule their available hours effectively. With more refinement, domain experts and health researchers in home care can use models like the ones in this research to predict and classify the usage of hours by clients. Improvement of this research may provide the option (within some confidence interval) to identify a client who shows the traits of high fall frequency and a large number of medications as someone who may need more attention in home care. Similarly, if a client uses services like dressing the lower body, personal hygiene, and bathing, and has a history of using an ample amount of home health aide hours and days, it could be time for health care authorities to shift the client from home care to long-term care.

In future work, multi-class classification can be done to classify the clients. Adding other variables, such as ethnicity, hereditary conditions, and the medical history of the clients, could improve the predictions of home care service usage, and under-sampling or over-sampling techniques could be used to overcome the class imbalance issue. Along with this, neural network techniques may be of good use for the regression and classification tasks, to develop improved models. For health care experts and researchers, this research is a vital step forward, and I hope it can be used as a basis to develop other models and to do future research on home care or health care.

Bibliography

Sabyasachi Dash, Sushil Kumar Shakyawar, Mohit Sharma, and Sandeep Kaushik. Big data in healthcare: management, analysis and future prospects. Journal of Big Data, 6(1), 2019. ISSN 2196-1115. doi: 10.1186/s40537-019-0217-0. URL http://dx.doi.org/10.1186/s40537-019-0217-0.

Canadian Institute for Health Information. Seniors in Transition: Exploring Pathways Across the Care Continuum. 2017. URL https://www.cihi.ca/sites/default/files/document/seniors-in-transition-report-2017-en.pdf.

Heather Gilmour. Formal home care use in Canada. Technical Report 82-003-X, 09 2019. URL https://www150.statcan.gc.ca/n1/en/pub/82-003-x/2018009/article/00001-eng.pdf?st=D7muwQn8.

Canadian Institutes of Health Research. Strategy for Patient-Oriented Research: Patient Engagement Framework. 2019. URL https://cihr-irsc.gc.ca/e/48413.html.

Ministry of Health. Home Care - Province of British Columbia, 12 2017. URL https://www2.gov.bc.ca/gov/content/family-social-supports/seniors/health-safety/health-care-programs-and-services/home-care.

Mu Zhu, Lu Cheng, Joshua J. Armstrong, Jeff W. Poss, John P. Hirdes, and Paul Stolee. Using Machine Learning to Plan Rehabilitation for Home Care Clients: Beyond "Black-Box" Predictions, pages 181–207. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. ISBN 978-3-642-40017-9. doi: 10.1007/978-3-642-40017-9_9. URL https://doi.org/10.1007/978-3-642-40017-9_9.

Lu Cheng, Mu Zhu, Jeffrey W. Poss, John P. Hirdes, Christine Glenny, and Paul Stolee.
Opinion versus practice regarding the use of rehabilitation services in home care: an investigation using machine learning algorithms. BMC Medical Informatics and Decision Making, 15(1):80, 2015. ISSN 1472-6947. doi: 10.1186/s12911-015-0203-1. URL http://dx.doi.org/10.1186/s12911-015-0203-1.

Aaron Jones, Andrew P. Costa, Angelina Pesevski, and Paul D. McNicholas. Predicting hospital and emergency department utilization among community-dwelling older adults: Statistical and machine learning approaches. PLOS ONE, 13(11):e0206662, 2018. doi: 10.1371/journal.pone.0206662.

Jacques-Henri Veyron, Patrick Friocourt, Olivier Jeanjean, Laurence Luquel, Nicolas Bonifas, Fabrice Denis, and Joël Belmin. Home care aides' observations and machine learning algorithms for the prediction of visits to emergency departments by older community-dwelling individuals receiving home care assistance: A proof of concept study. PLOS ONE, 14(8):e0220002, 2019. doi: 10.1371/journal.pone.0220002.

Roweida Mohammed, Jumanah Rawashdeh, and Malak Abdullah. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In 2020 11th International Conference on Information and Communication Systems (ICICS), pages 243–248, 2020. doi: 10.1109/ICICS49469.2020.239556.

Piyasak Jeatrakul and Kevin Wong. Comparing the performance of different neural networks for binary classification problems. In 2009 Eighth International Symposium on Natural Language Processing, pages 111–115, 2009. doi: 10.1109/SNLP.2009.5340935.

Deepak Jain. KNN: Failure cases, Limitations, and Strategy to Pick the Right K, 12 2021. URL https://levelup.gitconnected.com/knn-failure-cases-limitations-and-strategy-to-pick-right-k-45de1b986428.

Joseph Rocca. Ensemble methods: bagging, boosting and stacking - Towards Data Science, 12 2021. URL https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205.

Bovas Abraham and Johannes Ledolter. Introduction to Regression Models. 2006.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013. URL https://faculty.marshall.usc.edu/gareth-james/ISL/.

Cran.R. Residual diagnostics, 2021. URL https://cran.r-project.org/web/packages/olsrr/vignettes/residual_diagnostics.html.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

Abhigyan Singh. Cross-Validation Techniques - Geek Culture, 01 2022. URL https://medium.com/geekculture/cross-validation-techniques-33d389897878.

Monocasual Triangles. Percentiles and quantiles, 08 2015. URL https://www.internalpointers.com/post/percentiles-and-quantiles.

Benai Kumar. Imbalanced classification, Jul 2020. URL https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/.

Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

Buitinck Lars, Louppe Gilles, Blondel Mathieu, Pedregosa Fabian, Mueller Andreas, Grisel Olivier, Niculae Vlad, Prettenhofer Peter, Gramfort Alexandre, Grobler Jaques, Layton Robert, VanderPlas Jake, Joly Arnaud, and Holt Brian. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Sarang Narkhede.
Understanding AUC - ROC Curve - Towards Data Science, 03 2022. URL https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5.

Bekhruz Tuychiev. Powerful feature selection with recursive feature elimination (RFE) of sklearn, May 2021. URL https://towardsdatascience.com/powerful-feature-selection-with-recursive-feature-elimination-rfe-of-sklearn-23efb2cdb54e.

Appendix A

Program Code

Importing Libraries and Metrics

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
# KFold, ShuffleSplit, and StratifiedKFold are used below and must be imported.
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold
import sklearn.preprocessing as pp
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as tree
import sklearn.ensemble as en
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn import metrics

Cross-validation, Machine Learning, and Scoring

# X and y are the encoded design matrix and the dichotomized target
# prepared earlier in the pipeline.
k = 10

# 10-fold cross-validation
cv1 = KFold(n_splits=k, shuffle=True)
# Shuffle-split cross-validation
cv2 = ShuffleSplit(n_splits=k, test_size=1/k)
# Stratified 10-fold cross-validation
cv3 = StratifiedKFold(n_splits=k, shuffle=True)
cvs = [cv1, cv2, cv3]

# List of classifiers to compare.
clfs = [LogisticRegression(),
        DecisionTreeClassifier(max_depth=3),
        tree.DecisionTreeClassifier(),
        en.BaggingClassifier(n_estimators=500),
        en.RandomForestClassifier(n_estimators=500),
        en.AdaBoostClassifier(n_estimators=500),
        en.GradientBoostingClassifier(n_estimators=500)]

for cv in cvs:
    print(cv)
    for clf in clfs:
        scores = np.zeros(k)
        roc = np.zeros(k)
        i = 0
        j = 0
        for train_index, test_index in cv.split(X, y):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            scores[i] = accuracy_score(y_test, y_pred)
            roc[j] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
            i += 1
            j += 1
        # Accuracy scores per fold
        print(type(clf), "Scores", scores)
        # ROC AUC scores per fold
        print(type(clf), "roc", roc)
        # Mean accuracy score
        print(type(clf), "Mean_scores", np.mean(scores))
        # Mean ROC AUC score
        print(type(clf), "Mean_roc", np.mean(roc))
        print(type(clf), "St.Dev", np.std(scores))

Function for R2 Calculation

# R-square: 1 - RSS/TSS
def RSquare(y_true, y_pred):
    rss = ((y_true - y_pred) ** 2).sum()
    tss = ((y_true - y_true.mean()) ** 2).sum()
    res = 1 - (rss / tss)
    return res