THOMPSON RIVERS UNIVERSITY An Individual Participant Data Meta-Analysis of the Machine Learning Models in Predicting Customer Churn By Khanh Van Tran A GRADUATE PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in Data Science KAMLOOPS, BRITISH COLUMBIA August, 2024 SUPERVISOR Dr. Jabed Tomal Dr. Md Erfanul Hoque © Khanh Van Tran, 2024 ABSTRACT Predicting customer churn is a critical task for businesses aiming to retain customers and maintain profitability. This research adopts an individual participant data metaanalysis (IPD-MA) approach to evaluate the effectiveness of various machine learning models in predicting customer churn across multiple publicly available datasets. This methodology facilitates a robust comparison and validation of predictive models by integrating raw data from different studies. The study employs a two-stage approach: first, individual datasets are analyzed to obtain machine learning performance metrics; second, these aggregated metrics are combined using fixed-effect and random-effect meta-analysis models. The results reveal significant variability in model performance across different datasets, with ensemble methods like Catboost, Lightgbm, and Gradient Boosting consistently outperforming other models, achieving the highest average AUCs of 0.9036, 0.9000, and 0.8936, respectively. The study also highlights the importance of considering dataset-specific characteristics and model capabilities, as well as the necessity of accounting for heterogeneity in meta-analyses. This research makes several key contributions, including methodological advancements in applying IPD-MA to machine learning, and a comprehensive evaluation of model performance. The findings offer a valuable reference for selecting and optimizing machine learning models in various industrial applications, guiding future research and practical implementations. Key Words: Customer Churn, Machine Learning, Meta-Analysis, Individual Participant Data, Fixed-Effect Model, Random-Effect Model. ii ACKNOWLEDGEMENTS Firstly, I would like to express my sincere gratitude to my two supervisors, Dr. Jabed Tomal and Dr. Md Erfanul Hoque for teaching, supervising, guiding, and mentoring me in the entire process of my graduation project, which indeed has helped me to shape my research properly and overcome lots of obstacles. An additional warm thank you goes to the instructors in the MSc Data Science program at Thompson Rivers University: Dr. Roger Yu, Dr. Mohamed Tawhid, Dr. Becky Wei Lin, Dr. Mateen Shaikh, and Dr. Mila Kwiatkowska. And I’d like to thank my family for the love and support they’ve shown me every step of the way. Their unwavering faith in me helped carry me through all this. iii Contents 1 Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Significance of The Study . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.6 Structure of The Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 The Importance of Customer Churn Prediction . . . . . . . . . . . . . . 5 2.2 Machine Learning in Churn Prediction . . . . . . . . . . . . . . . . . . . 6 2.3 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . 
. 8 2.3.1 Logistic Classification . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 The K-Nearest Neighbors (KNN) . . . . . . . . . . . . . . . . . . 9 iv CONTENTS 2.4 2.5 v 2.3.3 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . 10 2.3.4 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.5 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.6 Ensemble Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 12 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.1 Stratified K-Fold Cross-Validation . . . . . . . . . . . . . . . . . . 16 2.4.2 Model Performance Metrics . . . . . . . . . . . . . . . . . . . . . 16 Meta-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 Data 22 4 Methodology 27 4.1 Individual Dataset - Exploratory Data Analysis and Data Processing . . 28 4.1.1 Exploratory Data Analysis (EDA) . . . . . . . . . . . . . . . . . . 28 4.1.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Individual Dataset - Models Evaluation . . . . . . . . . . . . . . . . . . . 36 4.2.1 Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.2 Classifiers Performance Evaluation . . . . . . . . . . . . . . . . . 37 4.2.3 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . 37 4.3 Aggregate Effect Size and Standard Error . . . . . . . . . . . . . . . . . 38 4.4 Estimating Pooled Effect Size and Confidence Interval . . . . . . . . . . . 39 4.2 CONTENTS 4.5 4.6 vi 4.4.1 Fixed-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.4.2 Random-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . 41 Residual Heterogeneity Estimates . . . . . . . . . . . . . . . . . . . . . . 42 4.5.1 Test for Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.2 Fixed-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.3 Random-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . 44 Funnel Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5 Results 47 5.1 Aggregated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.1.1 Performance Metrics by Machine Learning Model . . . . . . . . . 47 5.1.2 Performance Metrics by Dataset . . . . . . . . . . . . . . . . . . . 48 Fixed- and Random-Effect Model Results . . . . . . . . . . . . . . . . . . 49 5.2.1 Meta-Analysis by Machine Learning Model . . . . . . . . . . . . . 50 5.2.2 Meta-Analysis by Dataset . . . . . . . . . . . . . . . . . . . . . . 51 Residual Heterogeneity Estimates . . . . . . . . . . . . . . . . . . . . . . 53 5.3.1 Residual Heterogeneity Estimates by Machine Learning Model . . 53 Funnel Plots Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.4.1 Funnel Plot Analysis by Machine Learning Model . . . . . . . . . 55 5.4.2 Funnel Plot Analysis by Dataset . . . . . . . . . . . . . . . . . . . 57 5.2 5.3 5.4 CONTENTS vii 5.5 58 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Discussion and Conclusion 6.1 6.2 67 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.1.1 Variability in Model Performance . . . . . . . . . . . . . . . . . . 67 6.1.2 Top Performing Models and Datasets . . . . . . . . . . . . . . . . 68 6.1.3 Fixed-Effect vs. Random-Effect Models . . . . . . . . . . . . . . . 69 6.1.4 Discussion on Funnel Plots Analysis . . . . . . . . . . . . . . . . . 
69 6.1.5 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.1.6 Implications for Practice . . . . . . . . . . . . . . . . . . . . . . . 72 6.1.7 Contributions of This Research . . . . . . . . . . . . . . . . . . . 73 6.1.8 Limitations and Future Research . . . . . . . . . . . . . . . . . . 74 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A Source Codes 80 List of Figures 2.1 Stratified 5-Fold Cross Validation. . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 The ROC-AUC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1 Two-stage IPD-MA Workflow . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2 Target Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Categorical Features Distribution in Credit Card Dataset. . . . . . . . . 31 4.4 Numerical Features Distribution in Bank Dataset. . . . . . . . . . . . . . 32 4.5 Correlations Among Numerical Features in Telcom Churn Dataset. . . . 33 5.1 Fixed-Effect Model Forest Plot by ML. . . . . . . . . . . . . . . . . . . . 50 5.2 Random-Effect Model Forest Plot by ML. . . . . . . . . . . . . . . . . . 51 5.3 Fixed-Effect Model Forest Plot by Dataset. . . . . . . . . . . . . . . . . . 52 5.4 Random-Effect Model Forest Plot by Dataset. . . . . . . . . . . . . . . . 53 5.5 Fixed-Effect Model Funnel Plot by Machine Learning Model. . . . . . . . 55 viii LIST OF FIGURES ix 5.6 Random-Effect Model Funnel Plot by Machine Learning Model. . . . . . 56 5.7 Fixed-Effect Model Funnel Plot by Dataset. . . . . . . . . . . . . . . . . 57 5.8 Random-Effect Model Funnel Plot by Dataset. . . . . . . . . . . . . . . . 58 5.9 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Bank Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.10 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Credit Card Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.11 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for E-Commerce Dataset. . . . . . . . . . . . . . . . . . . . . . . . . 61 5.12 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Employee Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.13 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Internet Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.14 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Cell2Cell dataset. . . . . . . . . . . . . . . . . . . . . 63 5.15 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Membership dataset. . . . . . . . . . . . . . . . . . . . 64 5.16 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Nigeria Telecom dataset. . . . . . . . . . . . . . . . . 64 5.17 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the SA Wireless dataset. . . . . . . . . . . . . . . . . . . . 65 5.18 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telco Europa Dataset. . . . . . . . . . . . . . . . . . . . . . . . . 65 LIST OF FIGURES x 5.19 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telecom Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 List of Tables 3.1 Summary of Churn Datasets . . . . . . . . . . . . . . . . . . 
. . . . . . . 22 4.1 Summary of Internet Churn Dataset . . . . . . . . . . . . . . . . . . . . 29 5.1 Mean AUC, Standard Deviation, and Standard Error for Each Model . . 48 5.2 Mean AUC, Standard Deviation, and Standard Error for Each Dataset . 49 xi Chapter 1 Introduction 1.1 Background and Motivation In today’s fast-paced business landscape companies are increasingly turning to data driven strategies to maintain their edge. Customer churn is when a customer stops purchasing products and services from a company. Customers are hard to acquire and might be very expensive for an organization. If a customer does not use your firm’s products and services in the long run, it will inevitably lead to the bad fate of an organization. This is why customer retention is also as important as customer acquisition. Without appropriate attention to customer churn management, it can result in heavy monetary losses, reduced profitability, and loss of visibility in the market. Therefore, organizations started relying on predictive analytics or machine learning approaches to predict or avoid customer churn. Machine learning (ML) technologies have revolutionized approaches, providing new ways to forecast and analyze customer churn. ML models use algorithms to analyze datasets and uncover patterns that may not be evident using traditional statistical methods. However, the effectiveness of these models can vary depending on the characteristics and industry specifics. 1 Given the availability of datasets and diverse ML techniques, there is a strong demand to combine these resources to evaluate the performance of these models thoroughly. This project utilizes an individual participant data meta-analysis a statistical technique that reanalyzes raw data from multiple studies to assess how reliable and effective ML models in predicting customer churn across different datasets and industries. 1.2 Problem Statement Despite the critical insights provided by individual studies on ML models for churn prediction, there is substantial variability in their outcomes. This variation may be influenced by factors such as dataset characteristics, industry-specific dynamics, and methodological approaches. This research aims to mitigate these inconsistencies through an IPD-MA, offering a deeper and more accurate understanding of how different models perform across varied contexts. By integrating data from multiple sources, this study seeks to establish a clearer understanding of the generalizability and limitations of current machine learning approaches to churn prediction. 1.3 Research Objectives This study is structured around several key objectives designed to address the complexities of machine learning in churn prediction: 1. Comprehensive Model Evaluation: To quantitatively assess the performance metrics of various machine learning models across 11 public datasets in predicting customer churn. 2. Cross-industry Applicability: To investigate the efficacy of these models across different industry sectors, identifying how contextual variables influence predictive success. 2 3. Feature Relevance Analysis: To explore which variables are consistently significant predictors of churn across multiple models and datasets, highlighting critical factors influencing customer retention. 1.4 Research Questions 1. Which machine learning models most accurately predict customer churn across diverse datasets? 2. Are there universally significant predictors of churn, or do these vary significantly across different models and datasets? 3. 
To what extent do machine learning models maintain their predictive accuracy across various industry sectors? 4. How can machine learning models be optimized to improve churn prediction based on the insights gained from the meta-analysis? 1.5 Significance of The Study The significance of this study lies in its potential to: • Bridge Knowledge Gaps: By providing an analysis of machine learning models across multiple datasets, this study aims to fill existing gaps in the literature regarding the comparative effectiveness and limitations of these models. • Enhance Business Strategies: The findings will aid businesses in refining their customer retention strategies by adopting the most effective predictive models tailored to their specific industry conditions. 3 • Contribute to Academic Knowledge: This research contributes to the academic field by demonstrating the application of IPD-MA in machine learning, potentially setting a precedent for future research methodologies. 1.6 Structure of The Project This research project is structured as follows: • Chapter 2: Background - This chapter comprehensively reviews existing studies on machine learning models for churn prediction, outlining theoretical underpinnings and methodological advancements. • Chapter 3: Data - Describes data sources used in the study. • Chapter 4: Methodology - Describes the meta-analytical approach, model selection, and statistical methods employed in the study. • Chapter 5: Results - Analyze the results and discuss the implications of findings in the context of existing research and industry practices. • Chapter 6: Discussion and Conclusion - discuss and conclude with a summary of key findings, discuss limitations, and suggest directions for future research. 4 Chapter 2 Background This chapter explains the importance of customer churn prediction. It also provides a summary of the literature review about the application of machine learning in the area of churn prediction. In addition, a brief description of the machine learning algorithms and meta-analysis techniques is also included. 2.1 The Importance of Customer Churn Prediction Predicting customer churn plays a crucial role in Customer Relationship Management (CRM). The main goal of the CRM is to build and sustain lasting relationships with customers. This approach is important in many sectors, particularly to subscriptionbased service firms, such as those in telecommunications, insurance, banking, and online services [1]. Given that such companies rely on consistent and regular membership fees, it is imperative to curb customer switching behavior to maintain sustainable profits. Consequently, the precise prediction of customers likely to churn has become a paramount objective in the industry. 5 2.2 Machine Learning in Churn Prediction Churn prediction study has gained popularity in recent years, due to its substantial impact on corporations and businesses. Machine learning has been widely studied and applied in churn prediction across several industries such as telecommunications, human resources, finance, and online services. Several case studies have been published in the telecommunications industry. Lalwani, Mishra, Chadha, et al. [2] utilized a range of predictive models, including logistic regression, Naive Bayes, support vector machine, random forest, decision trees, boosting, and ensemble approaches, to forecast customer turnover in the telecom sector. K-fold cross-validation was utilized for hyperparameter tuning and to mitigate overfitting. 
The Adaboost and XGBoost classifiers demonstrated the best accuracies among the built models, with 81.71% and 80.8%, respectively. Ahmad, Jafar, and Aljoumaa [3] created a churn prediction model utilizing machine learning techniques on a big data platform to help telecom operators forecast which customers are likely to churn. The dataset was acquired from SyriaTel telecommunications firm. The paper shows that XGBoost algorithm achieved the most optimal outcomes for predicting churn. In addition, Ullah, Raza, Malik, et al. [4] introduced a churn prediction model that utilizes classification and clustering approaches to identify consumers who are likely to churn and to understand the underlying variables contributing to customer turnover in the telecommunications industry. The model employed the Random Forest technique to classify churn and leveraged cosine similarity to categorize churn clients. The process involved the utilization of feature selection techniques and an attribute-selected classifier algorithm to detect and determine the key elements contributing to churn. The assessment of the suggested approach showcased greater churn categorization and client profiling, facilitating customer retention tactics, promotion suggestions, and improved marketing campaigns. A case study by Qureshi, Rehman, Qamar, et al. [5] on churn prediction in the mobile communication market. It describes the use of regression analysis, decision trees, artificial neural networks, and logistic regression to predict potential churners. The study used a dataset 6 from the Customer DNA website, where usage data for 106,000 customers was provided over three months, along with total usage by the customers. Dealing with the problem of class imbalance in the dataset, the authors check how the different machine learning algorithms performed regarding real usage by the customer, while demonstrating that the decision trees were the best classifiers for identifying potential churners. In the human resource (HR) management sector, Sisodia, Vishwakarma, and Pujahari [6] built a model for predicting employee churn rate based on HR analytics dataset, using five different machine learning algorithms, namely, linear support vector machine, decision tree classifier, random forest, k-nearest neighbor, and naı̈ve bayes classifier. The authors evaluated the correlation between attributes, generated a histogram to contrast left employees with various factors, and proposed strategies to optimize employee attrition in organizations. Rahman and Kumar [7]’s paper focuses on predicting customer churn in a commercial bank using efficient data mining methods. It discusses data transformation and classification techniques such as k-nearest neighbor, support vector machine, decision tree, and random forest. The paper results show that oversampling improves the accuracy of decision trees and random forest classifiers, while support vector machine is not suitable for large amounts of data. The study also analyzes customer behavior to explore the likelihood of churn and compares the performance of different models, finding that the Random Forest model after oversampling achieves higher accuracy. Several published papers also try to provide a generalizability of ML in churn prediction. Garcı́a, Nebot, and Vellido [8] suggested a more comprehensive literature review. The authors explain several phases involved in churn prediction analysis, including data collecting, feature selection, model implementation, and assessment techniques and metrics. 
Their survey closes with suggestions derived from the existing body of information. Geiler, Affeldt, and Nadif [1] focused on churn prediction in businesses and explored the performance of various supervised and semi-supervised learning methods and sampling approaches on publicly available datasets. The study suggests an ensemble approach should be used for churn prediction. Given the diversity of methods and the varying results across different studies, there is a compelling need for a meta-analytical approach to churn prediction. Individual Participant Data Meta-Analysis stands out as an advantageous method for this purpose. By integrating data from multiple studies, IPD-MA allows for a more comprehensive analysis and interpretation of churn prediction techniques across different datasets and industries. Furthermore, IPD-MA supports the application of uniform statistical methods across different studies, providing a more reliable and consistent framework for evaluating the efficacy of machine learning techniques in churn prediction. This approach is particularly valuable in the context of churn prediction, where variations in industry-specific factors and model implementations can significantly influence the effectiveness of predictive strategies.

2.3 Machine Learning Algorithms

2.3.1 Logistic Classification

Logistic regression is among the simplest machine learning algorithms for binary as well as multiclass classification. It estimates the probability that an instance belongs to one of the classes as a function of the input features using the logistic function (Eq. 2.1).

P(y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}}   (2.1)

where:
• P(y = 1 \mid x) is the probability that the outcome y is 1 given the input features x.
• x = (x_1, x_2, \dots, x_k) is the vector of input features.
• \beta_0, \beta_1, \beta_2, \dots, \beta_k are the model parameters.
• e is the base of the natural logarithm.

During training, the model parameters are optimized using techniques such as maximum likelihood estimation and gradient descent [9][10]. Logistic regression is computationally efficient and highly interpretable.

2.3.2 The K-Nearest Neighbors (KNN)

KNN is a machine learning method used for both regression and classification tasks. KNN uses distance measurements to predict the most probable value of the target feature. The Euclidean distance is often used in KNN and is calculated using Equation 2.2 [9]. In classification, the predicted class is the most frequent class among the k neighbors. The optimal number of neighbors k is identified by cross-validation. The method is non-parametric: it classifies based solely on stored examples, using the existing data without building a generalized model. It is called a lazy learning algorithm because its computations are deferred until query time, so it does not require a separate training phase. It is effective with low-dimensional data but less so with high-dimensional data; for high-dimensional data, principal component analysis may be applied before implementing KNN.

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}   (2.2)

where:
• d(x, y) is the Euclidean distance between points x and y.
• \sum_{i=1}^{n} denotes the summation over all dimensions from 1 to n.
• x_i and y_i are the coordinates of points x and y in the i-th dimension, respectively.
• (x_i - y_i)^2 is the squared difference between the i-th coordinates of the two points.
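To make the two classifiers above concrete, the following minimal sketch (not the pipeline used in this project) fits scikit-learn's LogisticRegression and KNeighborsClassifier on a synthetic, imbalanced stand-in for a churn dataset and compares their held-out ROC-AUC; all names and settings here are illustrative assumptions.

# Minimal sketch: logistic regression vs. KNN on a synthetic churn-like dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: 1,000 customers, 10 numeric features, roughly 20% churners.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test ROC-AUC = {auc:.3f}")

Standardization is included in both pipelines because KNN's distance computations are sensitive to feature scales.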
2.3.3 Support Vector Machines (SVM)

SVM is a supervised machine learning technique that has been widely used for classification purposes. It searches for the hyperplane that correctly divides the majority of the training data into two classes, while possibly misclassifying a small number of observations. The optimal hyperplane is the solution to the following optimization problem [9].

\underset{\beta_0, \beta_1, \dots, \beta_p,\ \epsilon_1, \dots, \epsilon_n,\ M}{\text{maximize}} \; M   (2.3)

subject to

\sum_{j=1}^{p} \beta_j^2 = 1   (2.4)

y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i)   (2.5)

\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C   (2.6)

where:
• M is the width of the margin.
• \beta_0, \beta_1, \dots, \beta_p are the coefficients of the hyperplane.
• \epsilon_1, \dots, \epsilon_n are slack variables that allow individual observations to fall on the wrong side of the margin or the hyperplane.
• x_1, \dots, x_n is the set of n training observations.
• y_1, \dots, y_n are the associated class labels.
• C is a nonnegative tuning parameter.

SVM can perform both linear and nonlinear classification. It can find a nonlinear separator by transforming the data into a higher-dimensional space and then applying linear classification there, with the support vectors defining the decision boundary or hyperplane. The SVM algorithm can use kernel functions, such as the linear, polynomial, and radial basis function (RBF) kernels, to handle both linearly and nonlinearly separable data [10].

2.3.4 Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem. The "naive" part comes from the assumption that the features of an observation are conditionally independent given the class label. It is stated mathematically as Equation 2.7 [9]. Because it is a probabilistic classifier, it can produce predictions over multiple classes, and the decision is made based on the conditional (posterior) probability.

\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_{k1}(x_1) \, f_{k2}(x_2) \cdots f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \, f_{l1}(x_1) \, f_{l2}(x_2) \cdots f_{lp}(x_p)}, \quad k = 1, \dots, K   (2.7)

where:
• \Pr(Y = k \mid X = x) is the posterior probability that an observation X belongs to the k-th class.
• \pi_k is the prior probability that a randomly chosen observation comes from the k-th class.
• f_{kj} is the density function of the j-th predictor among observations in the k-th class.

2.3.5 Decision Trees

Decision trees are powerful and versatile techniques used for classification and regression, commonly referred to as the Classification and Regression Trees (CART) method. The method is capable of handling data with large dimensionality and can process both numerical and categorical data. It operates by recursively partitioning the dataset into subsets based on the most informative features [9]. Each internal node of the tree represents a decision point where a feature is evaluated, and each leaf node corresponds to a class label or a regression value. Decision trees are constructed using criteria such as Gini impurity or information gain to optimize the splitting process [10]. They are interpretable models that facilitate human understanding of decision-making processes in complex datasets.

G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2   (2.8)

where:
• G_i is the Gini score of the i-th node.
• p_{i,k} is the ratio of training instances of class k in the i-th node.

2.3.6 Ensemble Classifiers

Ensemble methods can help to minimize the variance and bias of separate models or classifiers by merging their predictions, resulting in more accurate and reliable models [11]. There are three commonly used ensemble learning methods, namely bagging, boosting, and stacking, which can be utilized to enhance the machine learning process.
In this section, we will delve into the details of each method, including its working nature, characteristics regarding data generation, training of baseline classifiers, and suitable fusion methods. We will also cover the advantages, disadvantages, and implementation challenges associated with each method.

Bagging

The bagging method, also referred to as bootstrap aggregating, is a data-driven algorithm that involves creating multiple subsets of data from the original dataset [12]. Bagging aims to generate diverse predictive models by adjusting the distribution of the training datasets, since even small changes in the training data set can lead to significant changes in the model predictions. Bagging reduces variance, mitigates overfitting, and performs well on high-dimensional data. However, it also has some drawbacks, such as being computationally expensive, having high bias, and reducing the interpretability of models [13]. The Random Forests algorithm is a notable example of the bagging technique [14]. Implementing the bagging method presents several challenges, such as determining the optimal number of base learners and subsets, the maximum number of bootstrap samples per subset, and the fusion method for integrating the outputs of the base classifiers using various voting methods. In summary, the bagging method employs parallel ensemble techniques with no data dependency, and the fusion methods depend on different voting methods to generate predictions. This approach generates B different bootstrapped training data sets. The algorithm is then trained on the b-th bootstrapped training set to predict a point x. The bagging function is:

f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)   (2.9)

where f_b(x) is a weak learner.

Boosting

The boosting approach is a sequential procedure in which each succeeding model attempts to correct the prior model's errors [15]. Boosting employs multiple weak learners in a highly adaptive way, where each model in the sequence is fitted while giving more importance to observations that the previous models in the sequence handled poorly. Boosting can be used for both regression and classification problems and includes algorithms such as Adaptive Boosting (AdaBoost), Stochastic Gradient Boosting (SGB), and Extreme Gradient Boosting (XGB) [16]–[18]. Several studies have utilized various types of boosting, including AdaBoost for noise detection and speech feature extraction, and XGB for fake news classification [19]–[21]. Boosting provides interpretability and helps reduce variance and bias in machine learning ensembles. However, the drawback of boosting is that each classifier must correct errors in its predecessors, and scaling sequential training can be challenging. Additionally, boosting is computationally costly, vulnerable to overfitting, and slower to train than bagging. To summarize, boosting is an ensemble learning technique that uses a sequential approach, where multiple learners learn sequentially with data dependency, and fusion methods rely on various voting methods. The boosting function is shown as follows:

f(x) = \sum_{i} \lambda_i \, g_i(x)   (2.10)

where several classifiers g_i(x) combine to create a strong classifier f(x). The inclusion of the shrinkage parameter \lambda in the model slows down the process, which enables a wider range of differently shaped trees to be utilized to address the residuals.

Stacking

The stacking method is a powerful model ensembling technique that combines multiple predictive models to generate a new model, also known as a meta-model [22].
The stacking model architecture consists of two or more base models (level 0 models) and a meta-model (level 1 model) that integrates the predictions of the base models. The base models are fit on the training data and their predictions are compiled, while the meta-model learns how to best combine these predictions. One of the key benefits of stacking is its ability to provide a deeper understanding of the data, leading to increased precision and effectiveness in predictions. However, a major challenge with stacking is overfitting, which can occur when there are many predictors that all predict the same target. Additionally, multi-level stacking can be both data- and time-intensive, as each layer adds multiple models. As the amount of available data grows exponentially, computation time complexity becomes an issue, and highly complex models may take months to run [23]. Another challenge with stacking is interpreting the final model, as well as identifying the appropriate number of baseline models that can be relied upon to generate better predictions from the dataset when designing a stacking ensemble from scratch. The problem of multi-label classification also poses issues such as overfitting and the curse of dimensionality due to the high dimensionality of the data [24]. To sum up, stacking is a parallel ensemble method that creates baseline learners simultaneously without any data dependency. The meta-learning method determines the fusion techniques. Although stacking can be very successful, it poses several challenges such as overfitting, complexity, and interpretability, which must be taken into account when using it. The stacking model is as follows:

f_s(x) = \sum_{i=1}^{n} a_i f_i(x)   (2.11)

Stacking takes the predictions from several models (f_1, f_2, \dots, f_n) to build a new model, and this new model is used to make predictions on the test dataset. Stacking aims to improve a model's predictive ability. To put it simply, stacking combines the predictions of multiple models by assigning a weight a_i to each prediction and adding them together in a linear combination.

2.4 Model Evaluation

2.4.1 Stratified K-Fold Cross-Validation

Since the datasets are highly skewed, stratified K-fold cross-validation is used to evaluate the ML models. In K-fold cross-validation, the original dataset is first divided into K folds of approximately equal size. The model is built on K−1 folds used as the training set and tested on the remaining K-th fold. This process is repeated K times, with each fold taking a turn as the test set, and the K performance estimates are then summarized. The major disadvantage of the plain K-fold design is that, when there is a severe class imbalance in the original data set, the size of one class in a training fold can be much smaller than its share in the original data set [25]. The consequence is that the performance outcome in the validation evaluation can be overly optimistic and in favor of the model. Therefore, stratified K-fold cross-validation is recommended in this case: each fold is stratified in the same way as the original data set, which ensures that the class distribution in every fold matches the class distribution of the original data set.
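As a minimal illustration of the stratified splitting described above (using synthetic data rather than any of the project's datasets), scikit-learn's StratifiedKFold can be used to verify that each fold preserves the overall class ratio:

# Minimal sketch: stratified 5-fold cross-validation preserves the class ratio in every fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for a churn dataset (about 20% churners).
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    churn_rate = y[test_idx].mean()  # stays close to the overall 20% in every fold
    print(f"fold {fold}: test size = {len(test_idx)}, churn rate = {churn_rate:.2f}")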
Figure 2.1: Stratified 5-Fold Cross-Validation (adapted from [25]).

2.4.2 Model Performance Metrics

The performance of the models developed in this study is assessed using the confusion matrix and its derived metrics, namely accuracy, precision, recall, specificity, and F1 score. Figure 2.2 shows the confusion matrix, with the actual and predicted outcome categories plotted against its axes (adapted from [26]).

Figure 2.2: Confusion matrix.

Accuracy

Equation 2.12 calculates accuracy as the proportion of correctly predicted classes over the total number of instances in the dataset [9]. Accuracy measures the overall performance of a model but can be misleading on imbalanced datasets.

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}   (2.12)

Precision

Precision shows the proportion of positive predictions that are truly positive [9]. It is the ratio of correctly positively labeled instances to all positively labeled instances and can be calculated as follows:

\text{Precision} = \frac{TP}{TP + FP}   (2.13)

Recall

Recall is also known as Sensitivity or the True Positive Rate. It measures the percentage of actual positive classes that are predicted as positive [9]. It can be calculated as follows:

\text{Recall} = \frac{TP}{TP + FN}   (2.14)

F1 Score

The F1 Score is a measure calculated from precision and recall (Equation 2.15). It is often preferred over accuracy when the data are unbalanced.

\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}   (2.15)

ROC-AUC

The ROC-AUC is an assessment metric for classification at various discrimination thresholds. The Receiver Operating Characteristic (ROC) is the probability curve, and AUC is the area under the ROC curve, which estimates the degree of separability. It indicates how well the model can distinguish the two classes: the higher the AUC, the better the model [9]. The ROC curve is obtained by plotting the True Positive Rate (TPR, Sensitivity) on the y-axis against the False Positive Rate (FPR, 1−Specificity) on the x-axis, and the AUC is the area under this curve (Fig. 2.3).

Figure 2.3: The ROC-AUC.

2.5 Meta-Analysis

Meta-analysis is a statistical method designed to synthesize and quantitatively analyze the results of multiple scientific studies that investigate the same research question. The term meta-analysis was first coined by statistician Gene V. Glass in the 1970s, who described it as the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings [27]. Glass's work laid the foundation for the development of meta-analysis methodologies, which were enhanced in the late 1970s and 1980s by researchers such as Thomas D. Cook, Donald T. Campbell, and Frank L. Schmidt. These researchers introduced advanced techniques for systematically combining findings, evaluating heterogeneity among study outcomes, and addressing potential biases [28]. One key aspect of meta-analysis is the pooling of effect sizes, which involves aggregating the effect sizes from different studies to obtain an overall effect size. This process is typically conducted using either a fixed-effect model or a random-effect model. The fixed-effect model assumes that there is one true effect size shared by all included studies, regardless of between-study variance. It uses the following formula to calculate the pooled effect size:

\theta = \frac{\sum_{i=1}^{n} w_i \, y_i}{\sum_{i=1}^{n} w_i}   (2.16)

where y_i is the effect size from study i and w_i is the weight of study i, typically the inverse of the variance of y_i.
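To illustrate the inverse-variance pooling in Equation 2.16, the short sketch below computes a fixed-effect pooled estimate and its 95% confidence interval from per-study effect sizes and standard errors; the numbers are made-up placeholders, not results from this study.

# Minimal sketch of fixed-effect (inverse-variance) pooling, following Eq. 2.16.
# The effect sizes and standard errors below are illustrative placeholders only.
import numpy as np

y = np.array([0.88, 0.91, 0.85])      # per-study effect sizes (e.g., mean AUCs)
se = np.array([0.020, 0.015, 0.030])  # per-study standard errors

w = 1.0 / se**2                        # weights: inverse of the variance of each y_i
theta = np.sum(w * y) / np.sum(w)      # pooled effect size (Eq. 2.16)
se_theta = np.sqrt(1.0 / np.sum(w))    # standard error of the pooled estimate
lower, upper = theta - 1.96 * se_theta, theta + 1.96 * se_theta

print(f"pooled effect = {theta:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")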
The random-effect model, on the other hand, assumes that the effect sizes vary between studies due to genuine diversity in study characteristics, and incorporates this heterogeneity into the estimation of the overall effect size. The estimate of the pooled effect size in a random-effect model is given by:

\theta = \frac{\sum_{i=1}^{n} w_i^* \, y_i}{\sum_{i=1}^{n} w_i^*}   (2.17)

where w_i^* = \frac{1}{\mathrm{var}(y_i) + \tau^2}, and \tau^2 (tau-squared) is the estimate of the between-study variance.

By the 1980s and 1990s, meta-analysis was widely accepted in various fields such as medicine, education, the social sciences, and ecology. Prominent organizations like the Cochrane Collaboration and the Campbell Collaboration have promoted systematic reviews and meta-analyses to inform evidence-based practices in healthcare and social policy. Despite its broad applications, traditional meta-analysis using published aggregate data can face limitations, particularly when addressing heterogeneity in treatment effects and effect modifiers. To overcome these, individual participant data meta-analysis is used, employing either a one-stage or a two-stage approach. The one-stage approach involves pooling raw data from all studies into a single model, treating it as if it originated from a single large study. This allows for complex statistical modeling of the data, including interactions between treatment effects and patient characteristics. In contrast, the two-stage approach first analyzes the data separately within each study to produce study-specific estimates and then combines these estimates in a second stage using meta-analytic techniques. Each method has its advantages and considerations, with the choice depending on the specific objectives and available data of the research. Numerous studies across diverse medical and public health domains have utilized IPD meta-analysis to refine treatment protocols and improve outcome predictions. In medical and clinical trials, hypertension trials often employ IPD to assess the effects of treatments on blood pressure, enabling more precise estimates of treatment effects and the exploration of patient-level moderators [29]. Similarly, in cancer treatments, IPD meta-analyses have provided in-depth insights into the effectiveness and patient outcomes of colorectal cancer treatments [30]. In the realm of psychiatric and psychological research, studies such as those on antidepressant efficacy have used IPD to gain a nuanced understanding of treatment responses across different patient subgroups, particularly in moderate depression [30]. Epidemiological studies, including research on the relationship between maternal age and the risk of type I diabetes, leverage IPD to analyze detailed risk factors [30]. In chronic disease research, the Chronic Kidney Disease Prognosis Consortium employs IPD meta-analyses to provide extensive data on disease progression and patient outcomes in chronic kidney disease [31]. Additionally, in public health and preventive medicine, IPD meta-analyses are instrumental in evaluating the effectiveness of interventions aimed at preventing cardiovascular diseases, helping tailor recommendations to individual patient characteristics [32]. These applications highlight the critical role of IPD meta-analyses in producing precise and patient-specific results across various health-related fields.

Chapter 3

Data

This study uses 11 publicly available churn datasets. Table 3.1 shows each dataset's name, source, number of records, and number of features.
Table 3.1: Summary of Churn Datasets

Dataset Names                    Sources          Entries   Features
Telecom Customer Churn           MavenAnalytics   7043      38
Internet Service Churn           Kaggle           72274     11
Bank Customer Churn Records     Kaggle            10000     18
Credit Card Churn                Kaggle           10127     21
E-commerce Churn                 Kaggle           5630      20
Employee Churn                   Kaggle           1070      35
Telco Europa                     Kaggle           190776    20
Telcom Cell2Cell                 Kaggle           71047     70
Membership Subscription          Kaggle           10362     15
Wireless Telecom South Asia      Kaggle           2000      14
Nigeria Telecoms Churn           Kaggle           1401      16

The first dataset examined is related to telecom customer churn. It can be found on the Maven Analytics website and consists of two CSV-formatted tables. The Customer Churn table contains details of 7,043 customers from a telecommunications company in California during one quarter of 2022. Each entry in this dataset corresponds to a customer and includes information about their demographics, location, tenure, services subscribed to, and customer status, among other relevant details. Additionally, the Zip Code Population table offers population estimates for the California zip codes mentioned in the Customer Churn table.

Another dataset used for analysis is the internet service churn dataset available on Kaggle. This dataset compiles customer data from an internet service provider, with a total of 72,274 records containing 11 attributes. These attributes range from indicators of whether customers have subscribed to services such as TV and movie packages, to variables like how long they have been subscribed (subscription age), the average amount they are billed (bill avg), and metrics related to service quality (service failure count, download avg, and upload avg). There are instances of missing data in attributes such as the remaining contract details and internet speed measures (download avg and upload avg), which means that some cleaning or imputation is needed before further analysis can be done. The target feature is the churn attribute, which indicates whether customers are continuing with or ending their services.

Another dataset, the bank customer churn records, is accessible on Kaggle. This dataset contains 10,000 records and 18 features outlining the characteristics of bank clients. It covers personal information, financial details, and patterns of service usage, including name, location, gender, credit score, account balance, and the number of banking products used. The dataset includes factors related to customer satisfaction and engagement levels, such as the Satisfaction Score, Complaints, and Exited statuses. It also covers details like Card Type and Points Earned, which may indicate how effective customer loyalty programs are. There are no missing data in the dataset, making the preprocessing phase simple.

This study also utilizes the credit card churn dataset available on Kaggle. This dataset has 10,127 entries with 23 attributes, offering insights into credit card customers' profiles.
This dataset contains a mix of numerical and categorical data such as customer ID, gender, login device, payment methods, and cashback amount. Some columns have missing values suggesting the need for data cleaning before analysis. The dataset includes features like product purchase counts, customer demographics, and transaction specifics that could aid in understanding customer behavior and predicting churn in an e-commerce setting. The presence of a range of values, in product category and transaction columns implies that the dataset captures a diverse array of customer interactions and preferences. Employee attrition data is also included in this study. This dataset is available on Kaggle, which comprises 1,470 entries with 35 features. These features are categorical and numerical, offering an overview of the employees. It contains the employee demographic features like age and gender, job specifics such as department, role, and monthly earnings, and satisfaction indicators like work environment, job fulfillment, and work-life balance. The target variable is Attrition, signaling whether employees are still part of the organization or have moved on. Another dataset included in this study is Telco Europa. This dataset is accessible on Kaggle, having 190,776 entries with 20 attributes. These features include both categorical and continuous variables. These features include basic service usage metrics and extend to geographical and technical dimensions. Notable categorical variables like cni customer which represents customer identifiers, and churn, a binary indicator show- 24 ing whether a customer has discontinued the service or not. Continuous variables include days life representing the tenure of the customer in days, device technology indicating the type of device technology used, and various metrics related to call minutes such as tot min call out and avg min in 3 for the average incoming call minutes over the last three months. Other important variables are min plan and price plan, indicating the minutes and cost of the customer’s plan, respectively. Additionally, it features technical attributes for both data and voice services, such as tec ant data, state data, city data, tec ant voice, state voice, and city voice. The Telcom Cell2Cell dataset is another dataset that was used in this study. This dataset is downloaded from the Kaggle website. It provided by Duke University, consists of 71,047 records and 70 attributes focused on customer churn within a telecommunications framework. The dataset captures a wide array of information, including basic churn indicators, detailed usage metrics such as minutes of use, charges for overages and roaming, and customer service interactions including call quality metrics like dropped and blocked calls. Additionally, it provides insights into customer demographics and equipment usage, featuring variables that describe age, device usage duration, and possession of web-capable devices. The Membership Subscription dataset was also used in this project. This dataset contains 10,362 entries, detailing various aspects of membership within an organization, encapsulated in 15 attributes. These include the membership number, annual fees, and detailed member demographics such as marital status, gender, annual income, and occupation. Key attributes further encompass the type of membership package, the number of additional members, and the respective payment mode. 
Each entry also tracks the membership life-cycle through start and end dates and the membership status, providing a rich basis for analyzing membership retention.

The South Asian Wireless Telecom Churn dataset comprises 2,000 records with 14 attributes that detail various aspects of customer usage and interaction with a telecom service provider. It includes metrics such as network age, total revenue, and revenue specifics from SMS and data usage, alongside total data volumes. Notable features include categorical variables like user type and favorite services for different months, and continuous variables such as the number of calls made, revenue from on-net and off-net calls, and the count of customer complaints. Furthermore, the dataset tracks changes in customer status through the Class attribute, which indicates whether customers have continued or discontinued their services.

The Nigeria Telecoms Churn dataset encompasses 1,401 entries with 16 attributes, detailing the telecom usage and interactions of customers in Nigeria. It includes comprehensive metrics such as network age, customer tenure, total expenditure over two months, SMS and data spending, and data consumption. The dataset further assesses customer engagement through metrics like the total number of unique calls, spending on calls within and outside the network, and the number of complaint calls to the service center. Additionally, it captures the network technology type preferred by customers across two months and their favored competitor networks during the same period. A pivotal aspect of this dataset is the Churn Status, which indicates whether customers have remained with or left the service provider, providing critical insights for churn analysis and customer retention strategies.

Chapter 4

Methodology

This research adopts an individual participant data meta-analysis approach to evaluate the effectiveness of various machine learning models in predicting customer churn across multiple datasets. This methodology allows for a more nuanced analysis by integrating raw data from different studies, thus facilitating a comprehensive comparison and validation of predictive models under varying conditions. Since the public datasets vary widely in the number of features and in feature names, the two-stage IPD-MA approach is applied in this project. In the first stage, the IPD of each study (dataset) is analyzed separately to obtain aggregate data, namely the ML performance effect estimates and their standard errors. In the second stage, the aggregate data obtained from the first stage are combined in a standard meta-analysis model. In this project, both common-effect (fixed-effect) and random-effects models are applied. Figure 4.1 shows the workflow of this project's research methodology, in which the first-stage analysis processes are represented in the blue box and the second stage in the green box.

Figure 4.1: Two-stage IPD-MA Workflow

4.1 Individual Dataset - Exploratory Data Analysis and Data Processing

4.1.1 Exploratory Data Analysis (EDA)

This EDA process was applied in the same way to all 11 datasets, with visualizations and statistical analyses tailored to the specific characteristics of each dataset. The EDA process included creating summary data tables for each dataset to identify unique values, missing values, NaN values, duplicated entries, and data types.
It also involved visualizing the distribution of the target variable to identify class imbalances and analyzing the distribution of categorical features to understand customer preferences and service uptake. Additionally, the distribution of numerical features was examined to assess data range, central tendency, and dispersion. Finally, the correlation matrix was evaluated to identify multicollinearity and guide feature selection. These steps ensure a robust exploratory analysis, providing valuable insights that inform the predictive modeling and strategic decision-making processes.

We begin by creating a summary data table for each dataset. The summary data table is a valuable tool in the exploratory data analysis phase, providing a concise overview of the dataset's characteristics. These tables highlight the number of unique values, missing values, NaN values, duplicated entries, and data types for each feature. This information is crucial for identifying categorical versus continuous features, guiding the handling of missing data, and ensuring appropriate data types for analysis and modeling.

Table 4.1: Summary of Internet Churn Dataset

Feature                        Unique   Missing   NaN     Duplicated   Dtypes
reamining contract             247      21572     21572   0            float64
download avg                   2856     381       381     0            float64
upload avg                     802      381       381     0            float64
id                             72274    0         0       0            int64
is tv subscriber               2        0         0       0            int64
is movie package subscriber    2        0         0       0            int64
subscription age               1110     0         0       0            float64
bill avg                       179      0         0       0            int64
service failure count          19       0         0       0            int64
download over limit            8        0         0       0            int64
churn                          2        0         0       0            int64

For instance, Table 4.1 reveals that the reamining contract feature has a significant number of missing values, necessitating imputation or exclusion, while binary features like is tv subscriber are confirmed as categorical. Additionally, understanding the extent of missing data and identifying features with high cardinality or duplicates helps prioritize data cleaning efforts and informs feature engineering decisions.

The distribution of the target variable is a crucial aspect of EDA, as it provides valuable insights into the balance or imbalance within the dataset. This information is essential for understanding the dataset's characteristics and informing subsequent modeling decisions. Figure 4.2 illustrates the varying distributions of the target variables across the 11 datasets, providing clear examples of where imbalances exist.

Figure 4.2: Target Distribution.

For instance, the Telecom Customer Status plot in Figure 4.2 shows a significant difference between the number of customers who stayed versus those who churned, indicating a class imbalance. Identifying such imbalances is critical because they can bias the predictive model, leading to poor performance in predicting the minority class. Similarly, the Credit Card Attrition plot reveals an imbalance with more existing customers than attrited customers, highlighting the need for strategies to handle such disparities in the data. Understanding these imbalances helps in selecting appropriate evaluation metrics like precision, recall, F1-score, and ROC-AUC, which are more informative for imbalanced datasets.

Figure 4.3: Categorical Features Distribution in Credit Card Dataset.

Next, the distribution of categorical features is crucial in exploratory data analysis, as it provides valuable insights into the composition and characteristics of the dataset.
Understanding these distributions aids in feature engineering and selection, allowing me to decide which features to include in their models and how to preprocess them effectively. Figure 4.3 displays the distribution of several categorical features from the credit card dataset. Each subplot provides a count plot for a specific categorical feature, revealing key patterns and distributions within the data. For instance, the Distribution of Gender subplot shows a nearly balanced distribution between female and male customers, with a slight predominance of females. While, the Distribution of Education Level indicates that most customers are graduates, followed by high school graduates and those with unknown educational backgrounds, while the least represented are doctorate holders. The distribution of numerical features is an essential part of EDA. This analysis provides a comprehensive understanding of the dataset’s range, central tendency, and 31 Figure 4.4: Numerical Features Distribution in Bank Dataset. dispersion. For example, histograms and Kernel Density Estimation plots can reveal whether the data is normally distributed, skewed, or contains outliers, which is crucial for deciding on the appropriate statistical methods and data transformations to apply. This step ensures that the numerical data is accurately represented and ready for further analysis and modeling. Figure 4.4 displays the distribution of several numerical features from the bank customer dataset. The Distribution of CreditScore subplot shows an approximately normal distribution, centered around a mean value with a slight skew towards higher scores, indicating most customers have credit scores between 500 and 800, peaking around 700. The Distribution of Age is right-skewed, with the majority of customers aged between 30 and 50. The heatmap, which visualizes the correlation matrix, is another powerful tool in EDA. It helps in identifying multicollinearity among numerical features by displaying the correlation coefficients between pairs of variables. High correlations indicate redundancy, which can affect the performance of predictive models. For instance, if two features are highly correlated, one might be dropped or transformed to avoid multicollinearity issues in the model. The heatmap provides a clear and immediate visual representation of these relationships, making it easier to spot and address potential problems. By understanding 32 Figure 4.5: Correlations Among Numerical Features in Telcom Churn Dataset. the correlations, data scientists can make informed decisions about feature selection and engineering, ensuring that the models are both efficient and effective. Figure 4.5 shows the correlation coefficients among numerical variables in the telecommunications data set. It can be seen that there is a strong correlation between total revenue and total charges, with a correlation of 0.97. 4.1.2 Data Preparation Data preparation is the next step after data collection and before machine learning model building in this workflow. It includes data splitting, missing value treatment, scaling numerical variables, encoding categorical features, imbalanced data handling, etc. All these are vital steps for data preparation before training a machine learning algorithm. 33 Data Splitting Splitting the dataset into train and test sets is done first in the data preparation to avoid any potential data leakage during this step. Data leakage occurs when training models can access information outside of the training dataset [33]. 
4.1.2 Data Preparation

Data preparation is the next step after data collection and before machine learning model building in this workflow. It includes data splitting, missing value treatment, scaling of numerical variables, encoding of categorical features, and handling of imbalanced data. All of these are vital steps before training a machine learning algorithm.

Data Splitting

Splitting the dataset into train and test sets is done first in the data preparation to avoid any potential data leakage. Data leakage occurs when training models can access information outside of the training dataset [33]. This can lead to overestimating a model's performance during training: once the model is deployed in production, its performance can decrease significantly, and its predictions turn out to be far less reliable than the training performance would suggest. Eliminating data leakage ensures that the performance observed during training better reflects what the model can do on data it has not seen before. Since the telecom customer churn dataset is imbalanced, stratified splitting was applied to ensure that the class distributions of the training and test sets are similar [34]. The initial dataset was split into 75 percent and 25 percent for the train and test sets, respectively. The random state was fixed to make the results reproducible.

Handling Missing Values

First, some non-informative columns, namely customer id, city, zip code, latitude, and longitude, were dropped, assuming that these variables do not offer predictive power. In addition, the directly churn-related columns, churn category and churn reason, were also removed because they are only available after a customer has churned; if these two columns were included, the resulting churn model would be overly optimistic. The process continues with the imputation of missing values for categorical variables. As pointed out in the exploratory data analysis, for service-related features such as multiple lines, internet type, and online security, among others, the missing entries are filled with No because these customers do not have the service. By replacing missing categorical data with meaningful labels, we avoid the bias that would arise from dropping rows with missing data, which may lead to the loss of important information. For continuous variables such as average monthly long distance charges or average monthly GB download, the imputation method used is zero filling, on the rationale that these customers do not subscribe to phone or internet service.

Scaling Numerical Variables

The numerical columns were scaled using min-max scaling. In this study, the MinMaxScaler function from the sklearn.preprocessing module is used to implement the scaling. It subtracts the minimum feature value from each value and then divides by the difference between the maximum and minimum feature values. This transformation maps all numerical features into the range between 0 and 1, which benefits machine learning algorithms that are sensitive to the scale of the input features [35].

Encoding Categorical Variables

In the encoding phase, binary categorical variables such as gender, married, phone service, and paperless billing are manually mapped to numeric binary values, with female and yes mapped to 1, and male and no mapped to 0. This binary encoding transforms the categorical data into a format suitable for machine learning algorithms, which require numerical input. The remaining categorical variables undergo one-hot encoding, which converts categorical values into a form that machine learning algorithms can use more effectively for prediction. The OneHotEncoder creates a binary column for each category and returns a sparse matrix or dense array [25]. By employing one-hot encoding, the model can interpret the data without falsely attributing order or priority where it does not exist [36].
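A condensed sketch of these preparation steps is given below. It assumes the telecom DataFrame df with a 0/1 churn target and the column groupings described above; the column names and the random state are illustrative, not the project's exact configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Target and features; ID-like and directly churn-related columns are dropped.
X = df.drop(columns=["churn", "customer_id", "city", "zip_code", "latitude", "longitude"])
y = df["churn"]  # assumed to be encoded 0/1

# Stratified 75/25 split performed before any fitting to avoid data leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Missing-value treatment: 'No' for service-related categoricals, zero for usage numerics.
service_cols = ["multiple_lines", "internet_type", "online_security"]          # illustrative
usage_cols = ["avg_monthly_long_distance_charges", "avg_monthly_gb_download"]  # illustrative
for part in (X_train, X_test):
    part[service_cols] = part[service_cols].fillna("No")
    part[usage_cols] = part[usage_cols].fillna(0)

# Min-max scale numerical columns to [0, 1]; the scaler is fitted on the training set only.
num_cols = X_train.select_dtypes(include="number").columns
scaler = MinMaxScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# One-hot encode the remaining categoricals (binary ones could instead be mapped to 0/1).
cat_cols = X_train.select_dtypes(exclude="number").columns
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(X_train[cat_cols])
X_train_prep = np.hstack([X_train[num_cols].to_numpy(), encoder.transform(X_train[cat_cols])])
X_test_prep = np.hstack([X_test[num_cols].to_numpy(), encoder.transform(X_test[cat_cols])])
```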
4.2 Individual Dataset - Models Evaluation

4.2.1 Predictive Modeling

In this study, a large number of machine learning models were implemented to find which model was best suited to the research tasks. Many of these models are ensembles built from other machine learning algorithms. First, logistic regression was selected as the benchmark model. The support vector machine was also included; this model works well in high-dimensional spaces. For comparison, KNN, Gaussian naive Bayes, decision tree, and random forest classifiers were selected. Among these algorithms, KNN is a simple model that nevertheless performs effectively in classification. The Gaussian naive Bayes classifier is a naturally probabilistic approach known for its good overall performance in probabilistic classification. The decision tree is an easy-to-interpret model, and the random forest is robust and mitigates overfitting. As additional benchmarks, several boosting and bagging algorithms were also implemented, including gradient boosting, AdaBoost, and bagging classifiers; these ensemble models combine many weak learners into a strong one. A multi-layer perceptron (MLP) classifier, an artificial neural network capable of capturing highly non-linear relationships, was also employed. Moreover, this study includes sophisticated ensemble methods such as XGBoost and LightGBM, which have proven to be very fast and highly competitive in many machine learning competitions. Each model was initialized with a random seed to ensure that the results are reproducible. Model performance is evaluated using multiple metrics, namely accuracy, precision, recall, and F1 score, to give as complete an evaluation as possible. The final model is decided by weighing all the performance measures as well as other considerations such as interpretability and computational efficiency.

4.2.2 Classifiers Performance Evaluation

All the models were evaluated by stratified k-fold cross-validation with 5 splits. This approach ensures that the proportion of samples from each class is approximately the same in each split. Such splitting gives a better evaluation of model performance than a single train-test split, especially for imbalanced data. The evaluation metrics are accuracy, precision, recall, F1 score, and ROC-AUC. In each step of the cross-validation, the initial training data was split into training and validation sets; the model was trained on the training set, evaluated on the validation set, and the obtained score for each metric was recorded. This process was repeated for all training-validation splits, and the scores were then averaged across folds to get a more stable estimate of the model's performance. After cross-validation, the models were retrained on the entire initial training set and evaluated on the unseen test set, using the same metrics, to assess generalization performance.
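A minimal sketch of this evaluation loop with scikit-learn is shown below, reusing the preprocessed matrices X_train_prep, X_test_prep and labels y_train, y_test from the preceding sketch; the model dictionary is abbreviated and illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    # CatBoost, LightGBM, XGBoost, and the other classifiers would be added in the same way.
}
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    # Stratified 5-fold cross-validation on the training data, averaged across folds.
    scores = cross_validate(model, X_train_prep, y_train, cv=cv, scoring=scoring)
    print(f"{name}: mean ROC-AUC = {scores['test_roc_auc'].mean():.4f}")

    # Retrain on the full training set and check generalization on the held-out test set.
    fitted = model.fit(X_train_prep, y_train)
    print(f"{name}: test accuracy = {fitted.score(X_test_prep, y_test):.4f}")
```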
4.2.3 Hyperparameter Tuning

GridSearchCV and RandomizedSearchCV were used for hyperparameter tuning, with the aim of optimizing the hyperparameters to improve model performance. GridSearchCV performs an exhaustive search over a parameter grid for each classifier: it evaluates all possible combinations of parameters and picks the best one. RandomizedSearchCV, in contrast, evaluates a fixed number of randomly sampled parameter settings. It is computationally less intensive than GridSearchCV, yet it can often find a good approximation of the best parameters. Once the tuning was done, the best parameters for each classifier were determined, and each classifier was retrained with them on the entire training dataset. Finally, the retrained models were assessed on the unseen test dataset, where their ability to generalize can be estimated.

4.3 Aggregate Effect Size and Standard Error

Once the performances of the various machine learning algorithms on the individual datasets were obtained, aggregate metrics were derived for the meta-analysis. This approach provides a comprehensive understanding of the algorithms' performances across multiple datasets and allows for a robust comparison.

Aggregate Effect Size

The aggregate effect size is a critical metric in meta-analysis, offering a consolidated measure of the performance of each ML algorithm across all datasets. It was obtained by calculating the average performance metric of each ML algorithm across all datasets; the mean AUC is used as the aggregated effect size in the second-stage meta-analysis. This aggregate measure helps in understanding the overall efficacy of each algorithm and identifying which performs best on average. Additionally, the average performance of all ML algorithms on each individual dataset was computed, providing insight into which datasets presented the greatest challenges or the easiest conditions for the algorithms.

Standard Error

The standard error (SE) plays a crucial role in understanding the variability and reliability of the aggregate effect size; it measures the precision of the average performance estimates. For each ML algorithm's aggregate performance, the SE was calculated to quantify the dispersion of performance metrics across the datasets. This involves computing the standard deviation of the performance metrics for each algorithm and dividing it by the square root of the number of datasets. Similarly, the SE for the average performance of all ML algorithms on each individual dataset was calculated. This helps in assessing the consistency of the algorithms' performances on a particular dataset and the robustness of the aggregate metrics.

Calculating Aggregate Performance for Each ML Algorithm Across All Datasets

1. For each ML algorithm, the performance metrics from all datasets were collected.
2. The mean performance metric for each algorithm was computed.
3. The standard deviation of these performance metrics was calculated.
4. The SE for each algorithm's average performance was derived using the formula

SE = \frac{\mathrm{SD}}{\sqrt{N}}    (4.1)

where SD is the standard deviation and N is the number of datasets.

Calculating Aggregate Performance for All ML Algorithms on Each Dataset

1. For each dataset, the performance metrics of all ML algorithms were collected.
2. The mean performance metric for each dataset was computed.
3. The standard deviation of these performance metrics was calculated.
4. The SE for each dataset's average performance was derived using the same formula

SE = \frac{\mathrm{SD}}{\sqrt{N}}    (4.2)

where N is the number of evaluated ML models.
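Assuming the cross-dataset results are collected in a long-format DataFrame named results with columns model, dataset, and auc (an illustrative layout, not the project's exact one), these aggregates can be computed as follows.

```python
import numpy as np
import pandas as pd

def aggregate(results: pd.DataFrame, by: str) -> pd.DataFrame:
    # Mean AUC, standard deviation, and SE = SD / sqrt(N), grouped per model or per dataset.
    grouped = results.groupby(by)["auc"]
    out = grouped.agg(mean_auc="mean", sd="std", n="count")
    out["se"] = out["sd"] / np.sqrt(out["n"])
    return out

by_model = aggregate(results, "model")      # Table 5.1-style aggregates
by_dataset = aggregate(results, "dataset")  # Table 5.2-style aggregates
```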
4.4 Estimating Pooled Effect Size and Confidence Interval

4.4.1 Fixed-Effect Model

The fixed-effect model assumes that the effect size is consistent across all studies or datasets, and that any observed variation is due solely to sampling error. The following steps and equations are used to estimate the pooled effect size and its confidence interval (CI) from the aggregated mean AUCs and standard errors.

1. Compute the Weights:
• Calculate the variance of each effect size:

v_i = SE_i^2    (4.3)

• Assign a weight w_i to each effect size, equal to the inverse of its variance:

w_i = \frac{1}{v_i}    (4.4)

• The weight reflects the precision of each effect size, giving more importance to effect sizes with smaller standard errors.

2. Calculate the Pooled Effect Size:
• The pooled effect size \hat{AUC} is a weighted average of the individual effect sizes:

\hat{AUC} = \frac{\sum_{i=1}^{k} w_i \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i}    (4.5)

where k is the number of datasets or studies.

3. Estimate the Variance of the Pooled Effect Size:
• The variance of the pooled effect size, V_{\hat{AUC}}, is calculated as:

V_{\hat{AUC}} = \frac{1}{\sum_{i=1}^{k} w_i}    (4.6)

4. Compute the Standard Error of the Pooled Effect Size:
• The standard error of the pooled effect size is the square root of its variance:

SE_{\hat{AUC}} = \sqrt{V_{\hat{AUC}}}    (4.7)

5. Construct the Confidence Interval:
• The 95% confidence interval for the pooled effect size is calculated using the standard error and the critical value from the standard normal distribution (1.96 for a 95% CI):

CI_{95\%} = \hat{AUC} \pm 1.96 \times SE_{\hat{AUC}}    (4.8)

4.4.2 Random-Effect Model

The random-effects model assumes that the effect sizes vary across studies or datasets due to real differences in effects as well as sampling error. This model incorporates both within-study and between-study variability to provide a more generalizable estimate of the pooled effect size and its confidence interval (CI). The following steps and equations are used to estimate the pooled effect size and its CI from the aggregated mean AUCs and standard errors.

1. Compute the Weights:
• Calculate the variance of each effect size:

v_i = SE_i^2    (4.9)

• Estimate the between-study variance τ² using the DerSimonian and Laird method ([37], [28]):

\tau^2 = \max\left(0, \frac{Q - (k - 1)}{C}\right)    (4.10)

where

Q = \sum_{i=1}^{k} w_i^* (\hat{AUC}_i - \hat{AUC}_w)^2    (4.11)

w_i^* = \frac{1}{v_i}    (4.12)

\hat{AUC}_w = \frac{\sum_{i=1}^{k} w_i^* \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i^*}    (4.13)

C = \sum_{i=1}^{k} w_i^* - \frac{\sum_{i=1}^{k} (w_i^*)^2}{\sum_{i=1}^{k} w_i^*}    (4.14)

• Compute the random-effects weights:

w_i = \frac{1}{v_i + \tau^2}    (4.15)

2. Calculate the Pooled Effect Size:
• The pooled effect size \hat{AUC} is a weighted average of the individual effect sizes:

\hat{AUC} = \frac{\sum_{i=1}^{k} w_i \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i}    (4.16)

where k is the number of datasets or studies.

3. Estimate the Variance of the Pooled Effect Size:
• The variance of the pooled effect size, V_{\hat{AUC}}, is calculated as:

V_{\hat{AUC}} = \frac{1}{\sum_{i=1}^{k} w_i}    (4.17)

4. Compute the Standard Error of the Pooled Effect Size:
• The standard error of the pooled effect size is the square root of its variance:

SE_{\hat{AUC}} = \sqrt{V_{\hat{AUC}}}    (4.18)

5. Construct the Confidence Interval:
• The 95% confidence interval for the pooled effect size is calculated using the standard error and the critical value from the standard normal distribution (1.96 for a 95% CI):

CI_{95\%} = \hat{AUC} \pm 1.96 \times SE_{\hat{AUC}}    (4.19)
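The two pooling procedures above translate directly into a short numpy routine. The sketch below operates on arrays of per-study mean AUCs and standard errors, reusing the per-model aggregates from the earlier sketch; variable names are illustrative.

```python
import numpy as np

def pool_fixed(auc, se):
    # Fixed-effect pooling (Eqs. 4.3-4.8): inverse-variance weights.
    w = 1.0 / se**2
    pooled = np.sum(w * auc) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

def pool_random(auc, se):
    # Random-effects pooling with DerSimonian-Laird tau^2 (Eqs. 4.9-4.19).
    v = se**2
    w_star = 1.0 / v
    auc_w = np.sum(w_star * auc) / np.sum(w_star)
    q = np.sum(w_star * (auc - auc_w) ** 2)
    c = np.sum(w_star) - np.sum(w_star**2) / np.sum(w_star)
    tau2 = max(0.0, (q - (len(auc) - 1)) / c)
    w = 1.0 / (v + tau2)
    pooled = np.sum(w * auc) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

# Example with the per-model aggregates computed earlier.
auc = by_model["mean_auc"].to_numpy()
se = by_model["se"].to_numpy()
print(pool_fixed(auc, se), pool_random(auc, se))
```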
4.5 Residual Heterogeneity Estimates

Residual heterogeneity refers to the variability in effect sizes that remains unexplained after accounting for within-study variance in the meta-analysis. This section outlines the methodology for estimating residual heterogeneity in both fixed-effect and random-effect models.

4.5.1 Test for Heterogeneity

The test for heterogeneity examines whether the variability observed among study estimates is greater than what would be expected by chance alone. In this context, the null hypothesis (H0) is that all studies share a common effect size, suggesting no significant heterogeneity. The alternative hypothesis (HA), on the other hand, proposes that there is significant heterogeneity, indicating that the variability among study estimates is not solely due to sampling error but also due to true differences in effect sizes. The Q statistic is used to test for heterogeneity; it follows a chi-square distribution with k - 1 degrees of freedom, where k is the number of studies. A significant Q statistic (p-value < 0.05) supports rejection of the null hypothesis, confirming the presence of heterogeneity among the studies.

4.5.2 Fixed-Effect Model

In a fixed-effect model, it is assumed that there is a single true effect size common to all studies, and that any observed variation in effect sizes is due solely to within-study sampling error. Unexplained heterogeneity can nevertheless be assessed using the following steps:

1. Calculate the Q Statistic: The Q statistic tests for heterogeneity by comparing the observed variability in effect sizes to what would be expected by chance alone:

Q = \sum_{i=1}^{k} w_i (\hat{AUC}_i - \hat{AUC}_w)^2    (4.20)

where w_i is the weight assigned to each effect size (the inverse of its variance), \hat{AUC}_i is the observed effect size for study i, and \hat{AUC}_w is the weighted average effect size.

2. Calculate the I² Statistic: The I² statistic quantifies the proportion of total variation in effect sizes that is due to heterogeneity rather than chance:

I^2 = \frac{Q - (k - 1)}{Q} \times 100\%    (4.21)

An I² value greater than 50% typically indicates substantial heterogeneity.

3. Calculate the H² Statistic: The H² statistic represents the ratio of total variability to sampling variability:

H^2 = \frac{Q}{k - 1}    (4.22)

Values of H² greater than 1 indicate variability beyond what would be expected from sampling error alone.

4.5.3 Random-Effect Model

In a random-effect model, it is assumed that there is not a single true effect size but rather a distribution of true effect sizes across studies. The model accounts for both within-study variance and between-study variance, the latter referred to as tau-squared (τ²). The steps to estimate residual heterogeneity in a random-effect model are as follows:

1. Estimate the Between-Study Variance (τ²): The DerSimonian and Laird method is commonly used to estimate the between-study variance:

\tau^2 = \max\left(0, \frac{Q - (k - 1)}{C}\right)    (4.23)

where

Q = \sum_{i=1}^{k} w_i^* (\hat{AUC}_i - \hat{AUC}_w)^2    (4.24)

w_i^* = \frac{1}{v_i}    (4.25)

\hat{AUC}_w = \frac{\sum_{i=1}^{k} w_i^* \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i^*}    (4.26)

C = \sum_{i=1}^{k} w_i^* - \frac{\sum_{i=1}^{k} (w_i^*)^2}{\sum_{i=1}^{k} w_i^*}    (4.27)

2. Compute the Random-Effects Weights: Adjust the weights to account for between-study variability:

w_i = \frac{1}{v_i + \tau^2}    (4.28)

where v_i is the within-study variance.

3. Calculate the I² and H² Statistics: As in the fixed-effect model, I² quantifies the proportion of total variation due to heterogeneity, and H² represents the ratio of total variability to sampling variability:

I^2 = \frac{Q - (k - 1)}{Q} \times 100\%    (4.29)

H^2 = \frac{Q}{k - 1}    (4.30)
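A sketch of these heterogeneity statistics, reusing the inverse-variance weights and the auc and se arrays from the pooling routine above:

```python
import numpy as np

def heterogeneity(auc, se):
    # Q, I^2, H^2 computed from fixed-effect (inverse-variance) weights, Eqs. 4.20-4.22.
    w = 1.0 / se**2
    auc_w = np.sum(w * auc) / np.sum(w)
    q = np.sum(w * (auc - auc_w) ** 2)
    df = len(auc) - 1
    i2 = (q - df) / q * 100   # per cent; negative values are usually truncated at zero
    h2 = q / df
    return q, i2, h2

print(heterogeneity(auc, se))
```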
4.6 Funnel Plots

Funnel plots are a graphical tool widely used in meta-analyses to assess publication bias and heterogeneity among studies. They are constructed by plotting the performance measure on the horizontal axis against a measure of study precision or size on the vertical axis. In the machine learning evaluations here, the mean performance metric of a model is plotted against the inverse of the standard error of the performance measure. This arrangement allows for the visual detection of asymmetry in the data that might suggest potential biases. In this research, funnel plots are employed to visually assess the bias and variance in the performance of the machine learning models. Two types of plots were prepared: funnel plots for each machine learning model across multiple datasets, and funnel plots for each dataset across different machine learning models. These plots are critical for identifying whether certain models or datasets consistently perform differently from the general trend and whether there are any signs of publication bias in reporting model performances.

Chapter 5

Results

5.1 Aggregated Data

This section presents the aggregated results of the mean AUC, standard deviation (SD), and standard error (SE) for each machine learning (ML) model and each dataset. The results provide a general understanding of the performance metrics across the various ML models and datasets, highlighting both the models' overall effectiveness and their variability. These data serve as the input for the second stage of the meta-analysis.

5.1.1 Performance Metrics by Machine Learning Model

Table 5.1 provides the mean AUC, standard deviation, and standard error for each ML model. These metrics reflect the aggregated performance of each model across all datasets.

Table 5.1: Mean AUC, Standard Deviation, and Standard Error for Each Model

Model                 Mean AUC   Standard Deviation (SD)   Standard Error (SE)
AdaBoost              0.8836     0.0808                     0.0244
Bagging               0.8682     0.0964                     0.0291
Catboost              0.9036     0.0836                     0.0252
Decision Tree         0.7609     0.1291                     0.0389
GaussianNB            0.7882     0.0708                     0.0214
Gradient Boosting     0.8936     0.0794                     0.0239
KNeighbors            0.7845     0.1085                     0.0327
Lightgbm              0.9000     0.0860                     0.0259
Logistic Regression   0.8345     0.0859                     0.0259
MLP                   0.8745     0.0929                     0.0280
Random Forest         0.8891     0.0884                     0.0266
XGboost               0.8918     0.0923                     0.0278

Catboost achieves the highest mean AUC (0.9036) among the ML models, indicating its strong overall performance across the datasets. Lightgbm (0.9000), Gradient Boosting (0.8936), and Random Forest (0.8891) also demonstrate high mean AUCs, suggesting their effectiveness in handling diverse datasets. The Decision Tree model shows the lowest mean AUC (0.7609) and the highest standard deviation (0.1291), indicating significant variability in its performance; its high standard error (0.0389) further confirms this variability.

5.1.2 Performance Metrics by Dataset

Table 5.2 summarizes the mean AUC, standard deviation, and standard error for each dataset. These metrics reflect the aggregated performance of all ML models on each dataset. The Internet dataset has the highest mean AUC (0.9508) with a relatively low standard deviation (0.0456) and standard error (0.0132), indicating consistent performance across ML models.
Table 5.2: Mean AUC, Standard Deviation, and Standard Error for Each Dataset

Dataset              Mean AUC   Standard Deviation (SD)   Standard Error (SE)
Telcom               0.9042     0.0476                     0.0137
Internet             0.9508     0.0456                     0.0132
Bank                 0.8242     0.0570                     0.0164
Credit Card          0.9517     0.0491                     0.0142
E-Commerce           0.9467     0.0557                     0.0161
Employee             0.8125     0.0885                     0.0256
Telco Europa         0.8942     0.0763                     0.0220
Cell2Cell            0.8167     0.0587                     0.0169
Membership           0.6700     0.0413                     0.0119
SA Wireless Telcom   0.8000     0.0644                     0.0186
Niger Telcom         0.8458     0.0799                     0.0231

Similarly, the Credit Card dataset also exhibits a high mean AUC (0.9517), demonstrating the models' strong performance on this dataset. On the other hand, the Membership dataset has the lowest mean AUC (0.6700), suggesting that the models performed less effectively on it. The Employee dataset shows the highest standard deviation (0.0885) and standard error (0.0256), indicating higher variability in model performance.

5.2 Fixed- and Random-Effect Model Results

This section presents the results of the meta-analysis conducted using both fixed-effect and random-effect models for each machine learning model and each dataset. The forest plots generated from these analyses are discussed to provide a comprehensive understanding of the performance metrics and their variability across ML models and datasets.

5.2.1 Meta-Analysis by Machine Learning Model

Figure 5.1: Fixed-Effect Model Forest Plot by ML.

The fixed-effect model forest plot (Figure 5.1) indicates that Catboost and Lightgbm have the highest mean AUCs (both around 0.90), with relatively narrow confidence intervals, suggesting consistently high performance across datasets. The Decision Tree model has the lowest mean AUC (0.76) with a wide confidence interval, indicating significant variability in performance. The overall fixed-effect model estimate for the ML models is 0.86 [0.85, 0.88], representing the combined performance across all models. These results indicate a high average performance (AUC) with narrow confidence intervals, suggesting precise estimates. However, the significant heterogeneity suggests that the fixed-effect model may not fully account for the variability across studies.

Figure 5.2: Random-Effect Model Forest Plot by ML.

The random-effect model forest plot (Figure 5.2) shows similar results, with Catboost and Lightgbm still performing best. The overall random-effect model estimate is 0.86 [0.83, 0.89], essentially the same as the fixed-effect estimate but with a wider confidence interval, reflecting the incorporation of between-model variability. The Decision Tree model continues to show the lowest performance with considerable variability.

5.2.2 Meta-Analysis by Dataset

The fixed-effect and random-effect models were applied to the aggregated performance data for each dataset. The forest plots for these models are shown in Figures 5.3 and 5.4.

Figure 5.3: Fixed-Effect Model Forest Plot by Dataset.

The fixed-effect model forest plot (Figure 5.3) shows that the Credit Card, Internet, and E-Commerce datasets have the highest mean AUCs, all around 0.95. These datasets demonstrate narrow confidence intervals, indicating consistent performance across different ML models. The Membership dataset has the lowest mean AUC (0.67), with a narrower confidence interval, indicating less variability in performance but consistently lower AUC values. The overall fixed-effect model estimate is 0.85 [0.84, 0.86], representing the combined performance across all datasets.
These results indicate a high average performance (AUC) with narrow confidence intervals, suggesting precise estimates. The very high heterogeneity suggests that the fixed-effect model may not fully account for the variability across datasets. The random-effect model forest plot (Figure 5.4) provides a similar view but accounts for between-study variability. The overall random-effect model estimate is slightly higher at 0.86 [0.81, 0.91], reflecting the incorporation of heterogeneity among datasets. The Credit Card and Internet datasets maintain high mean AUCs with slightly wider confidence intervals, indicating some variability in model performance across these datasets. In general, the meta-analysis results highlight the variability in performance across different ML models and datasets. Among the ML models, Catboost, Lightgbm, and Gradient Boosting are the top performers, whereas the Decision Tree model shows less consistent results. The Credit Card, Internet, and E-Commerce datasets generally exhibit high AUCs, while the Membership dataset poses more challenges. These insights provide a clear understanding of how different datasets and models perform, guiding future model 52 Figure 5.4: Random-Effect Model Forest Plot by Dataset. selection and optimization efforts. 5.3 Residual Heterogeneity Estimates 5.3.1 Residual Heterogeneity Estimates by Machine Learning Model For the fixed-effect model analysis, which assumes a common effect size across all studies, the Q statistic was calculated to be 35.2159 with 11 degrees of freedom, resulting in a p-value of 0.0002. This significant result indicates substantial heterogeneity among the studies, leading to the rejection of the null hypothesis. The I2 statistic, which quantifies the proportion of total variability due to heterogeneity, was found to be 68.76%, indicating a considerable level of heterogeneity. The H2 statistic, which measures total 53 variability relative to sampling variability, was calculated as 3.20, further suggesting notable heterogeneity. In the random-effect model analysis, which accounts for both within-study and between-study variability, similar patterns of heterogeneity were observed. The I2 statistic remained high at 68.57%, and the H2 value was 3.18. Additionally, the tau-squared (Tau2 ) value, representing the total amount of heterogeneity, was estimated at 0.0016 with a standard error of 0.0010, and the square root of Tau2 (Tau) was 0.0395. The Q statistic was consistent with the fixed-effect model, reinforcing the conclusion of significant heterogeneity. In the analysis by dataset using the fixed-effect model, a high degree of heterogeneity was evident. The Q statistic was exceptionally high at 408.4644 with 10 degrees of freedom, yielding a p-value of less than 0.0001, indicating significant heterogeneity among the datasets. The I2 statistic was calculated to be 97.55%, highlighting an extremely high level of heterogeneity. The H2 statistic was 40.85, suggesting that the observed variability greatly exceeds what would be expected due to sampling variability alone. The random-effect model analysis produced similar findings. The I2 statistic was slightly lower at 96.44%, still indicating substantial heterogeneity. The H2 value was 28.13, again suggesting significant variability beyond sampling error. The Tau2 value was estimated at 0.0072 with a standard error of 0.0033, and the square root of this value (Tau) was 0.0846. 
The Q statistic remained consistent with the fixed-effect model, further confirming the presence of significant heterogeneity across the datasets. In summary, the residual heterogeneity estimates, as measured by I2 , H2 , and the Q statistic, indicate that both the ML model and dataset analyses exhibit substantial heterogeneity. This suggests that the observed variability among the studies is due to true differences in effect sizes rather than mere sampling error. 54 5.4 Funnel Plots Analysis 5.4.1 Funnel Plot Analysis by Machine Learning Model Figure 5.5: Fixed-Effect Model Funnel Plot by Machine Learning Model. The fixed-effect model’s funnel plot (Figure 5.5) displays a tight clustering of machine learning models around the peak mean AUC value of approximately 0.9. Although this concentration suggests a high degree of consistency in model performance, a slight asymmetry in the distribution of points around the vertical pooled effect line indicates a potential minor bias in the pooled effect estimation that is slightly leaning toward high performance models. In addition to the traditional interpretation of publication bias using the funnel plot, in the context of this study, the funnel plot can be used to select models with the optimum bias and variance trade-off. Notably, models like Catboost and LightGBM, which are situated at the peak, achieve top AUC scores with minimal standard errors, showcasing their effectiveness in balancing bias and variance. In contrast, models such as Decision Tree and KNeighbors, appearing towards the lower 55 end of the spectrum, exhibit variability that may reflect their sensitivity to training data variations or inherent methodological limitations. Figure 5.6: Random-Effect Model Funnel Plot by Machine Learning Model. The random effect model’s funnel plot (Figure 5.6) also shows a symmetrical distribution of machine learning models around a high mean AUC value. The slight asymmetry around the vertical pooled effect line in this plot also suggests a potential minor bias in the pooled effect estimation. This observation supports the use of the funnel plot not only for detecting potential biases in pooled effect estimation but also for choosing models that maintain an optimal balance between bias and variance. High-performing models such as Catboost and LightGBM continue to demonstrate top-tier performance with low variability, while models like Decision Tree and KNeighbors display greater dispersion, highlighting their potential instability and sensitivity to different data scenarios. 56 Figure 5.7: Fixed-Effect Model Funnel Plot by Dataset. 5.4.2 Funnel Plot Analysis by Dataset The fixed effect model’s funnel plot by dataset (Figure 5.7) illustrates a symmetric distribution around the pooled mean AUC line, indicating a balanced and unbiased representation of datasets. The consistency and high performance of datasets such as Telco Europa and Telecom, which are noted for high mean AUC values with minimal standard errors, demonstrate their robustness and reliability. Conversely, datasets like Membership, which show lower mean AUC values and higher variability, may suggest specific challenges affecting the models’ performance. The random effect model’s funnel plot (Figure 5.8) maintains the symmetry observed in the fixed effect model, confirming the absence of bias in the dataset evaluations. 
This plot emphasizes the strong and consistent performance across datasets such as Internet, Credit Card, and Membership, demonstrating their suitability for testing models in various analytical conditions, with high mean AUC values and minimal variability indicating 57 Figure 5.8: Random-Effect Model Funnel Plot by Dataset. stable model performance. 5.5 Feature Importance The feature importance analysis was conducted using the top three performing machine learning models: Catboost, LightGBM, and Gradient Boosting, across multiple datasets. This analysis aimed to identify the key drivers of customer churn. Figures 5.9 to 5.19 illustrate the top 10 most important features as determined by each model for various datasets. In the Catboost model, significant patterns were observed across the datasets. For instance, in the Bank dataset, features like Age, Balance, and CreditScore were identified as the most critical, emphasizing the role of financial stability and customer demographics in predicting churn. Similarly, the Credit Card dataset highlighted the importance of transaction behavior, with Total Trans Ct and Total Trans Amt emerging as top fea58 Figure 5.9: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Bank Dataset. tures. In the E-Commerce dataset, features such as Tenure and NumberOfAddress were crucial, suggesting that customer loyalty and geographic stability significantly influence churn. The Employee dataset’s key features were Age and OverTime No, pointing to the impact of employee demographics and work patterns on churn. For the Internet dataset, the most important features were Remaining contract and bill avg, underscoring the significance of contract terms and billing amounts in customer retention. In the Membership dataset, the analysis revealed that features like MEMBERSHIP TERM YEARS, ANNUAL FEES, and MEMBER ANNUAL INCOME were the top predictors, suggesting that financial stability and membership tenure are key factors influencing churn. The Nigeria Telecom dataset emphasized the importance of usage and spending patterns, with Total Spend in Months 1 and 2 of 2017 and network age emerging as significant features. The SA Wireless dataset showed that Aggregate Total Rev, Aggregate SMS Rev, and network age were critical in predicting churn, pointing to the role of revenue and usage behavior. Other datasets, such as Telco Europa and Telecom, consistently showed that Tenure and ContractMonthtoMonth were significant, indicating the importance of contract types and customer retention over time. In the LightGBM model, financial and usage-related features were consistently impor59 Figure 5.10: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Credit Card Dataset. tant across different datasets. For example, in the Bank dataset, Balance and EstimatedSalary emerged as top predictors, highlighting financial factors’ influence on churn. The Credit Card dataset identified Total Trans Amt and Total Amt Chng Q4 Q1 as leading indicators, pointing toward the significance of transaction changes and amounts in churn prediction. The E-Commerce dataset emphasized the role of incentives and logistics, with CashbackAmount and WarehouseToHome identified as top features. In the Employee dataset, financial compensation, represented by features such as DailyRate and MonthlyIncome, was crucial in predicting employee churn. 
For the Internet dataset, billing and subscription duration were significant, with Bill avg and subscription age identified as the most important features. In the Cell2Cell dataset, changem and mou emerged as top predictors, highlighting the significance of customer behavior and usage patterns. The Membership dataset identified ANNUAL FEES and MEMBER ANNUAL INCOME as leading indicators, suggesting that financial aspects are crucial in predicting membership churn. In the Nigeria Telecom dataset, features such as Total Onnet spend and Total Data Consumption were identified as significant, pointing towards the importance of usage patterns in churn prediction. The SA Wireless dataset highlighted the role of aggregate revenue and data usage, with Aggregate SMS Rev and network age emerging 60 Figure 5.11: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for E-Commerce Dataset. as top features. In both Telco Europa and Telecom datasets, Monthly Charge and Age were the top features, suggesting that billing and customer demographics are pivotal in churn prediction. The Gradient Boosting model also revealed critical insights into the drivers of churn across various datasets. In the Bank dataset, the focus was on product usage, with NumOfProducts 2 and NumOfProducts 1 emerging as the most important features. The Credit Card dataset highlighted the significance of revolving balances and transaction counts, with Total Revolving Bal and Total Trans Ct identified as top features. In the E-Commerce dataset, customer loyalty and complaints were significant predictors, with Tenure and Complain 0 as key features. The Employee dataset emphasized job roles and work-life balance, with JobRole Manufacturing Director and WorkLifeBalance 1 identified as critical features. The Internet dataset highlighted the importance of contract terms and usage patterns, with remaining contract and download avg emerging as top features. In the Cell2Cell dataset, the focus was on retention efforts, with retcalls 0 and retcalls 1 emerging as the most important features, indicating the importance of retention calls in predicting churn. The Membership dataset emphasized the importance of package types and occupation codes, with MEMBERSHIP PACKAGE TYPE-A 61 Figure 5.12: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Employee Dataset. and MEMBER OCCUPATION CD 2.0 identified as key features. In the Nigeria Telecom dataset, the analysis highlighted the significance of competitor networks, with Most Loved Competitor network in Month 2 Weematel and Month 2 Zintel emerging as top predictors. The SA Wireless dataset showed the importance of aggregate revenue and favorite network features, with sep fav a ufone and sep fav a telenor identified as critical. Similar patterns were observed in the Telco Europa and Telecom datasets, where Contract Month-to-Month and Number of Dependents were crucial in predicting churn. 62 Figure 5.13: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Internet Dataset. Figure 5.14: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Cell2Cell dataset. 63 Figure 5.15: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Membership dataset. Figure 5.16: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Nigeria Telecom dataset. 
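The importance rankings plotted in Figures 5.9 to 5.19 come from each fitted model's built-in importance scores. The following is a minimal sketch of how the top ten features are extracted; the fitted estimators catboost_model, lgbm_model, gb_model and the feature frame X_train are assumed names, not the project's actual objects.

```python
import pandas as pd

# CatBoost, LightGBM, and scikit-learn's GradientBoostingClassifier all expose
# per-feature importances through the feature_importances_ attribute.
models = {"Catboost": catboost_model, "LightGBM": lgbm_model, "Gradient Boosting": gb_model}
for name, model in models.items():
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    print(name)
    print(importances.sort_values(ascending=False).head(10))
```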
64 Figure 5.17: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the SA Wireless dataset. Figure 5.18: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telco Europa Dataset. 65 Figure 5.19: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telecom Dataset. 66 Chapter 6 Discussion and Conclusion 6.1 Discussion The results of this study provide a comprehensive understanding of the performance of various machine learning models in predicting customer churn across multiple datasets. By adopting an individual participant data meta-analysis approach, this research has been able to integrate raw data, facilitating a nuanced comparison and validation of predictive models under varying conditions. 6.1.1 Variability in Model Performance The aggregated results reveal significant variability in the performance of machine learning models across different datasets and models. The fixed-effect model’s high heterogeneity (I2 of 68.76% by ML model and 97.55% by dataset) indicates substantial variability not accounted for by this model. This high heterogeneity suggests that the differences in effect sizes are not solely due to sampling error but also due to genuine differences in dataset characteristics or model capabilities. 67 The standard deviations and standard errors across different models and datasets also highlight this variability. For instance, the Employee dataset exhibited the highest standard deviation (0.0885) and standard error (0.0256), indicating significant variability in model performance. This suggests that some datasets are inherently more challenging for churn prediction, likely due to their unique characteristics or data quality issues. 6.1.2 Top Performing Models and Datasets Among the machine learning models, Catboost, Lightgbm, and Gradient Boosting consistently demonstrated high performance, with mean AUCs close to or above 0.90. These models showed relatively narrow confidence intervals, indicating consistent performance across various datasets. Their high performance can be attributed to their sophisticated ensemble techniques, which combine multiple weak learners to create a strong predictive model. These methods are particularly effective in handling complex relationships within the data and mitigating overfitting. Conversely, the Decision Tree model exhibited the lowest mean AUC and the highest variability, suggesting its limited effectiveness in handling diverse datasets compared to more sophisticated ensemble methods. The inherent simplicity of the Decision Tree algorithm, while offering interpretability, also leads to higher susceptibility to overfitting and poor generalization on unseen data. In terms of datasets, the Internet and Credit Card datasets achieved the highest mean AUCs, around 0.95, with low standard deviations and standard errors. This consistency indicates that the models performed well on these datasets, likely due to their inherent characteristics that align well with the models’ capabilities. Factors such as well-defined features, balanced target variables, and higher data quality might contribute to the superior performance observed on these datasets. On the other hand, the Membership dataset posed the most significant challenge, with the lowest mean AUC and higher variability, indicating that the models struggled 68 to effectively predict churn in this dataset. 
This could be due to several reasons, such as the presence of more noise, fewer relevant features, or a higher degree of class imbalance that complicates the learning process. 6.1.3 Fixed-Effect vs. Random-Effect Models The comparison between fixed-effect and random-effect models provides further insights. The fixed-effect model, which assumes a common effect size across studies, showed high precision but failed to account for the observed heterogeneity adequately. This model is useful when the studies are assumed to be functionally identical and the only variability is due to within-study sampling error. However, in real-world applications, this assumption is often unrealistic due to genuine differences in datasets and study conditions. In contrast, the random-effect model, which incorporates between-study variability, offered slightly lower mean AUC estimates with wider confidence intervals but provided a more realistic assessment by acknowledging the variability across datasets and models. This model is more appropriate when the datasets are not identical and there is a need to generalize the findings beyond the included studies. The random-effect model’s higher pooled effect size estimate for the dataset analysis (0.86 [0.81, 0.91]) compared to the fixed-effect model (0.85 [0.84, 0.86]) underscores the importance of considering heterogeneity in meta-analyses. This approach offers a more generalizable understanding of model performance, essential for practical applications in varied real-world scenarios. 6.1.4 Discussion on Funnel Plots Analysis The analysis of funnel plots presented in this study underscores several key insights into the behavior of machine learning models and the impact of dataset characteristics on model performance. The findings reveal both the utility of funnel plots in detecting 69 biases and their role in identifying models with optimal performance characteristics. The observed slight asymmetry in the funnel plots for both fixed-effect and randomeffect models by machine learning model suggests a minor bias toward models demonstrating high performance. This asymmetry, albeit small, could indicate a propensity for these models to perform better under specific conditions or with particular types of data, which might not generalize across broader applications. The consistent positioning of models like Catboost and LightGBM at the peak of these plots highlights their ability to achieve high accuracy while maintaining a balance between bias and variance, an essential feature for models intended for practical deployment. Conversely, the variability shown by models such as Decision Tree and KNeighbors could be symptomatic of their lower tolerance to changes in data characteristics or their susceptibility to overfitting. This aspect of model performance is crucial for developers to consider when choosing models for real-world applications, where data can vary significantly from the conditions under which models are trained. The symmetric distribution of points in the funnel plots by dataset, particularly in the fixed-effect model, confirms a balanced and unbiased evaluation of datasets. The high performance noted for datasets like Telco Europa and Telecom suggests that these datasets are well-suited to developing and testing machine learning models, likely due to their quality and representative nature. 
In contrast, the lower performance and higher variability associated with the Membership dataset highlight the challenges that can arise from datasets with specific characteristics or limitations. This observation stresses the importance of careful dataset selection and preparation in achieving reliable model performance. The application of funnel plots extends beyond traditional publication bias detection to a methodological tool for evaluating and selecting machine learning models based on their performance stability and generalizability. By identifying models that not only perform well but also show consistent results across different tests, practitioners can better select algorithms for deployment in varied settings. This approach also offers a 70 systematic way to test model robustness before practical application, potentially reducing the risk of deploying models that perform well in a controlled study but fail in real-world scenarios. 6.1.5 Feature Importance The feature importance analysis across various datasets using Catboost, LightGBM, and Gradient Boosting revealed consistent patterns and highlighted specific variables that are critical in predicting customer churn. Demographic Factors: Features such as Age, Tenure, and Membership Term Years appeared consistently across multiple datasets, underscoring the importance of demographic factors in predicting churn. This suggests that understanding the age distribution and customer loyalty can provide valuable insights into customer retention strategies. Financial Indicators: Financial-related features, including Balance, EstimatedSalary, and Annual Fees, were significant across various datasets. These findings highlight the role of financial stability in customer decision-making, where customers with higher financial stability may be less likely to churn. This indicates the importance of considering customers’ financial health when designing retention strategies. Usage Patterns: Usage-related features, such as Total Trans Ct, Total Revolving Bal, and Total Spend in Months 1 and 2 of 2017, were critical indicators of churn. These results suggest that customers who engage more frequently with the service or product are less likely to churn. Monitoring these patterns can help identify at-risk customers and design interventions to improve customer engagement and retention. Contractual and Behavioral Factors: Features related to contracts and customer behavior changes, such as Contract Month-to-Month and changes in service usage (e.g., changer and changem), were vital in predicting churn. These factors highlight the importance of flexible contract terms and monitoring customer behavior to reduce churn 71 rates. Companies should consider offering flexible contract options and closely monitoring customer behavior to preemptively address potential churn. Incentives and Support: Features like CashbackAmount and Retention Calls (retcalls) pointed to the significance of customer incentives and support in retention strategies. This suggests that companies should implement more personalized incentives and enhance customer support to improve retention. By focusing on these factors, businesses can better align their offerings with customer needs and preferences, thereby reducing churn. Overall, this analysis provides valuable insights into the factors driving customer churn across different industries and datasets. 
It also demonstrates the robustness of the top-performing models (Catboost, LightGBM, and Gradient Boosting) in identifying these critical features. These findings can be used to refine customer retention strategies, focusing on the key drivers of churn identified in this study. 6.1.6 Implications for Practice The insights gained from this meta-analysis have several practical implications. Firstly, the consistent high performance of ensemble methods like Catboost, Lightgbm, and Gradient Boosting suggests that these models should be prioritized in customer churn prediction tasks. Their ability to handle complex data structures and mitigate overfitting makes them well-suited for diverse datasets encountered in real-world applications. Secondly, the significant variability in model performance across different datasets highlights the need for dataset-specific strategies. For instance, datasets like Membership, which pose more challenges, may benefit from additional preprocessing steps such as feature engineering, advanced imputation techniques for missing values, and more sophisticated methods for handling class imbalance. Furthermore, the results underscore the importance of using random-effect models 72 in meta-analyses involving heterogeneous datasets. By accounting for between-study variability, these models provide a more accurate and generalizable estimate of effect sizes, which is crucial for informing practical decisions and ensuring the robustness of predictive models in diverse settings. 6.1.7 Contributions of This Research This research makes several significant contributions to the field of machine learning and customer churn prediction: • Methodological Advancement: By employing an IPD-MA approach, this study integrates raw data from multiple datasets, allowing for a more nuanced and robust comparison of machine learning models. This methodological advancement can be applied to other domains where meta-analysis of machine learning performance is required. • Handling Heterogeneity: The research demonstrates the importance of accounting for heterogeneity in meta-analyses by comparing fixed-effect and random-effect models. This contribution is crucial for developing more reliable and generalizable predictive models in varied real-world scenarios. • Practical Insights: The findings offer practical insights for practitioners in selecting and optimizing machine learning models for customer churn prediction. The detailed analysis of model performance across different datasets provides a valuable reference for tackling similar prediction tasks in industry. • Future Research Directions: The study identifies key areas for future research, such as exploring additional datasets, refining models, and incorporating more advanced machine learning techniques. These directions can guide further advancements in the field and improve predictive accuracy and generalizability. 73 6.1.8 Limitations and Future Research While this study provides valuable insights, it also has several limitations. The reliance on publicly available datasets means that the findings may not fully generalize to proprietary datasets with different characteristics. Additionally, the study focused on a limited set of machine learning models and did not explore the full spectrum of possible algorithms and their hyperparameters. Future research should aim to include a broader range of datasets, especially those from different industries and contexts, to validate the findings further. 
Additionally, exploring more advanced and hybrid machine learning models, as well as automated machine learning techniques, could yield even better performance and more robust insights. 6.2 Conclusion This study has demonstrated the effectiveness of the IPD-MA approach in evaluating the performance of machine learning models for predicting customer churn across multiple datasets. The findings highlight the importance of considering dataset characteristics and model capabilities in such analyses. Key takeaways from this research include: • High-Performing Models: Catboost, Lightgbm, and Gradient Boosting emerged as the top-performing models, consistently achieving high AUCs across various datasets. These models should be prioritized for churn prediction tasks in future applications. • Dataset Challenges: The variability in model performance across datasets underscores the need for tailored approaches. The Internet and Credit Card datasets demonstrated high model performance, whereas the Membership dataset posed significant challenges, highlighting the necessity for dataset-specific strategies. 74 • Meta-Analysis Models: The comparison between fixed-effect and random-effect models emphasized the importance of accounting for heterogeneity in meta-analyses. The random-effect model provided a more comprehensive understanding of model performance by incorporating between-study variability. Overall, this research contributes to the field by providing a robust methodological framework and actionable insights for improving churn prediction models. Future work should focus on exploring additional datasets and refining models to further enhance predictive accuracy and generalizability. By leveraging the findings of this study, practitioners can make informed decisions in selecting and optimizing machine learning models for effective customer churn prediction. 75 Bibliography [1] L. Geiler, S. Affeldt, and M. Nadif, “A survey on machine learning methods for churn prediction,” International Journal of Data Science and Analytics, vol. 14, no. 3, pp. 217–242, 2022. [2] P. Lalwani, M. K. Mishra, J. S. Chadha, and P. Sethi, “Customer churn prediction system: A machine learning approach,” Computing, pp. 1–24, 2022. [3] A. K. Ahmad, A. Jafar, and K. Aljoumaa, “Customer churn prediction in telecom using machine learning in big data platform,” Journal of Big Data, vol. 6, no. 1, pp. 1–24, 2019. [4] I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam, and S. W. Kim, “A churn prediction model using random forest: Analysis of machine learning techniques for churn prediction and factor identification in telecom sector,” IEEE access, vol. 7, pp. 60 134–60 149, 2019. [5] S. A. Qureshi, A. S. Rehman, A. M. Qamar, A. Kamal, and A. Rehman, “Telecommunication subscribers’ churn prediction model using machine learning,” in Eighth international conference on digital information management (ICDIM 2013), IEEE, 2013, pp. 131–136. [6] D. S. Sisodia, S. Vishwakarma, and A. Pujahari, “Evaluation of machine learning models for employee churn prediction,” in 2017 international conference on inventive computing and informatics (icici), IEEE, 2017, pp. 1016–1020. 76 [7] M. Rahman and V. Kumar, “Machine learning based customer churn prediction in banking,” in 2020 4th international conference on electronics, communication and aerospace technology (ICECA), IEEE, 2020, pp. 1196–1201. [8] D. L. Garcı́a, À. Nebot, and A. 
Vellido, “Intelligent data analysis approaches to churn as a business problem: A survey,” Knowledge and Information Systems, vol. 51, no. 3, pp. 719–774, 2017. [9] D. Witten and G. James, An introduction to statistical learning with applications in R. springer publication, 2013. [10] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. ” O’Reilly Media, Inc.”, 2022. [11] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informatics in Medicine Unlocked, vol. 16, p. 100 203, 2019. [12] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, pp. 123–140, 1996. [13] P. Bühlmann and B. Yu, “Analyzing bagging,” The annals of Statistics, vol. 30, no. 4, pp. 927–961, 2002. [14] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001. [15] Y. Freund, R. E. Schapire, et al., “Experiments with a new boosting algorithm,” in icml, Citeseer, vol. 96, 1996, pp. 148–156. [16] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of machine learning research, vol. 4, no. Nov, pp. 933–969, 2003. [17] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001. [18] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors),” The annals of statistics, vol. 28, no. 2, pp. 337–407, 2000. 77 [19] B. Sun, S. Chen, J. Wang, and H. Chen, “A robust multi-class adaboost algorithm for mislabeled noisy data,” Knowledge-Based Systems, vol. 102, pp. 87–102, 2016. [20] N. Asbai and A. Amrouche, “Boosting scores fusion approach using front-end diversity and adaboost algorithm, for speaker verification,” Computers & Electrical Engineering, vol. 62, pp. 648–662, 2017. [21] J. Haumahu, S. Permana, and Y. Yaddarabullah, “Fake news classification for indonesian news using extreme gradient boosting (xgboost),” in IOP Conference Series: Materials Science and Engineering, IOP Publishing, vol. 1098, 2021, p. 052 081. [22] S. Džeroski and B. Ženko, “Is combining classifiers with stacking better than selecting the best one?” Machine learning, vol. 54, pp. 255–273, 2004. [23] Y. Xiong, M. Ye, and C. Wu, “Cancer classification with a cost-sensitive naive bayes stacking ensemble,” Computational and Mathematical Methods in Medicine, vol. 2021, 2021. [24] A. Elnagar, R. Al-Debsi, and O. Einea, “Arabic text classification using deep learning models,” Information Processing & Management, vol. 57, no. 1, p. 102 121, 2020. [25] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [26] S. Bryan, “Weighting confusion matrices by outcomes and observations,” in 2020. [27] G. V. Glass, “Primary, secondary, and meta-analysis of research,” Educational researcher, vol. 5, no. 10, pp. 3–8, 1976. [28] M. Harrer, P. Cuijpers, T. Furukawa, and D. Ebert, Doing meta-analysis with R: A hands-on guide. Chapman and Hall/CRC, 2021. [29] R. D. Riley, P. C. Lambert, and G. Abo-Zaid, “Meta-analysis of individual participant data: Rationale, conduct, and reporting,” Bmj, vol. 340, 2010. [30] T. D. Pigott and T. D. Pigott, “Including individual participant data in metaanalysis,” Advances in Meta-Analysis, pp. 109–131, 2012. 78 [31] S. H. Ballew, Y. Mok, and K. 
Matsushita, “A practical guide to interpret individual participant data meta-analysis of observational studies,” American Journal of Kidney Diseases, vol. 78, no. 3, pp. 464–467, 2021. [32] J. Chandler, M. Cumpston, T. Li, M. J. Page, and V. Welch, “Cochrane handbook for systematic reviews of interventions,” Hoboken: Wiley, 2019. [33] A. Shabtai, Y. Elovici, L. Rokach, A. Shabtai, Y. Elovici, and L. Rokach, Data leakage detection/prevention solutions. Springer, 2012. [34] V. R. Joseph and A. Vakayil, “Split: An optimal method for data splitting,” Technometrics, vol. 64, no. 2, pp. 166–176, 2022. [35] E. Alshdaifat, D. Alshdaifat, A. Alsarhan, F. Hussein, and S. M. F. S. El-Salhi, “The effect of preprocessing techniques, applied to numeric features, on classification algorithms’ performance,” Data, vol. 6, no. 2, p. 11, 2021. [36] C. Seger, An investigation of categorical variable encoding techniques in machine learning: Binary versus one-hot and feature hashing, 2018. [37] R. DerSimonian and N. Laird, “Meta-analysis in clinical trials,” Controlled clinical trials, vol. 7, no. 3, pp. 177–188, 1986. 79 Appendix A Source Codes The data and code for this project is stored in this GitHub Repository. 80