THOMPSON RIVERS UNIVERSITY An Individual Participant Data Meta-Analysis of the Machine Learning Models in Predicting Customer Churn By Khanh Van Tran A GRADUATE PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in Data Science KAMLOOPS, BRITISH COLUMBIA August, 2024 SUPERVISOR Dr. Jabed Tomal Dr. Md Erfanul Hoque © Khanh Van Tran, 2024 ABSTRACT Predicting customer churn is a critical task for businesses aiming to retain customers and maintain profitability. This research adopts an individual participant data metaanalysis (IPD-MA) approach to evaluate the effectiveness of various machine learning models in predicting customer churn across multiple publicly available datasets. This methodology facilitates a robust comparison and validation of predictive models by integrating raw data from different studies. The study employs a two-stage approach: first, individual datasets are analyzed to obtain machine learning performance metrics; second, these aggregated metrics are combined using fixed-effect and random-effect meta-analysis models. The results reveal significant variability in model performance across different datasets, with ensemble methods like Catboost, Lightgbm, and Gradient Boosting consistently outperforming other models, achieving the highest average AUCs of 0.9036, 0.9000, and 0.8936, respectively. The study also highlights the importance of considering dataset-specific characteristics and model capabilities, as well as the necessity of accounting for heterogeneity in meta-analyses. This research makes several key contributions, including methodological advancements in applying IPD-MA to machine learning, and a comprehensive evaluation of model performance. The findings offer a valuable reference for selecting and optimizing machine learning models in various industrial applications, guiding future research and practical implementations. Key Words: Customer Churn, Machine Learning, Meta-Analysis, Individual Participant Data, Fixed-Effect Model, Random-Effect Model. ii ACKNOWLEDGEMENTS Firstly, I would like to express my sincere gratitude to my two supervisors, Dr. Jabed Tomal and Dr. Md Erfanul Hoque for teaching, supervising, guiding, and mentoring me in the entire process of my graduation project, which indeed has helped me to shape my research properly and overcome lots of obstacles. An additional warm thank you goes to the instructors in the MSc Data Science program at Thompson Rivers University: Dr. Roger Yu, Dr. Mohamed Tawhid, Dr. Becky Wei Lin, Dr. Mateen Shaikh, and Dr. Mila Kwiatkowska. And I’d like to thank my family for the love and support they’ve shown me every step of the way. Their unwavering faith in me helped carry me through all this. iii Contents 1 Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Significance of The Study . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.6 Structure of The Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 The Importance of Customer Churn Prediction . . . . . . . . . . . . . . 5 2.2 Machine Learning in Churn Prediction . . . . . . . . . . . . . . . . . . . 6 2.3 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . 
. 8 2.3.1 Logistic Classification . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 The K-Nearest Neighbors (KNN) . . . . . . . . . . . . . . . . . . 9 iv CONTENTS 2.4 2.5 v 2.3.3 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . 10 2.3.4 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.5 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.6 Ensemble Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 12 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.1 Stratified K-Fold Cross-Validation . . . . . . . . . . . . . . . . . . 16 2.4.2 Model Performance Metrics . . . . . . . . . . . . . . . . . . . . . 16 Meta-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 Data 22 4 Methodology 27 4.1 Individual Dataset - Exploratory Data Analysis and Data Processing . . 28 4.1.1 Exploratory Data Analysis (EDA) . . . . . . . . . . . . . . . . . . 28 4.1.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Individual Dataset - Models Evaluation . . . . . . . . . . . . . . . . . . . 36 4.2.1 Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.2 Classifiers Performance Evaluation . . . . . . . . . . . . . . . . . 37 4.2.3 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . 37 4.3 Aggregate Effect Size and Standard Error . . . . . . . . . . . . . . . . . 38 4.4 Estimating Pooled Effect Size and Confidence Interval . . . . . . . . . . . 39 4.2 CONTENTS 4.5 4.6 vi 4.4.1 Fixed-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.4.2 Random-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . 41 Residual Heterogeneity Estimates . . . . . . . . . . . . . . . . . . . . . . 42 4.5.1 Test for Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.2 Fixed-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.3 Random-Effect Model . . . . . . . . . . . . . . . . . . . . . . . . 44 Funnel Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5 Results 47 5.1 Aggregated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.1.1 Performance Metrics by Machine Learning Model . . . . . . . . . 47 5.1.2 Performance Metrics by Dataset . . . . . . . . . . . . . . . . . . . 48 Fixed- and Random-Effect Model Results . . . . . . . . . . . . . . . . . . 49 5.2.1 Meta-Analysis by Machine Learning Model . . . . . . . . . . . . . 50 5.2.2 Meta-Analysis by Dataset . . . . . . . . . . . . . . . . . . . . . . 51 Residual Heterogeneity Estimates . . . . . . . . . . . . . . . . . . . . . . 53 5.3.1 Residual Heterogeneity Estimates by Machine Learning Model . . 53 Funnel Plots Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.4.1 Funnel Plot Analysis by Machine Learning Model . . . . . . . . . 55 5.4.2 Funnel Plot Analysis by Dataset . . . . . . . . . . . . . . . . . . . 57 5.2 5.3 5.4 CONTENTS vii 5.5 58 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Discussion and Conclusion 6.1 6.2 67 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.1.1 Variability in Model Performance . . . . . . . . . . . . . . . . . . 67 6.1.2 Top Performing Models and Datasets . . . . . . . . . . . . . . . . 68 6.1.3 Fixed-Effect vs. Random-Effect Models . . . . . . . . . . . . . . . 69 6.1.4 Discussion on Funnel Plots Analysis . . . . . . . . . . . . . . . . . 
69 6.1.5 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.1.6 Implications for Practice . . . . . . . . . . . . . . . . . . . . . . . 72 6.1.7 Contributions of This Research . . . . . . . . . . . . . . . . . . . 73 6.1.8 Limitations and Future Research . . . . . . . . . . . . . . . . . . 74 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A Source Codes 80 List of Figures 2.1 Stratified 5-Fold Cross Validation. . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 The ROC-AUC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1 Two-stage IPD-MA Workflow . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2 Target Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Categorical Features Distribution in Credit Card Dataset. . . . . . . . . 31 4.4 Numerical Features Distribution in Bank Dataset. . . . . . . . . . . . . . 32 4.5 Correlations Among Numerical Features in Telcom Churn Dataset. . . . 33 5.1 Fixed-Effect Model Forest Plot by ML. . . . . . . . . . . . . . . . . . . . 50 5.2 Random-Effect Model Forest Plot by ML. . . . . . . . . . . . . . . . . . 51 5.3 Fixed-Effect Model Forest Plot by Dataset. . . . . . . . . . . . . . . . . . 52 5.4 Random-Effect Model Forest Plot by Dataset. . . . . . . . . . . . . . . . 53 5.5 Fixed-Effect Model Funnel Plot by Machine Learning Model. . . . . . . . 55 viii LIST OF FIGURES ix 5.6 Random-Effect Model Funnel Plot by Machine Learning Model. . . . . . 56 5.7 Fixed-Effect Model Funnel Plot by Dataset. . . . . . . . . . . . . . . . . 57 5.8 Random-Effect Model Funnel Plot by Dataset. . . . . . . . . . . . . . . . 58 5.9 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Bank Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.10 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Credit Card Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.11 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for E-Commerce Dataset. . . . . . . . . . . . . . . . . . . . . . . . . 61 5.12 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Employee Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.13 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Internet Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.14 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Cell2Cell dataset. . . . . . . . . . . . . . . . . . . . . 63 5.15 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Membership dataset. . . . . . . . . . . . . . . . . . . . 64 5.16 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Nigeria Telecom dataset. . . . . . . . . . . . . . . . . 64 5.17 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the SA Wireless dataset. . . . . . . . . . . . . . . . . . . . 65 5.18 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telco Europa Dataset. . . . . . . . . . . . . . . . . . . . . . . . . 65 LIST OF FIGURES x 5.19 Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telecom Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 List of Tables 3.1 Summary of Churn Datasets . . . . . . . . . . . . . . . . . . 
. . . . . . . 22 4.1 Summary of Internet Churn Dataset . . . . . . . . . . . . . . . . . . . . 29 5.1 Mean AUC, Standard Deviation, and Standard Error for Each Model . . 48 5.2 Mean AUC, Standard Deviation, and Standard Error for Each Dataset . 49 xi Chapter 1 Introduction 1.1 Background and Motivation In today’s fast-paced business landscape companies are increasingly turning to data driven strategies to maintain their edge. Customer churn is when a customer stops purchasing products and services from a company. Customers are hard to acquire and might be very expensive for an organization. If a customer does not use your firm’s products and services in the long run, it will inevitably lead to the bad fate of an organization. This is why customer retention is also as important as customer acquisition. Without appropriate attention to customer churn management, it can result in heavy monetary losses, reduced profitability, and loss of visibility in the market. Therefore, organizations started relying on predictive analytics or machine learning approaches to predict or avoid customer churn. Machine learning (ML) technologies have revolutionized approaches, providing new ways to forecast and analyze customer churn. ML models use algorithms to analyze datasets and uncover patterns that may not be evident using traditional statistical methods. However, the effectiveness of these models can vary depending on the characteristics and industry specifics. 1 Given the availability of datasets and diverse ML techniques, there is a strong demand to combine these resources to evaluate the performance of these models thoroughly. This project utilizes an individual participant data meta-analysis a statistical technique that reanalyzes raw data from multiple studies to assess how reliable and effective ML models in predicting customer churn across different datasets and industries. 1.2 Problem Statement Despite the critical insights provided by individual studies on ML models for churn prediction, there is substantial variability in their outcomes. This variation may be influenced by factors such as dataset characteristics, industry-specific dynamics, and methodological approaches. This research aims to mitigate these inconsistencies through an IPD-MA, offering a deeper and more accurate understanding of how different models perform across varied contexts. By integrating data from multiple sources, this study seeks to establish a clearer understanding of the generalizability and limitations of current machine learning approaches to churn prediction. 1.3 Research Objectives This study is structured around several key objectives designed to address the complexities of machine learning in churn prediction: 1. Comprehensive Model Evaluation: To quantitatively assess the performance metrics of various machine learning models across 11 public datasets in predicting customer churn. 2. Cross-industry Applicability: To investigate the efficacy of these models across different industry sectors, identifying how contextual variables influence predictive success. 2 3. Feature Relevance Analysis: To explore which variables are consistently significant predictors of churn across multiple models and datasets, highlighting critical factors influencing customer retention. 1.4 Research Questions 1. Which machine learning models most accurately predict customer churn across diverse datasets? 2. Are there universally significant predictors of churn, or do these vary significantly across different models and datasets? 3. 
To what extent do machine learning models maintain their predictive accuracy across various industry sectors? 4. How can machine learning models be optimized to improve churn prediction based on the insights gained from the meta-analysis? 1.5 Significance of The Study The significance of this study lies in its potential to: • Bridge Knowledge Gaps: By providing an analysis of machine learning models across multiple datasets, this study aims to fill existing gaps in the literature regarding the comparative effectiveness and limitations of these models. • Enhance Business Strategies: The findings will aid businesses in refining their customer retention strategies by adopting the most effective predictive models tailored to their specific industry conditions. 3 • Contribute to Academic Knowledge: This research contributes to the academic field by demonstrating the application of IPD-MA in machine learning, potentially setting a precedent for future research methodologies. 1.6 Structure of The Project This research project is structured as follows: • Chapter 2: Background - This chapter comprehensively reviews existing studies on machine learning models for churn prediction, outlining theoretical underpinnings and methodological advancements. • Chapter 3: Data - Describes data sources used in the study. • Chapter 4: Methodology - Describes the meta-analytical approach, model selection, and statistical methods employed in the study. • Chapter 5: Results - Analyze the results and discuss the implications of findings in the context of existing research and industry practices. • Chapter 6: Discussion and Conclusion - discuss and conclude with a summary of key findings, discuss limitations, and suggest directions for future research. 4 Chapter 2 Background This chapter explains the importance of customer churn prediction. It also provides a summary of the literature review about the application of machine learning in the area of churn prediction. In addition, a brief description of the machine learning algorithms and meta-analysis techniques is also included. 2.1 The Importance of Customer Churn Prediction Predicting customer churn plays a crucial role in Customer Relationship Management (CRM). The main goal of the CRM is to build and sustain lasting relationships with customers. This approach is important in many sectors, particularly to subscriptionbased service firms, such as those in telecommunications, insurance, banking, and online services [1]. Given that such companies rely on consistent and regular membership fees, it is imperative to curb customer switching behavior to maintain sustainable profits. Consequently, the precise prediction of customers likely to churn has become a paramount objective in the industry. 5 2.2 Machine Learning in Churn Prediction Churn prediction study has gained popularity in recent years, due to its substantial impact on corporations and businesses. Machine learning has been widely studied and applied in churn prediction across several industries such as telecommunications, human resources, finance, and online services. Several case studies have been published in the telecommunications industry. Lalwani, Mishra, Chadha, et al. [2] utilized a range of predictive models, including logistic regression, Naive Bayes, support vector machine, random forest, decision trees, boosting, and ensemble approaches, to forecast customer turnover in the telecom sector. K-fold cross-validation was utilized for hyperparameter tuning and to mitigate overfitting. 
The Adaboost and XGBoost classifiers demonstrated the best accuracies among the built models, with 81.71% and 80.8%, respectively. Ahmad, Jafar, and Aljoumaa [3] created a churn prediction model utilizing machine learning techniques on a big data platform to help telecom operators forecast which customers are likely to churn. The dataset was acquired from SyriaTel telecommunications firm. The paper shows that XGBoost algorithm achieved the most optimal outcomes for predicting churn. In addition, Ullah, Raza, Malik, et al. [4] introduced a churn prediction model that utilizes classification and clustering approaches to identify consumers who are likely to churn and to understand the underlying variables contributing to customer turnover in the telecommunications industry. The model employed the Random Forest technique to classify churn and leveraged cosine similarity to categorize churn clients. The process involved the utilization of feature selection techniques and an attribute-selected classifier algorithm to detect and determine the key elements contributing to churn. The assessment of the suggested approach showcased greater churn categorization and client profiling, facilitating customer retention tactics, promotion suggestions, and improved marketing campaigns. A case study by Qureshi, Rehman, Qamar, et al. [5] on churn prediction in the mobile communication market. It describes the use of regression analysis, decision trees, artificial neural networks, and logistic regression to predict potential churners. The study used a dataset 6 from the Customer DNA website, where usage data for 106,000 customers was provided over three months, along with total usage by the customers. Dealing with the problem of class imbalance in the dataset, the authors check how the different machine learning algorithms performed regarding real usage by the customer, while demonstrating that the decision trees were the best classifiers for identifying potential churners. In the human resource (HR) management sector, Sisodia, Vishwakarma, and Pujahari [6] built a model for predicting employee churn rate based on HR analytics dataset, using five different machine learning algorithms, namely, linear support vector machine, decision tree classifier, random forest, k-nearest neighbor, and naı̈ve bayes classifier. The authors evaluated the correlation between attributes, generated a histogram to contrast left employees with various factors, and proposed strategies to optimize employee attrition in organizations. Rahman and Kumar [7]’s paper focuses on predicting customer churn in a commercial bank using efficient data mining methods. It discusses data transformation and classification techniques such as k-nearest neighbor, support vector machine, decision tree, and random forest. The paper results show that oversampling improves the accuracy of decision trees and random forest classifiers, while support vector machine is not suitable for large amounts of data. The study also analyzes customer behavior to explore the likelihood of churn and compares the performance of different models, finding that the Random Forest model after oversampling achieves higher accuracy. Several published papers also try to provide a generalizability of ML in churn prediction. Garcı́a, Nebot, and Vellido [8] suggested a more comprehensive literature review. The authors explain several phases involved in churn prediction analysis, including data collecting, feature selection, model implementation, and assessment techniques and metrics. 
Their survey closes with suggestions derived from the existing body of information. Geiler, Affeldt, and Nadif [1] focused on churn prediction in businesses and explored the performance of various supervised and semi-supervised learning methods and sampling approaches on publicly available datasets. The study suggests an ensemble approach should be used for churn prediction. Given the diversity of methods and the varying results across different studies, there is a compelling need for a meta-analytical approach to churn prediction. Individual Participant Data Meta-Analysis stands out as an advantageous method for this purpose. By integrating data from multiple studies, IPD-MA allows for a more comprehensive analysis and interpretation of churn prediction techniques across different datasets and industries. Furthermore, IPD-MA supports the application of uniform statistical methods across different studies, providing a more reliable and consistent framework for evaluating the efficacy of machine learning techniques in churn prediction. This approach is particularly valuable in the context of churn prediction, where variations in industry-specific factors and model implementations can significantly influence the effectiveness of predictive strategies.

2.3 Machine Learning Algorithms

2.3.1 Logistic Classification

Logistic regression is among the simplest machine learning algorithms for binary as well as multiclass classification. It estimates the probability that an instance belongs to one of the classes as a function of the input features using the logistic function (Eq. 2.1).

P(y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}}   (2.1)

where:
• P(y = 1 \mid x) is the probability that the outcome y is 1 given the input features x.
• x = (x_1, x_2, \dots, x_k) is the vector of input features.
• \beta_0, \beta_1, \beta_2, \dots, \beta_k are the model parameters.
• e is the base of the natural logarithm.

During training, the model parameters are optimized using techniques such as maximum likelihood estimation and gradient descent [9][10]. Logistic regression is computationally efficient and highly interpretable.

2.3.2 The K-Nearest Neighbors (KNN)

KNN is a machine learning method used for both regression and classification tasks. KNN uses distance measurements to predict the most probable value of the target feature. The Euclidean distance is often used in KNN and is calculated using Equation 2.2 [9]. In classification, the predicted class is the most frequent class among the k neighbors. The optimal number of neighbors k is identified by cross-validation. The method is non-parametric: it classifies based solely on stored examples, using the existing data without building a generalized model. It is called a lazy learning algorithm because its computations are deferred until query time, so it does not require a separate training phase. It is effective with low-dimensional data but less so with high-dimensional data; for high-dimensional data, principal component analysis may be applied before implementing KNN.

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}   (2.2)

where:
• d(x, y) is the Euclidean distance between points x and y.
• \sum_{i=1}^{n} denotes the summation over all dimensions from 1 to n.
• x_i and y_i are the coordinates of points x and y in the i-th dimension, respectively.
• (x_i - y_i)^2 is the squared difference between the i-th coordinates of the two points.
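To make the two classifiers above concrete, the following minimal sketch (not the pipeline used in this project) fits scikit-learn's LogisticRegression and KNeighborsClassifier on a synthetic, imbalanced stand-in for a churn dataset and compares their held-out ROC-AUC; all names and settings here are illustrative assumptions.

# Minimal sketch: logistic regression vs. KNN on a synthetic churn-like dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: 1,000 customers, 10 numeric features, roughly 20% churners.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test ROC-AUC = {auc:.3f}")

Standardization is included in both pipelines because KNN's distance computations are sensitive to feature scales.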
2.3.3 Support Vector Machines (SVM)

SVM is a supervised machine learning technique that has been widely used for classification purposes. It searches for the hyperplane that correctly divides the majority of the training data into two classes, while possibly misclassifying a small number of observations. The optimal hyperplane is the solution to the following optimization problem [9].

\underset{\beta_0, \beta_1, \dots, \beta_p,\ \epsilon_1, \dots, \epsilon_n,\ M}{\text{maximize}} \; M   (2.3)

subject to

\sum_{j=1}^{p} \beta_j^2 = 1   (2.4)

y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i)   (2.5)

\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C   (2.6)

where:
• M is the width of the margin.
• \beta_0, \beta_1, \dots, \beta_p are the coefficients of the hyperplane.
• \epsilon_1, \dots, \epsilon_n are slack variables that allow individual observations to fall on the wrong side of the margin or the hyperplane.
• x_1, \dots, x_n is the set of n training observations.
• y_1, \dots, y_n are the associated class labels.
• C is a nonnegative tuning parameter.

SVM can perform both linear and nonlinear classification. It can find a nonlinear separator by transforming the data into a higher-dimensional space and then applying linear classification there, with the support vectors defining the decision boundary or hyperplane. The SVM algorithm can use kernel functions, such as the linear, polynomial, and radial basis function (RBF) kernels, to handle both linearly and nonlinearly separable data [10].

2.3.4 Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem. The "naive" part comes from the assumption that the features of an observation are conditionally independent given the class label. It is stated mathematically as Equation 2.7 [9]. Because it is a probabilistic classifier, it can produce predictions over multiple classes, and the decision is made based on the conditional (posterior) probability.

\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_{k1}(x_1) \, f_{k2}(x_2) \cdots f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \, f_{l1}(x_1) \, f_{l2}(x_2) \cdots f_{lp}(x_p)}, \quad k = 1, \dots, K   (2.7)

where:
• \Pr(Y = k \mid X = x) is the posterior probability that an observation X belongs to the k-th class.
• \pi_k is the prior probability that a randomly chosen observation comes from the k-th class.
• f_{kj} is the density function of the j-th predictor among observations in the k-th class.

2.3.5 Decision Trees

Decision trees are powerful and versatile techniques used for classification and regression, commonly referred to as the Classification and Regression Trees (CART) method. The method is capable of handling data with large dimensionality and can process both numerical and categorical data. It operates by recursively partitioning the dataset into subsets based on the most informative features [9]. Each internal node of the tree represents a decision point where a feature is evaluated, and each leaf node corresponds to a class label or a regression value. Decision trees are constructed using criteria such as Gini impurity or information gain to optimize the splitting process [10]. They are interpretable models that facilitate human understanding of decision-making processes in complex datasets.

G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2   (2.8)

where:
• G_i is the Gini score of the i-th node.
• p_{i,k} is the ratio of training instances of class k in the i-th node.

2.3.6 Ensemble Classifiers

Ensemble methods can help to minimize the variance and bias of separate models or classifiers by merging their predictions, resulting in more accurate and reliable models [11]. There are three commonly used ensemble learning methods, namely bagging, boosting, and stacking, which can be utilized to enhance the machine learning process.
In this section, we will delve into the details of each method, including its working nature, characteristics regarding data generation, training of baseline classifiers, and suitable fusion methods. We will also cover the advantages, disadvantages, and implementation challenges associated with each method.

Bagging

The bagging method, also referred to as bootstrap aggregating, is a data-driven algorithm that involves creating multiple subsets of data from the original dataset [12]. Bagging aims to generate diverse predictive models by adjusting the distribution of the training datasets, since even small changes in the training data set can lead to significant changes in the model predictions. Bagging reduces variance, mitigates overfitting, and performs well on high-dimensional data. However, it also has some drawbacks, such as being computationally expensive, having high bias, and reducing the interpretability of models [13]. The Random Forests algorithm is a notable example of the bagging technique [14]. Implementing the bagging method presents several challenges, such as determining the optimal number of base learners and subsets, the maximum number of bootstrap samples per subset, and the fusion method for integrating the outputs of the base classifiers using various voting methods. In summary, the bagging method employs parallel ensemble techniques with no data dependency, and the fusion methods depend on different voting methods to generate predictions. This approach generates B different bootstrapped training data sets. The algorithm is then trained on the b-th bootstrapped training set to predict a point x. The bagging function is:

f(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)   (2.9)

where f_b(x) is a weak learner.

Boosting

The boosting approach is a sequential procedure in which each succeeding model attempts to correct the prior model's errors [15]. Boosting employs multiple weak learners in a highly adaptive way, where each model in the sequence is fitted while giving more importance to observations that the previous models in the sequence handled poorly. Boosting can be used for both regression and classification problems and includes algorithms such as Adaptive Boosting (AdaBoost), Stochastic Gradient Boosting (SGB), and Extreme Gradient Boosting (XGB) [16]–[18]. Several studies have utilized various types of boosting, including AdaBoost for noise detection and speech feature extraction, and XGB for fake news classification [19]–[21]. Boosting provides interpretability and helps reduce variance and bias in machine learning ensembles. However, the drawback of boosting is that each classifier must correct errors in its predecessors, and scaling sequential training can be challenging. Additionally, boosting is computationally costly, vulnerable to overfitting, and slower to train than bagging. To summarize, boosting is an ensemble learning technique that uses a sequential approach, where multiple learners learn sequentially with data dependency, and fusion methods rely on various voting methods. The boosting function is shown as follows:

f(x) = \sum_{i} \lambda_i \, g_i(x)   (2.10)

where several classifiers g_i(x) combine to create a strong classifier f(x). The inclusion of the shrinkage parameter \lambda in the model slows down the process, which enables a wider range of differently shaped trees to be utilized to address the residuals.

Stacking

The stacking method is a powerful model ensembling technique that combines multiple predictive models to generate a new model, also known as a meta-model [22].
The stacking model architecture consists of two or more base models (level 0 models) and a meta-model (level 1 model) that integrates the predictions of the base models. The base models are fit on the training data and their predictions are compiled, while the meta-model learns how to best combine these predictions. One of the key benefits of stacking is its ability to provide a deeper understanding of the data, leading to increased precision and effectiveness in predictions. However, a major challenge with stacking is overfitting, which can occur when there are many predictors that all predict the same target. Additionally, multi-level stacking can be both data- and time-intensive, as each layer adds multiple models. As the amount of available data grows exponentially, computation time complexity becomes an issue, and highly complex models may take months to run [23]. Another challenge with stacking is interpreting the final model, as well as identifying the appropriate number of baseline models that can be relied upon to generate better predictions from the dataset when designing a stacking ensemble from scratch. The problem of multi-label classification also poses issues such as overfitting and the curse of dimensionality due to the high dimensionality of the data [24]. To sum up, stacking is a parallel ensemble method that creates baseline learners simultaneously without any data dependency. The meta-learning method determines the fusion techniques. Although stacking can be very successful, it poses several challenges such as overfitting, complexity, and interpretability, which must be taken into account when using it. The stacking model is as follows:

f_s(x) = \sum_{i=1}^{n} a_i f_i(x)   (2.11)

Stacking takes the predictions from several models (f_1, f_2, \dots, f_n) to build a new model, and this new model is used to make predictions on the test dataset. Stacking aims to improve a model's predictive ability. To put it simply, stacking combines the predictions of multiple models by assigning a weight a_i to each prediction and adding them together in a linear combination.

2.4 Model Evaluation

2.4.1 Stratified K-Fold Cross-Validation

Since the datasets are highly skewed, stratified K-fold cross-validation is used to evaluate the ML models. In K-fold cross-validation, the original dataset is first divided into K folds of approximately equal size. The model is built on K−1 folds used as the training set and tested on the remaining K-th fold. This process is repeated K times, with each fold taking a turn as the test set, and the K performance estimates are then summarized. The major disadvantage of the plain K-fold design is that, when there is a severe class imbalance in the original data set, the size of one class in a training fold can be much smaller than its share in the original data set [25]. The consequence is that the performance outcome in the validation evaluation can be overly optimistic and in favor of the model. Therefore, stratified K-fold cross-validation is recommended in this case: each fold is stratified in the same way as the original data set, which ensures that the class distribution in every fold matches the class distribution of the original data set.
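As a minimal illustration of the stratified splitting described above (using synthetic data rather than any of the project's datasets), scikit-learn's StratifiedKFold can be used to verify that each fold preserves the overall class ratio:

# Minimal sketch: stratified 5-fold cross-validation preserves the class ratio in every fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for a churn dataset (about 20% churners).
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    churn_rate = y[test_idx].mean()  # stays close to the overall 20% in every fold
    print(f"fold {fold}: test size = {len(test_idx)}, churn rate = {churn_rate:.2f}")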
Figure 2.1: Stratified 5-Fold Cross-Validation (adapted from [25]).

2.4.2 Model Performance Metrics

The performance of the models developed in this study is assessed using the confusion matrix and its derived metrics, namely accuracy, precision, recall, specificity, and F1 score. Figure 2.2 shows the confusion matrix, with the actual and predicted outcome categories plotted against its axes (adapted from [26]).

Figure 2.2: Confusion matrix.

Accuracy

Equation 2.12 calculates accuracy as the proportion of correctly predicted classes over the total number of instances in the dataset [9]. Accuracy measures the overall performance of a model but can be misleading on imbalanced datasets.

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}   (2.12)

Precision

Precision shows the proportion of positive predictions that are truly positive [9]. It is the ratio of correctly positively labeled instances to all positively labeled instances and can be calculated as follows:

\text{Precision} = \frac{TP}{TP + FP}   (2.13)

Recall

Recall is also known as Sensitivity or the True Positive Rate. It measures the percentage of actual positive classes that are predicted as positive [9]. It can be calculated as follows:

\text{Recall} = \frac{TP}{TP + FN}   (2.14)

F1 Score

The F1 Score is a measure calculated from precision and recall (Equation 2.15). It is often preferred over accuracy when the data are unbalanced.

\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}   (2.15)

ROC-AUC

The ROC-AUC is an assessment metric for classification at various discrimination thresholds. The Receiver Operating Characteristic (ROC) is the probability curve, and AUC is the area under the ROC curve, which estimates the degree of separability. It indicates how well the model can distinguish the two classes: the higher the AUC, the better the model [9]. The ROC curve is obtained by plotting the True Positive Rate (TPR, Sensitivity) on the y-axis against the False Positive Rate (FPR, 1−Specificity) on the x-axis, and the AUC is the area under this curve (Fig. 2.3).

Figure 2.3: The ROC-AUC.

2.5 Meta-Analysis

Meta-analysis is a statistical method designed to synthesize and quantitatively analyze the results of multiple scientific studies that investigate the same research question. The term meta-analysis was first coined by statistician Gene V. Glass in the 1970s, who described it as the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings [27]. Glass's work laid the foundation for the development of meta-analysis methodologies, which were enhanced in the late 1970s and 1980s by researchers such as Thomas D. Cook, Donald T. Campbell, and Frank L. Schmidt. These researchers introduced advanced techniques for systematically combining findings, evaluating heterogeneity among study outcomes, and addressing potential biases [28]. One key aspect of meta-analysis is the pooling of effect sizes, which involves aggregating the effect sizes from different studies to obtain an overall effect size. This process is typically conducted using either a fixed-effect model or a random-effect model. The fixed-effect model assumes that there is one true effect size shared by all included studies, regardless of between-study variance. It uses the following formula to calculate the pooled effect size:

\theta = \frac{\sum_{i=1}^{n} w_i \, y_i}{\sum_{i=1}^{n} w_i}   (2.16)

where y_i is the effect size from study i and w_i is the weight of study i, typically the inverse of the variance of y_i.
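To illustrate the inverse-variance pooling in Equation 2.16, the short sketch below computes a fixed-effect pooled estimate and its 95% confidence interval from per-study effect sizes and standard errors; the numbers are made-up placeholders, not results from this study.

# Minimal sketch of fixed-effect (inverse-variance) pooling, following Eq. 2.16.
# The effect sizes and standard errors below are illustrative placeholders only.
import numpy as np

y = np.array([0.88, 0.91, 0.85])      # per-study effect sizes (e.g., mean AUCs)
se = np.array([0.020, 0.015, 0.030])  # per-study standard errors

w = 1.0 / se**2                        # weights: inverse of the variance of each y_i
theta = np.sum(w * y) / np.sum(w)      # pooled effect size (Eq. 2.16)
se_theta = np.sqrt(1.0 / np.sum(w))    # standard error of the pooled estimate
lower, upper = theta - 1.96 * se_theta, theta + 1.96 * se_theta

print(f"pooled effect = {theta:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")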
The random-effect model, on the other hand, assumes that the effect sizes vary between studies due to genuine diversity in study characteristics, and incorporates this heterogeneity into the estimation of the overall effect size. The estimate of the pooled effect size in a random-effect model is given by:

\theta = \frac{\sum_{i=1}^{n} w_i^* \, y_i}{\sum_{i=1}^{n} w_i^*}   (2.17)

where w_i^* = \frac{1}{\mathrm{var}(y_i) + \tau^2}, and \tau^2 (tau-squared) is the estimate of the between-study variance.

By the 1980s and 1990s, meta-analysis was widely accepted in various fields such as medicine, education, the social sciences, and ecology. Prominent organizations like the Cochrane Collaboration and the Campbell Collaboration have promoted systematic reviews and meta-analyses to inform evidence-based practices in healthcare and social policy. Despite its broad applications, traditional meta-analysis using published aggregate data can face limitations, particularly when addressing heterogeneity in treatment effects and effect modifiers. To overcome these, individual participant data meta-analysis is used, employing either a one-stage or a two-stage approach. The one-stage approach involves pooling raw data from all studies into a single model, treating it as if it originated from a single large study. This allows for complex statistical modeling of the data, including interactions between treatment effects and patient characteristics. In contrast, the two-stage approach first analyzes the data separately within each study to produce study-specific estimates and then combines these estimates in a second stage using meta-analytic techniques. Each method has its advantages and considerations, with the choice depending on the specific objectives and available data of the research. Numerous studies across diverse medical and public health domains have utilized IPD meta-analysis to refine treatment protocols and improve outcome predictions. In medical and clinical trials, hypertension trials often employ IPD to assess the effects of treatments on blood pressure, enabling more precise estimates of treatment effects and the exploration of patient-level moderators [29]. Similarly, in cancer treatments, IPD meta-analyses have provided in-depth insights into the effectiveness and patient outcomes of colorectal cancer treatments [30]. In the realm of psychiatric and psychological research, studies such as those on antidepressant efficacy have used IPD to gain a nuanced understanding of treatment responses across different patient subgroups, particularly in moderate depression [30]. Epidemiological studies, including research on the relationship between maternal age and the risk of type I diabetes, leverage IPD to analyze detailed risk factors [30]. In chronic disease research, the Chronic Kidney Disease Prognosis Consortium employs IPD meta-analyses to provide extensive data on disease progression and patient outcomes in chronic kidney disease [31]. Additionally, in public health and preventive medicine, IPD meta-analyses are instrumental in evaluating the effectiveness of interventions aimed at preventing cardiovascular diseases, helping tailor recommendations to individual patient characteristics [32]. These applications highlight the critical role of IPD meta-analyses in producing precise and patient-specific results across various health-related fields.

Chapter 3

Data

This study uses 11 publicly available churn datasets. Table 3.1 shows each dataset's name, source, number of records, and number of features.
Table 3.1: Summary of Churn Datasets

Dataset Names                    Sources          Entries   Features
Telecom Customer Churn           MavenAnalytics   7043      38
Internet Service Churn           Kaggle           72274     11
Bank Customer Churn Records     Kaggle            10000     18
Credit Card Churn                Kaggle           10127     21
E-commerce Churn                 Kaggle           5630      20
Employee Churn                   Kaggle           1070      35
Telco Europa                     Kaggle           190776    20
Telcom Cell2Cell                 Kaggle           71047     70
Membership Subscription          Kaggle           10362     15
Wireless Telecom South Asia      Kaggle           2000      14
Nigeria Telecoms Churn           Kaggle           1401      16

The first dataset examined is related to telecom customer churn. It can be found on the Maven Analytics website and consists of two CSV-formatted tables. The Customer Churn table contains details of 7,043 customers from a telecommunications company in California during one quarter of 2022. Each entry in this dataset corresponds to a customer and includes information about their demographics, location, tenure, services subscribed to, and customer status, among other relevant details. Additionally, the Zip Code Population table offers population estimates for the California zip codes mentioned in the Customer Churn table.

Another dataset used for analysis is the internet service churn dataset available on Kaggle. This dataset compiles customer data from an internet service provider, with a total of 72,274 records containing 11 attributes. These attributes range from indicators of whether customers have subscribed to services such as TV and movie packages, to variables like how long they have been subscribed (subscription age), the average amount they are billed (bill avg), and metrics related to service quality (service failure count, download avg, and upload avg). There are instances of missing data in attributes such as the remaining contract details and internet speed measures (download avg and upload avg), which means that some cleaning or imputation is needed before further analysis can be done. The target feature is the churn attribute, which indicates whether customers are continuing with or ending their services.

Another dataset, the bank customer churn records, is accessible on Kaggle. This dataset contains 10,000 records and 18 features outlining the characteristics of bank clients. It covers personal information, financial details, and patterns of service usage, including name, location, gender, credit score, account balance, and the number of banking products used. The dataset includes factors related to customer satisfaction and engagement levels, such as the Satisfaction Score, Complaints, and Exited statuses. It also covers details like Card Type and Points Earned, which may indicate how effective customer loyalty programs are. There are no missing data in the dataset, making the preprocessing phase simple.

This study also utilizes the credit card churn dataset available on Kaggle. This dataset has 10,127 entries with 23 attributes, offering insights into credit card customers' profiles.
This dataset contains a mix of numerical and categorical data such as customer ID, gender, login device, payment methods, and cashback amount. Some columns have missing values suggesting the need for data cleaning before analysis. The dataset includes features like product purchase counts, customer demographics, and transaction specifics that could aid in understanding customer behavior and predicting churn in an e-commerce setting. The presence of a range of values, in product category and transaction columns implies that the dataset captures a diverse array of customer interactions and preferences. Employee attrition data is also included in this study. This dataset is available on Kaggle, which comprises 1,470 entries with 35 features. These features are categorical and numerical, offering an overview of the employees. It contains the employee demographic features like age and gender, job specifics such as department, role, and monthly earnings, and satisfaction indicators like work environment, job fulfillment, and work-life balance. The target variable is Attrition, signaling whether employees are still part of the organization or have moved on. Another dataset included in this study is Telco Europa. This dataset is accessible on Kaggle, having 190,776 entries with 20 attributes. These features include both categorical and continuous variables. These features include basic service usage metrics and extend to geographical and technical dimensions. Notable categorical variables like cni customer which represents customer identifiers, and churn, a binary indicator show- 24 ing whether a customer has discontinued the service or not. Continuous variables include days life representing the tenure of the customer in days, device technology indicating the type of device technology used, and various metrics related to call minutes such as tot min call out and avg min in 3 for the average incoming call minutes over the last three months. Other important variables are min plan and price plan, indicating the minutes and cost of the customer’s plan, respectively. Additionally, it features technical attributes for both data and voice services, such as tec ant data, state data, city data, tec ant voice, state voice, and city voice. The Telcom Cell2Cell dataset is another dataset that was used in this study. This dataset is downloaded from the Kaggle website. It provided by Duke University, consists of 71,047 records and 70 attributes focused on customer churn within a telecommunications framework. The dataset captures a wide array of information, including basic churn indicators, detailed usage metrics such as minutes of use, charges for overages and roaming, and customer service interactions including call quality metrics like dropped and blocked calls. Additionally, it provides insights into customer demographics and equipment usage, featuring variables that describe age, device usage duration, and possession of web-capable devices. The Membership Subscription dataset was also used in this project. This dataset contains 10,362 entries, detailing various aspects of membership within an organization, encapsulated in 15 attributes. These include the membership number, annual fees, and detailed member demographics such as marital status, gender, annual income, and occupation. Key attributes further encompass the type of membership package, the number of additional members, and the respective payment mode. 
Each entry also tracks the membership life-cycle through start and end dates and the membership status, providing a rich basis for analyzing membership retention.

The South Asian Wireless Telecom Churn dataset comprises 2,000 records with 14 attributes that detail various aspects of customer usage and interaction with a telecom service provider. It includes metrics such as network age, total revenue, and revenue specifics from SMS and data usage, alongside total data volumes. Notable features include categorical variables like user type and favorite services for different months, and continuous variables such as the number of calls made, revenue from on-net and off-net calls, and the count of customer complaints. Furthermore, the dataset tracks changes in customer status through the Class attribute, which indicates whether customers have continued or discontinued their services.

The Nigeria Telecoms Churn dataset encompasses 1,401 entries with 16 attributes, detailing the telecom usage and interactions of customers in Nigeria. It includes comprehensive metrics such as network age, customer tenure, total expenditure over two months, SMS and data spending, and data consumption. The dataset further assesses customer engagement through metrics like the total number of unique calls, spending on calls within and outside the network, and the number of complaint calls to the service center. Additionally, it captures the network technology type preferred by customers across two months and their favored competitor networks during the same period. A pivotal aspect of this dataset is the Churn Status, which indicates whether customers have remained with or left the service provider, providing critical insights for churn analysis and customer retention strategies.

Chapter 4

Methodology

This research adopts an individual participant data meta-analysis approach to evaluate the effectiveness of various machine learning models in predicting customer churn across multiple datasets. This methodology allows for a more nuanced analysis by integrating raw data from different studies, thus facilitating a comprehensive comparison and validation of predictive models under varying conditions. Since the public datasets vary widely in the number of features and in feature names, the two-stage IPD-MA approach is applied in this project. In the first stage, the IPD of each study (dataset) is analyzed separately to obtain aggregate data, namely the ML performance effect estimates and their standard errors. In the second stage, the aggregate data obtained from the first stage are combined in a standard meta-analysis model. In this project, both common-effect (fixed-effect) and random-effects models are applied. Figure 4.1 shows the workflow of this project's research methodology, in which the first-stage analysis processes are represented in the blue box and the second stage in the green box.

Figure 4.1: Two-stage IPD-MA Workflow

4.1 Individual Dataset - Exploratory Data Analysis and Data Processing

4.1.1 Exploratory Data Analysis (EDA)

This EDA process was applied in the same way to all 11 datasets, with visualizations and statistical analyses tailored to the specific characteristics of each dataset. The EDA process included creating summary data tables for each dataset to identify unique values, missing values, NaN values, duplicated entries, and data types.
It also involved visualizing the distribution of the target variable to identify class imbalances and analyzing the distribution of categorical features to understand customer preferences and service uptake. Additionally, the distribution of numerical features was examined to assess data range, central tendency, and dispersion. Finally, the correlation matrix was evaluated to identify multicollinearity and guide feature selection. These steps ensure a robust exploratory analysis, providing valuable insights that inform the predictive modeling and strategic decision-making processes.

We begin by creating a summary data table for each dataset. The summary data table is a valuable tool in the exploratory data analysis phase, providing a concise overview of the dataset's characteristics. These tables highlight the number of unique values, missing values, NaN values, duplicated entries, and data types for each feature. This information is crucial for identifying categorical versus continuous features, guiding the handling of missing data, and ensuring appropriate data types for analysis and modeling.

Table 4.1: Summary of Internet Churn Dataset

Feature                        Unique   Missing   NaN     Duplicated   Dtypes
reamining contract             247      21572     21572   0            float64
download avg                   2856     381       381     0            float64
upload avg                     802      381       381     0            float64
id                             72274    0         0       0            int64
is tv subscriber               2        0         0       0            int64
is movie package subscriber    2        0         0       0            int64
subscription age               1110     0         0       0            float64
bill avg                       179      0         0       0            int64
service failure count          19       0         0       0            int64
download over limit            8        0         0       0            int64
churn                          2        0         0       0            int64

For instance, Table 4.1 reveals that the reamining contract feature has a significant number of missing values, necessitating imputation or exclusion, while binary features like is tv subscriber are confirmed as categorical. Additionally, understanding the extent of missing data and identifying features with high cardinality or duplicates helps prioritize data cleaning efforts and informs feature engineering decisions.

The distribution of the target variable is a crucial aspect of EDA, as it provides valuable insights into the balance or imbalance within the dataset. This information is essential for understanding the dataset's characteristics and informing subsequent modeling decisions. Figure 4.2 illustrates the varying distributions of the target variables across the 11 datasets, providing clear examples of where imbalances exist.

Figure 4.2: Target Distribution.

For instance, the Telecom Customer Status plot in Figure 4.2 shows a significant difference between the number of customers who stayed versus those who churned, indicating a class imbalance. Identifying such imbalances is critical because they can bias the predictive model, leading to poor performance in predicting the minority class. Similarly, the Credit Card Attrition plot reveals an imbalance with more existing customers than attrited customers, highlighting the need for strategies to handle such disparities in the data. Understanding these imbalances helps in selecting appropriate evaluation metrics like precision, recall, F1-score, and ROC-AUC, which are more informative for imbalanced datasets.

Figure 4.3: Categorical Features Distribution in Credit Card Dataset.

Next, the distribution of categorical features is crucial in exploratory data analysis, as it provides valuable insights into the composition and characteristics of the dataset.
Understanding these distributions aids in feature engineering and selection, allowing me to decide which features to include in their models and how to preprocess them effectively. Figure 4.3 displays the distribution of several categorical features from the credit card dataset. Each subplot provides a count plot for a specific categorical feature, revealing key patterns and distributions within the data. For instance, the Distribution of Gender subplot shows a nearly balanced distribution between female and male customers, with a slight predominance of females. While, the Distribution of Education Level indicates that most customers are graduates, followed by high school graduates and those with unknown educational backgrounds, while the least represented are doctorate holders. The distribution of numerical features is an essential part of EDA. This analysis provides a comprehensive understanding of the dataset’s range, central tendency, and 31 Figure 4.4: Numerical Features Distribution in Bank Dataset. dispersion. For example, histograms and Kernel Density Estimation plots can reveal whether the data is normally distributed, skewed, or contains outliers, which is crucial for deciding on the appropriate statistical methods and data transformations to apply. This step ensures that the numerical data is accurately represented and ready for further analysis and modeling. Figure 4.4 displays the distribution of several numerical features from the bank customer dataset. The Distribution of CreditScore subplot shows an approximately normal distribution, centered around a mean value with a slight skew towards higher scores, indicating most customers have credit scores between 500 and 800, peaking around 700. The Distribution of Age is right-skewed, with the majority of customers aged between 30 and 50. The heatmap, which visualizes the correlation matrix, is another powerful tool in EDA. It helps in identifying multicollinearity among numerical features by displaying the correlation coefficients between pairs of variables. High correlations indicate redundancy, which can affect the performance of predictive models. For instance, if two features are highly correlated, one might be dropped or transformed to avoid multicollinearity issues in the model. The heatmap provides a clear and immediate visual representation of these relationships, making it easier to spot and address potential problems. By understanding 32 Figure 4.5: Correlations Among Numerical Features in Telcom Churn Dataset. the correlations, data scientists can make informed decisions about feature selection and engineering, ensuring that the models are both efficient and effective. Figure 4.5 shows the correlation coefficients among numerical variables in the telecommunications data set. It can be seen that there is a strong correlation between total revenue and total charges, with a correlation of 0.97. 4.1.2 Data Preparation Data preparation is the next step after data collection and before machine learning model building in this workflow. It includes data splitting, missing value treatment, scaling numerical variables, encoding categorical features, imbalanced data handling, etc. All these are vital steps for data preparation before training a machine learning algorithm. 33 Data Splitting Splitting the dataset into train and test sets is done first in the data preparation to avoid any potential data leakage during this step. Data leakage occurs when training models can access information outside of the training dataset [33]. 
4.1.2 Data Preparation

Data preparation is the next step after data collection and before machine learning model building in this workflow. It includes data splitting, missing value treatment, scaling of numerical variables, encoding of categorical features, and handling of imbalanced data. All of these are vital steps before training a machine learning algorithm.

Data Splitting

Splitting the dataset into train and test sets is done first in the data preparation to avoid any potential data leakage. Data leakage occurs when training models can access information outside of the training dataset [33]. This can lead to overestimating a model's performance during training: once the model is deployed in production, its performance can decrease significantly, and its predictions turn out to be far less reliable than the training performance would suggest. Eliminating data leakage ensures that the performance observed during training better reflects what the model can do on data it has not seen before. Since the telecom customer churn dataset is imbalanced, stratified splitting was applied to ensure that the class distributions of the training and test sets are similar [34]. The initial dataset was split into 75 percent and 25 percent for the train and test sets, respectively. The random state was fixed to make the results reproducible.

Handling Missing Values

First, some non-informative columns, namely customer id, city, zip code, latitude, and longitude, were dropped, assuming that these variables do not offer predictive power. In addition, the directly churn-related columns, churn category and churn reason, were also removed because they are only available after a customer has churned; if these two columns were included, the resulting churn model would be overly optimistic. The process continues with the imputation of missing values for categorical variables. As pointed out in the exploratory data analysis, for service-related features such as multiple lines, internet type, and online security, among others, the missing entries are filled with No because these customers do not have the service. By replacing missing categorical data with meaningful labels, we avoid the bias that would arise from dropping rows with missing data, which may lead to the loss of important information. For continuous variables such as average monthly long distance charges or average monthly GB download, the imputation method used is zero filling, on the rationale that these customers do not subscribe to phone or internet service.

Scaling Numerical Variables

The numerical columns were scaled using min-max scaling. In this study, the MinMaxScaler function from the sklearn.preprocessing module is used to implement the scaling. It subtracts the minimum feature value from each value and then divides by the difference between the maximum and minimum feature values. This transformation maps all numerical features into the range between 0 and 1, which benefits machine learning algorithms that are sensitive to the scale of the input features [35].

Encoding Categorical Variables

In the encoding phase, binary categorical variables such as gender, married, phone service, and paperless billing are manually mapped to numeric binary values, with female and yes mapped to 1, and male and no mapped to 0. This binary encoding transforms the categorical data into a format suitable for machine learning algorithms, which require numerical input. The remaining categorical variables undergo one-hot encoding, which converts categorical values into a form that machine learning algorithms can use more effectively for prediction. The OneHotEncoder creates a binary column for each category and returns a sparse matrix or dense array [25]. By employing one-hot encoding, the model can interpret the data without falsely attributing order or priority where it does not exist [36].
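A condensed sketch of these preparation steps is given below. It assumes the telecom DataFrame df with a 0/1 churn target and the column groupings described above; the column names and the random state are illustrative, not the project's exact configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Target and features; ID-like and directly churn-related columns are dropped.
X = df.drop(columns=["churn", "customer_id", "city", "zip_code", "latitude", "longitude"])
y = df["churn"]  # assumed to be encoded 0/1

# Stratified 75/25 split performed before any fitting to avoid data leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Missing-value treatment: 'No' for service-related categoricals, zero for usage numerics.
service_cols = ["multiple_lines", "internet_type", "online_security"]          # illustrative
usage_cols = ["avg_monthly_long_distance_charges", "avg_monthly_gb_download"]  # illustrative
for part in (X_train, X_test):
    part[service_cols] = part[service_cols].fillna("No")
    part[usage_cols] = part[usage_cols].fillna(0)

# Min-max scale numerical columns to [0, 1]; the scaler is fitted on the training set only.
num_cols = X_train.select_dtypes(include="number").columns
scaler = MinMaxScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# One-hot encode the remaining categoricals (binary ones could instead be mapped to 0/1).
cat_cols = X_train.select_dtypes(exclude="number").columns
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(X_train[cat_cols])
X_train_prep = np.hstack([X_train[num_cols].to_numpy(), encoder.transform(X_train[cat_cols])])
X_test_prep = np.hstack([X_test[num_cols].to_numpy(), encoder.transform(X_test[cat_cols])])
```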
4.2 Individual Dataset - Models Evaluation

4.2.1 Predictive Modeling

In this study, a large number of machine learning models were implemented to find which model was best suited to the research tasks. Many of these models are ensembles built from other machine learning algorithms. First, logistic regression was selected as the benchmark model. The support vector machine was also included; this model works well in high-dimensional spaces. For comparison, KNN, Gaussian naive Bayes, decision tree, and random forest classifiers were selected. Among these algorithms, KNN is a simple model that nevertheless performs effectively in classification. The Gaussian naive Bayes classifier is a naturally probabilistic approach known for its good overall performance in probabilistic classification. The decision tree is an easy-to-interpret model, and the random forest is robust and mitigates overfitting. As additional benchmarks, several boosting and bagging algorithms were also implemented, including gradient boosting, AdaBoost, and bagging classifiers; these ensemble models combine many weak learners into a strong one. A multi-layer perceptron (MLP) classifier, an artificial neural network capable of capturing highly non-linear relationships, was also employed. Moreover, this study includes sophisticated ensemble methods such as XGBoost and LightGBM, which have proven to be very fast and highly competitive in many machine learning competitions. Each model was initialized with a random seed to ensure that the results are reproducible. Model performance is evaluated using multiple metrics, namely accuracy, precision, recall, and F1 score, to give as complete an evaluation as possible. The final model is decided by weighing all the performance measures as well as other considerations such as interpretability and computational efficiency.

4.2.2 Classifiers Performance Evaluation

All the models were evaluated by stratified k-fold cross-validation with 5 splits. This approach ensures that the proportion of samples from each class is approximately the same in each split. Such splitting gives a better evaluation of model performance than a single train-test split, especially for imbalanced data. The evaluation metrics are accuracy, precision, recall, F1 score, and ROC-AUC. In each step of the cross-validation, the initial training data was split into training and validation sets; the model was trained on the training set, evaluated on the validation set, and the obtained score for each metric was recorded. This process was repeated for all training-validation splits, and the scores were then averaged across folds to get a more stable estimate of the model's performance. After cross-validation, the models were retrained on the entire initial training set and evaluated on the unseen test set, using the same metrics, to assess generalization performance.
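A minimal sketch of this evaluation loop with scikit-learn is shown below, reusing the preprocessed matrices X_train_prep, X_test_prep and labels y_train, y_test from the preceding sketch; the model dictionary is abbreviated and illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    # CatBoost, LightGBM, XGBoost, and the other classifiers would be added in the same way.
}
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    # Stratified 5-fold cross-validation on the training data, averaged across folds.
    scores = cross_validate(model, X_train_prep, y_train, cv=cv, scoring=scoring)
    print(f"{name}: mean ROC-AUC = {scores['test_roc_auc'].mean():.4f}")

    # Retrain on the full training set and check generalization on the held-out test set.
    fitted = model.fit(X_train_prep, y_train)
    print(f"{name}: test accuracy = {fitted.score(X_test_prep, y_test):.4f}")
```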
4.2.3 Hyperparameter Tuning

GridSearchCV and RandomizedSearchCV were used for hyperparameter tuning, with the aim of optimizing the hyperparameters to improve model performance. GridSearchCV performs an exhaustive search over a parameter grid for each classifier: it evaluates all possible combinations of parameters and picks the best one. RandomizedSearchCV, in contrast, evaluates a fixed number of randomly sampled parameter settings. It is computationally less intensive than GridSearchCV, yet it can often find a good approximation of the best parameters. Once the tuning was done, the best parameters for each classifier were determined, and each classifier was retrained with them on the entire training dataset. Finally, the retrained models were assessed on the unseen test dataset, where their ability to generalize can be estimated.

4.3 Aggregate Effect Size and Standard Error

Once the performances of the various machine learning algorithms on the individual datasets were obtained, aggregate metrics were derived for the meta-analysis. This approach provides a comprehensive understanding of the algorithms' performances across multiple datasets and allows for a robust comparison.

Aggregate Effect Size

The aggregate effect size is a critical metric in meta-analysis, offering a consolidated measure of the performance of each ML algorithm across all datasets. It was obtained by calculating the average performance metric of each ML algorithm across all datasets; the mean AUC is used as the aggregated effect size in the second-stage meta-analysis. This aggregate measure helps in understanding the overall efficacy of each algorithm and identifying which performs best on average. Additionally, the average performance of all ML algorithms on each individual dataset was computed, providing insight into which datasets presented the greatest challenges or the easiest conditions for the algorithms.

Standard Error

The standard error (SE) plays a crucial role in understanding the variability and reliability of the aggregate effect size; it measures the precision of the average performance estimates. For each ML algorithm's aggregate performance, the SE was calculated to quantify the dispersion of performance metrics across the datasets. This involves computing the standard deviation of the performance metrics for each algorithm and dividing it by the square root of the number of datasets. Similarly, the SE for the average performance of all ML algorithms on each individual dataset was calculated. This helps in assessing the consistency of the algorithms' performances on a particular dataset and the robustness of the aggregate metrics.

Calculating Aggregate Performance for Each ML Algorithm Across All Datasets

1. For each ML algorithm, the performance metrics from all datasets were collected.
2. The mean performance metric for each algorithm was computed.
3. The standard deviation of these performance metrics was calculated.
4. The SE for each algorithm's average performance was derived using the formula

SE = \frac{\mathrm{SD}}{\sqrt{N}}    (4.1)

where SD is the standard deviation and N is the number of datasets.

Calculating Aggregate Performance for All ML Algorithms on Each Dataset

1. For each dataset, the performance metrics of all ML algorithms were collected.
2. The mean performance metric for each dataset was computed.
3. The standard deviation of these performance metrics was calculated.
4. The SE for each dataset's average performance was derived using the same formula

SE = \frac{\mathrm{SD}}{\sqrt{N}}    (4.2)

where N is the number of evaluated ML models.
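Assuming the cross-dataset results are collected in a long-format DataFrame named results with columns model, dataset, and auc (an illustrative layout, not the project's exact one), these aggregates can be computed as follows.

```python
import numpy as np
import pandas as pd

def aggregate(results: pd.DataFrame, by: str) -> pd.DataFrame:
    # Mean AUC, standard deviation, and SE = SD / sqrt(N), grouped per model or per dataset.
    grouped = results.groupby(by)["auc"]
    out = grouped.agg(mean_auc="mean", sd="std", n="count")
    out["se"] = out["sd"] / np.sqrt(out["n"])
    return out

by_model = aggregate(results, "model")      # Table 5.1-style aggregates
by_dataset = aggregate(results, "dataset")  # Table 5.2-style aggregates
```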
4.4 Estimating Pooled Effect Size and Confidence Interval

4.4.1 Fixed-Effect Model

The fixed-effect model assumes that the effect size is consistent across all studies or datasets, and that any observed variation is due solely to sampling error. The following steps and equations are used to estimate the pooled effect size and its confidence interval (CI) from the aggregated mean AUCs and standard errors.

1. Compute the Weights:
• Calculate the variance of each effect size:

v_i = SE_i^2    (4.3)

• Assign a weight w_i to each effect size, equal to the inverse of its variance:

w_i = \frac{1}{v_i}    (4.4)

• The weight reflects the precision of each effect size, giving more importance to effect sizes with smaller standard errors.

2. Calculate the Pooled Effect Size:
• The pooled effect size \hat{AUC} is a weighted average of the individual effect sizes:

\hat{AUC} = \frac{\sum_{i=1}^{k} w_i \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i}    (4.5)

where k is the number of datasets or studies.

3. Estimate the Variance of the Pooled Effect Size:
• The variance of the pooled effect size, V_{\hat{AUC}}, is calculated as:

V_{\hat{AUC}} = \frac{1}{\sum_{i=1}^{k} w_i}    (4.6)

4. Compute the Standard Error of the Pooled Effect Size:
• The standard error of the pooled effect size is the square root of its variance:

SE_{\hat{AUC}} = \sqrt{V_{\hat{AUC}}}    (4.7)

5. Construct the Confidence Interval:
• The 95% confidence interval for the pooled effect size is calculated using the standard error and the critical value from the standard normal distribution (1.96 for a 95% CI):

CI_{95\%} = \hat{AUC} \pm 1.96 \times SE_{\hat{AUC}}    (4.8)

4.4.2 Random-Effect Model

The random-effects model assumes that the effect sizes vary across studies or datasets due to real differences in effects as well as sampling error. This model incorporates both within-study and between-study variability to provide a more generalizable estimate of the pooled effect size and its confidence interval (CI). The following steps and equations are used to estimate the pooled effect size and its CI from the aggregated mean AUCs and standard errors.

1. Compute the Weights:
• Calculate the variance of each effect size:

v_i = SE_i^2    (4.9)

• Estimate the between-study variance τ² using the DerSimonian and Laird method ([37], [28]):

\tau^2 = \max\left(0, \frac{Q - (k - 1)}{C}\right)    (4.10)

where

Q = \sum_{i=1}^{k} w_i^* (\hat{AUC}_i - \hat{AUC}_w)^2    (4.11)

w_i^* = \frac{1}{v_i}    (4.12)

\hat{AUC}_w = \frac{\sum_{i=1}^{k} w_i^* \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i^*}    (4.13)

C = \sum_{i=1}^{k} w_i^* - \frac{\sum_{i=1}^{k} (w_i^*)^2}{\sum_{i=1}^{k} w_i^*}    (4.14)

• Compute the random-effects weights:

w_i = \frac{1}{v_i + \tau^2}    (4.15)

2. Calculate the Pooled Effect Size:
• The pooled effect size \hat{AUC} is a weighted average of the individual effect sizes:

\hat{AUC} = \frac{\sum_{i=1}^{k} w_i \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i}    (4.16)

where k is the number of datasets or studies.

3. Estimate the Variance of the Pooled Effect Size:
• The variance of the pooled effect size, V_{\hat{AUC}}, is calculated as:

V_{\hat{AUC}} = \frac{1}{\sum_{i=1}^{k} w_i}    (4.17)

4. Compute the Standard Error of the Pooled Effect Size:
• The standard error of the pooled effect size is the square root of its variance:

SE_{\hat{AUC}} = \sqrt{V_{\hat{AUC}}}    (4.18)

5. Construct the Confidence Interval:
• The 95% confidence interval for the pooled effect size is calculated using the standard error and the critical value from the standard normal distribution (1.96 for a 95% CI):

CI_{95\%} = \hat{AUC} \pm 1.96 \times SE_{\hat{AUC}}    (4.19)
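The two pooling procedures above translate directly into a short numpy routine. The sketch below operates on arrays of per-study mean AUCs and standard errors, reusing the per-model aggregates from the earlier sketch; variable names are illustrative.

```python
import numpy as np

def pool_fixed(auc, se):
    # Fixed-effect pooling (Eqs. 4.3-4.8): inverse-variance weights.
    w = 1.0 / se**2
    pooled = np.sum(w * auc) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

def pool_random(auc, se):
    # Random-effects pooling with DerSimonian-Laird tau^2 (Eqs. 4.9-4.19).
    v = se**2
    w_star = 1.0 / v
    auc_w = np.sum(w_star * auc) / np.sum(w_star)
    q = np.sum(w_star * (auc - auc_w) ** 2)
    c = np.sum(w_star) - np.sum(w_star**2) / np.sum(w_star)
    tau2 = max(0.0, (q - (len(auc) - 1)) / c)
    w = 1.0 / (v + tau2)
    pooled = np.sum(w * auc) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

# Example with the per-model aggregates computed earlier.
auc = by_model["mean_auc"].to_numpy()
se = by_model["se"].to_numpy()
print(pool_fixed(auc, se), pool_random(auc, se))
```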
4.5 Residual Heterogeneity Estimates

Residual heterogeneity refers to the variability in effect sizes that remains unexplained after accounting for within-study variance in the meta-analysis. This section outlines the methodology for estimating residual heterogeneity in both fixed-effect and random-effect models.

4.5.1 Test for Heterogeneity

The test for heterogeneity examines whether the variability observed among study estimates is greater than what would be expected by chance alone. In this context, the null hypothesis (H0) is that all studies share a common effect size, suggesting no significant heterogeneity. The alternative hypothesis (HA), on the other hand, proposes that there is significant heterogeneity, indicating that the variability among study estimates is not solely due to sampling error but also due to true differences in effect sizes. The Q statistic is used to test for heterogeneity; it follows a chi-square distribution with k - 1 degrees of freedom, where k is the number of studies. A significant Q statistic (p-value < 0.05) supports rejection of the null hypothesis, confirming the presence of heterogeneity among the studies.

4.5.2 Fixed-Effect Model

In a fixed-effect model, it is assumed that there is a single true effect size common to all studies, and that any observed variation in effect sizes is due solely to within-study sampling error. Unexplained heterogeneity can nevertheless be assessed using the following steps:

1. Calculate the Q Statistic: The Q statistic tests for heterogeneity by comparing the observed variability in effect sizes to what would be expected by chance alone:

Q = \sum_{i=1}^{k} w_i (\hat{AUC}_i - \hat{AUC}_w)^2    (4.20)

where w_i is the weight assigned to each effect size (the inverse of its variance), \hat{AUC}_i is the observed effect size for study i, and \hat{AUC}_w is the weighted average effect size.

2. Calculate the I² Statistic: The I² statistic quantifies the proportion of total variation in effect sizes that is due to heterogeneity rather than chance:

I^2 = \frac{Q - (k - 1)}{Q} \times 100\%    (4.21)

An I² value greater than 50% typically indicates substantial heterogeneity.

3. Calculate the H² Statistic: The H² statistic represents the ratio of total variability to sampling variability:

H^2 = \frac{Q}{k - 1}    (4.22)

Values of H² greater than 1 indicate variability beyond what would be expected from sampling error alone.

4.5.3 Random-Effect Model

In a random-effect model, it is assumed that there is not a single true effect size but rather a distribution of true effect sizes across studies. The model accounts for both within-study variance and between-study variance, the latter referred to as tau-squared (τ²). The steps to estimate residual heterogeneity in a random-effect model are as follows:

1. Estimate the Between-Study Variance (τ²): The DerSimonian and Laird method is commonly used to estimate the between-study variance:

\tau^2 = \max\left(0, \frac{Q - (k - 1)}{C}\right)    (4.23)

where

Q = \sum_{i=1}^{k} w_i^* (\hat{AUC}_i - \hat{AUC}_w)^2    (4.24)

w_i^* = \frac{1}{v_i}    (4.25)

\hat{AUC}_w = \frac{\sum_{i=1}^{k} w_i^* \, \hat{AUC}_i}{\sum_{i=1}^{k} w_i^*}    (4.26)

C = \sum_{i=1}^{k} w_i^* - \frac{\sum_{i=1}^{k} (w_i^*)^2}{\sum_{i=1}^{k} w_i^*}    (4.27)

2. Compute the Random-Effects Weights: Adjust the weights to account for between-study variability:

w_i = \frac{1}{v_i + \tau^2}    (4.28)

where v_i is the within-study variance.

3. Calculate the I² and H² Statistics: As in the fixed-effect model, I² quantifies the proportion of total variation due to heterogeneity, and H² represents the ratio of total variability to sampling variability:

I^2 = \frac{Q - (k - 1)}{Q} \times 100\%    (4.29)

H^2 = \frac{Q}{k - 1}    (4.30)
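A sketch of these heterogeneity statistics, reusing the inverse-variance weights and the auc and se arrays from the pooling routine above:

```python
import numpy as np

def heterogeneity(auc, se):
    # Q, I^2, H^2 computed from fixed-effect (inverse-variance) weights, Eqs. 4.20-4.22.
    w = 1.0 / se**2
    auc_w = np.sum(w * auc) / np.sum(w)
    q = np.sum(w * (auc - auc_w) ** 2)
    df = len(auc) - 1
    i2 = (q - df) / q * 100   # per cent; negative values are usually truncated at zero
    h2 = q / df
    return q, i2, h2

print(heterogeneity(auc, se))
```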
4.6 Funnel Plots

Funnel plots are a graphical tool widely used in meta-analyses to assess publication bias and heterogeneity among studies. They are constructed by plotting the performance measure on the horizontal axis against a measure of study precision or size on the vertical axis. In the machine learning evaluations here, the mean performance metric of a model is plotted against the inverse of the standard error of the performance measure. This arrangement allows for the visual detection of asymmetry in the data that might suggest potential biases. In this research, funnel plots are employed to visually assess the bias and variance in the performance of the machine learning models. Two types of plots were prepared: funnel plots for each machine learning model across multiple datasets, and funnel plots for each dataset across different machine learning models. These plots are critical for identifying whether certain models or datasets consistently perform differently from the general trend and whether there are any signs of publication bias in reporting model performances.

Chapter 5

Results

5.1 Aggregated Data

This section presents the aggregated results of the mean AUC, standard deviation (SD), and standard error (SE) for each machine learning (ML) model and each dataset. The results provide a general understanding of the performance metrics across the various ML models and datasets, highlighting both the models' overall effectiveness and their variability. These data serve as the input for the second stage of the meta-analysis.

5.1.1 Performance Metrics by Machine Learning Model

Table 5.1 provides the mean AUC, standard deviation, and standard error for each ML model. These metrics reflect the aggregated performance of each model across all datasets.

Table 5.1: Mean AUC, Standard Deviation, and Standard Error for Each Model

Model                 Mean AUC   Standard Deviation (SD)   Standard Error (SE)
AdaBoost              0.8836     0.0808                     0.0244
Bagging               0.8682     0.0964                     0.0291
Catboost              0.9036     0.0836                     0.0252
Decision Tree         0.7609     0.1291                     0.0389
GaussianNB            0.7882     0.0708                     0.0214
Gradient Boosting     0.8936     0.0794                     0.0239
KNeighbors            0.7845     0.1085                     0.0327
Lightgbm              0.9000     0.0860                     0.0259
Logistic Regression   0.8345     0.0859                     0.0259
MLP                   0.8745     0.0929                     0.0280
Random Forest         0.8891     0.0884                     0.0266
XGboost               0.8918     0.0923                     0.0278

Catboost achieves the highest mean AUC (0.9036) among the ML models, indicating its strong overall performance across the datasets. Lightgbm (0.9000), Gradient Boosting (0.8936), and Random Forest (0.8891) also demonstrate high mean AUCs, suggesting their effectiveness in handling diverse datasets. The Decision Tree model shows the lowest mean AUC (0.7609) and the highest standard deviation (0.1291), indicating significant variability in its performance; its high standard error (0.0389) further confirms this variability.

5.1.2 Performance Metrics by Dataset

Table 5.2 summarizes the mean AUC, standard deviation, and standard error for each dataset. These metrics reflect the aggregated performance of all ML models on each dataset. The Internet dataset has the highest mean AUC (0.9508) with a relatively low standard deviation (0.0456) and standard error (0.0132), indicating consistent performance across ML models.
Table 5.2: Mean AUC, Standard Deviation, and Standard Error for Each Dataset

Dataset              Mean AUC   Standard Deviation (SD)   Standard Error (SE)
Telcom               0.9042     0.0476                     0.0137
Internet             0.9508     0.0456                     0.0132
Bank                 0.8242     0.0570                     0.0164
Credit Card          0.9517     0.0491                     0.0142
E-Commerce           0.9467     0.0557                     0.0161
Employee             0.8125     0.0885                     0.0256
Telco Europa         0.8942     0.0763                     0.0220
Cell2Cell            0.8167     0.0587                     0.0169
Membership           0.6700     0.0413                     0.0119
SA Wireless Telcom   0.8000     0.0644                     0.0186
Niger Telcom         0.8458     0.0799                     0.0231

Similarly, the Credit Card dataset also exhibits a high mean AUC (0.9517), demonstrating the models' strong performance on this dataset. On the other hand, the Membership dataset has the lowest mean AUC (0.6700), suggesting that the models performed less effectively on it. The Employee dataset shows the highest standard deviation (0.0885) and standard error (0.0256), indicating higher variability in model performance.

5.2 Fixed- and Random-Effect Model Results

This section presents the results of the meta-analysis conducted using both fixed-effect and random-effect models for each machine learning model and each dataset. The forest plots generated from these analyses are discussed to provide a comprehensive understanding of the performance metrics and their variability across ML models and datasets.

5.2.1 Meta-Analysis by Machine Learning Model

Figure 5.1: Fixed-Effect Model Forest Plot by ML.

The fixed-effect model forest plot (Figure 5.1) indicates that Catboost and Lightgbm have the highest mean AUCs (both around 0.90), with relatively narrow confidence intervals, suggesting consistently high performance across datasets. The Decision Tree model has the lowest mean AUC (0.76) with a wide confidence interval, indicating significant variability in performance. The overall fixed-effect model estimate for the ML models is 0.86 [0.85, 0.88], representing the combined performance across all models. These results indicate a high average performance (AUC) with narrow confidence intervals, suggesting precise estimates. However, the significant heterogeneity suggests that the fixed-effect model may not fully account for the variability across studies.

Figure 5.2: Random-Effect Model Forest Plot by ML.

The random-effect model forest plot (Figure 5.2) shows similar results, with Catboost and Lightgbm still performing best. The overall random-effect model estimate is 0.86 [0.83, 0.89], essentially the same as the fixed-effect estimate but with a wider confidence interval, reflecting the incorporation of between-model variability. The Decision Tree model continues to show the lowest performance with considerable variability.

5.2.2 Meta-Analysis by Dataset

The fixed-effect and random-effect models were applied to the aggregated performance data for each dataset. The forest plots for these models are shown in Figures 5.3 and 5.4.

Figure 5.3: Fixed-Effect Model Forest Plot by Dataset.

The fixed-effect model forest plot (Figure 5.3) shows that the Credit Card, Internet, and E-Commerce datasets have the highest mean AUCs, all around 0.95. These datasets demonstrate narrow confidence intervals, indicating consistent performance across different ML models. The Membership dataset has the lowest mean AUC (0.67), with a narrower confidence interval, indicating less variability in performance but consistently lower AUC values. The overall fixed-effect model estimate is 0.85 [0.84, 0.86], representing the combined performance across all datasets.
These results indicate a high average performance (AUC) with narrow confidence intervals, suggesting precise estimates. The very high heterogeneity suggests that the fixed-effect model may not fully account for the variability across datasets. The random-effect model forest plot (Figure 5.4) provides a similar view but accounts for between-study variability. The overall random-effect model estimate is slightly higher at 0.86 [0.81, 0.91], reflecting the incorporation of heterogeneity among datasets. The Credit Card and Internet datasets maintain high mean AUCs with slightly wider confidence intervals, indicating some variability in model performance across these datasets. In general, the meta-analysis results highlight the variability in performance across different ML models and datasets. Among the ML models, Catboost, Lightgbm, and Gradient Boosting are the top performers, whereas the Decision Tree model shows less consistent results. The Credit Card, Internet, and E-Commerce datasets generally exhibit high AUCs, while the Membership dataset poses more challenges. These insights provide a clear understanding of how different datasets and models perform, guiding future model 52 Figure 5.4: Random-Effect Model Forest Plot by Dataset. selection and optimization efforts. 5.3 Residual Heterogeneity Estimates 5.3.1 Residual Heterogeneity Estimates by Machine Learning Model For the fixed-effect model analysis, which assumes a common effect size across all studies, the Q statistic was calculated to be 35.2159 with 11 degrees of freedom, resulting in a p-value of 0.0002. This significant result indicates substantial heterogeneity among the studies, leading to the rejection of the null hypothesis. The I2 statistic, which quantifies the proportion of total variability due to heterogeneity, was found to be 68.76%, indicating a considerable level of heterogeneity. The H2 statistic, which measures total 53 variability relative to sampling variability, was calculated as 3.20, further suggesting notable heterogeneity. In the random-effect model analysis, which accounts for both within-study and between-study variability, similar patterns of heterogeneity were observed. The I2 statistic remained high at 68.57%, and the H2 value was 3.18. Additionally, the tau-squared (Tau2 ) value, representing the total amount of heterogeneity, was estimated at 0.0016 with a standard error of 0.0010, and the square root of Tau2 (Tau) was 0.0395. The Q statistic was consistent with the fixed-effect model, reinforcing the conclusion of significant heterogeneity. In the analysis by dataset using the fixed-effect model, a high degree of heterogeneity was evident. The Q statistic was exceptionally high at 408.4644 with 10 degrees of freedom, yielding a p-value of less than 0.0001, indicating significant heterogeneity among the datasets. The I2 statistic was calculated to be 97.55%, highlighting an extremely high level of heterogeneity. The H2 statistic was 40.85, suggesting that the observed variability greatly exceeds what would be expected due to sampling variability alone. The random-effect model analysis produced similar findings. The I2 statistic was slightly lower at 96.44%, still indicating substantial heterogeneity. The H2 value was 28.13, again suggesting significant variability beyond sampling error. The Tau2 value was estimated at 0.0072 with a standard error of 0.0033, and the square root of this value (Tau) was 0.0846. 
The Q statistic remained consistent with the fixed-effect model, further confirming the presence of significant heterogeneity across the datasets. In summary, the residual heterogeneity estimates, as measured by I2 , H2 , and the Q statistic, indicate that both the ML model and dataset analyses exhibit substantial heterogeneity. This suggests that the observed variability among the studies is due to true differences in effect sizes rather than mere sampling error. 54 5.4 Funnel Plots Analysis 5.4.1 Funnel Plot Analysis by Machine Learning Model Figure 5.5: Fixed-Effect Model Funnel Plot by Machine Learning Model. The fixed-effect model’s funnel plot (Figure 5.5) displays a tight clustering of machine learning models around the peak mean AUC value of approximately 0.9. Although this concentration suggests a high degree of consistency in model performance, a slight asymmetry in the distribution of points around the vertical pooled effect line indicates a potential minor bias in the pooled effect estimation that is slightly leaning toward high performance models. In addition to the traditional interpretation of publication bias using the funnel plot, in the context of this study, the funnel plot can be used to select models with the optimum bias and variance trade-off. Notably, models like Catboost and LightGBM, which are situated at the peak, achieve top AUC scores with minimal standard errors, showcasing their effectiveness in balancing bias and variance. In contrast, models such as Decision Tree and KNeighbors, appearing towards the lower 55 end of the spectrum, exhibit variability that may reflect their sensitivity to training data variations or inherent methodological limitations. Figure 5.6: Random-Effect Model Funnel Plot by Machine Learning Model. The random effect model’s funnel plot (Figure 5.6) also shows a symmetrical distribution of machine learning models around a high mean AUC value. The slight asymmetry around the vertical pooled effect line in this plot also suggests a potential minor bias in the pooled effect estimation. This observation supports the use of the funnel plot not only for detecting potential biases in pooled effect estimation but also for choosing models that maintain an optimal balance between bias and variance. High-performing models such as Catboost and LightGBM continue to demonstrate top-tier performance with low variability, while models like Decision Tree and KNeighbors display greater dispersion, highlighting their potential instability and sensitivity to different data scenarios. 56 Figure 5.7: Fixed-Effect Model Funnel Plot by Dataset. 5.4.2 Funnel Plot Analysis by Dataset The fixed effect model’s funnel plot by dataset (Figure 5.7) illustrates a symmetric distribution around the pooled mean AUC line, indicating a balanced and unbiased representation of datasets. The consistency and high performance of datasets such as Telco Europa and Telecom, which are noted for high mean AUC values with minimal standard errors, demonstrate their robustness and reliability. Conversely, datasets like Membership, which show lower mean AUC values and higher variability, may suggest specific challenges affecting the models’ performance. The random effect model’s funnel plot (Figure 5.8) maintains the symmetry observed in the fixed effect model, confirming the absence of bias in the dataset evaluations. 
This plot emphasizes the strong and consistent performance across datasets such as Internet, Credit Card, and Membership, demonstrating their suitability for testing models in various analytical conditions, with high mean AUC values and minimal variability indicating 57 Figure 5.8: Random-Effect Model Funnel Plot by Dataset. stable model performance. 5.5 Feature Importance The feature importance analysis was conducted using the top three performing machine learning models: Catboost, LightGBM, and Gradient Boosting, across multiple datasets. This analysis aimed to identify the key drivers of customer churn. Figures 5.9 to 5.19 illustrate the top 10 most important features as determined by each model for various datasets. In the Catboost model, significant patterns were observed across the datasets. For instance, in the Bank dataset, features like Age, Balance, and CreditScore were identified as the most critical, emphasizing the role of financial stability and customer demographics in predicting churn. Similarly, the Credit Card dataset highlighted the importance of transaction behavior, with Total Trans Ct and Total Trans Amt emerging as top fea58 Figure 5.9: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Bank Dataset. tures. In the E-Commerce dataset, features such as Tenure and NumberOfAddress were crucial, suggesting that customer loyalty and geographic stability significantly influence churn. The Employee dataset’s key features were Age and OverTime No, pointing to the impact of employee demographics and work patterns on churn. For the Internet dataset, the most important features were Remaining contract and bill avg, underscoring the significance of contract terms and billing amounts in customer retention. In the Membership dataset, the analysis revealed that features like MEMBERSHIP TERM YEARS, ANNUAL FEES, and MEMBER ANNUAL INCOME were the top predictors, suggesting that financial stability and membership tenure are key factors influencing churn. The Nigeria Telecom dataset emphasized the importance of usage and spending patterns, with Total Spend in Months 1 and 2 of 2017 and network age emerging as significant features. The SA Wireless dataset showed that Aggregate Total Rev, Aggregate SMS Rev, and network age were critical in predicting churn, pointing to the role of revenue and usage behavior. Other datasets, such as Telco Europa and Telecom, consistently showed that Tenure and ContractMonthtoMonth were significant, indicating the importance of contract types and customer retention over time. In the LightGBM model, financial and usage-related features were consistently impor59 Figure 5.10: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Credit Card Dataset. tant across different datasets. For example, in the Bank dataset, Balance and EstimatedSalary emerged as top predictors, highlighting financial factors’ influence on churn. The Credit Card dataset identified Total Trans Amt and Total Amt Chng Q4 Q1 as leading indicators, pointing toward the significance of transaction changes and amounts in churn prediction. The E-Commerce dataset emphasized the role of incentives and logistics, with CashbackAmount and WarehouseToHome identified as top features. In the Employee dataset, financial compensation, represented by features such as DailyRate and MonthlyIncome, was crucial in predicting employee churn. 
For the Internet dataset, billing and subscription duration were significant, with Bill avg and subscription age identified as the most important features. In the Cell2Cell dataset, changem and mou emerged as top predictors, highlighting the significance of customer behavior and usage patterns. The Membership dataset identified ANNUAL FEES and MEMBER ANNUAL INCOME as leading indicators, suggesting that financial aspects are crucial in predicting membership churn. In the Nigeria Telecom dataset, features such as Total Onnet spend and Total Data Consumption were identified as significant, pointing towards the importance of usage patterns in churn prediction. The SA Wireless dataset highlighted the role of aggregate revenue and data usage, with Aggregate SMS Rev and network age emerging 60 Figure 5.11: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for E-Commerce Dataset. as top features. In both Telco Europa and Telecom datasets, Monthly Charge and Age were the top features, suggesting that billing and customer demographics are pivotal in churn prediction. The Gradient Boosting model also revealed critical insights into the drivers of churn across various datasets. In the Bank dataset, the focus was on product usage, with NumOfProducts 2 and NumOfProducts 1 emerging as the most important features. The Credit Card dataset highlighted the significance of revolving balances and transaction counts, with Total Revolving Bal and Total Trans Ct identified as top features. In the E-Commerce dataset, customer loyalty and complaints were significant predictors, with Tenure and Complain 0 as key features. The Employee dataset emphasized job roles and work-life balance, with JobRole Manufacturing Director and WorkLifeBalance 1 identified as critical features. The Internet dataset highlighted the importance of contract terms and usage patterns, with remaining contract and download avg emerging as top features. In the Cell2Cell dataset, the focus was on retention efforts, with retcalls 0 and retcalls 1 emerging as the most important features, indicating the importance of retention calls in predicting churn. The Membership dataset emphasized the importance of package types and occupation codes, with MEMBERSHIP PACKAGE TYPE-A 61 Figure 5.12: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Employee Dataset. and MEMBER OCCUPATION CD 2.0 identified as key features. In the Nigeria Telecom dataset, the analysis highlighted the significance of competitor networks, with Most Loved Competitor network in Month 2 Weematel and Month 2 Zintel emerging as top predictors. The SA Wireless dataset showed the importance of aggregate revenue and favorite network features, with sep fav a ufone and sep fav a telenor identified as critical. Similar patterns were observed in the Telco Europa and Telecom datasets, where Contract Month-to-Month and Number of Dependents were crucial in predicting churn. 62 Figure 5.13: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Internet Dataset. Figure 5.14: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Cell2Cell dataset. 63 Figure 5.15: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Membership dataset. Figure 5.16: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the Nigeria Telecom dataset. 
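The importance rankings plotted in Figures 5.9 to 5.19 come from each fitted model's built-in importance scores. The following is a minimal sketch of how the top ten features are extracted; the fitted estimators catboost_model, lgbm_model, gb_model and the feature frame X_train are assumed names, not the project's actual objects.

```python
import pandas as pd

# CatBoost, LightGBM, and scikit-learn's GradientBoostingClassifier all expose
# per-feature importances through the feature_importances_ attribute.
models = {"Catboost": catboost_model, "LightGBM": lgbm_model, "Gradient Boosting": gb_model}
for name, model in models.items():
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    print(name)
    print(importances.sort_values(ascending=False).head(10))
```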
64 Figure 5.17: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting models for the SA Wireless dataset. Figure 5.18: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telco Europa Dataset. 65 Figure 5.19: Top 10 Feature Importance in Catboost, LightGBM, and Gradient Boosting for Telecom Dataset. 66 Chapter 6 Discussion and Conclusion 6.1 Discussion The results of this study provide a comprehensive understanding of the performance of various machine learning models in predicting customer churn across multiple datasets. By adopting an individual participant data meta-analysis approach, this research has been able to integrate raw data, facilitating a nuanced comparison and validation of predictive models under varying conditions. 6.1.1 Variability in Model Performance The aggregated results reveal significant variability in the performance of machine learning models across different datasets and models. The fixed-effect model’s high heterogeneity (I2 of 68.76% by ML model and 97.55% by dataset) indicates substantial variability not accounted for by this model. This high heterogeneity suggests that the differences in effect sizes are not solely due to sampling error but also due to genuine differences in dataset characteristics or model capabilities. 67 The standard deviations and standard errors across different models and datasets also highlight this variability. For instance, the Employee dataset exhibited the highest standard deviation (0.0885) and standard error (0.0256), indicating significant variability in model performance. This suggests that some datasets are inherently more challenging for churn prediction, likely due to their unique characteristics or data quality issues. 6.1.2 Top Performing Models and Datasets Among the machine learning models, Catboost, Lightgbm, and Gradient Boosting consistently demonstrated high performance, with mean AUCs close to or above 0.90. These models showed relatively narrow confidence intervals, indicating consistent performance across various datasets. Their high performance can be attributed to their sophisticated ensemble techniques, which combine multiple weak learners to create a strong predictive model. These methods are particularly effective in handling complex relationships within the data and mitigating overfitting. Conversely, the Decision Tree model exhibited the lowest mean AUC and the highest variability, suggesting its limited effectiveness in handling diverse datasets compared to more sophisticated ensemble methods. The inherent simplicity of the Decision Tree algorithm, while offering interpretability, also leads to higher susceptibility to overfitting and poor generalization on unseen data. In terms of datasets, the Internet and Credit Card datasets achieved the highest mean AUCs, around 0.95, with low standard deviations and standard errors. This consistency indicates that the models performed well on these datasets, likely due to their inherent characteristics that align well with the models’ capabilities. Factors such as well-defined features, balanced target variables, and higher data quality might contribute to the superior performance observed on these datasets. On the other hand, the Membership dataset posed the most significant challenge, with the lowest mean AUC and higher variability, indicating that the models struggled 68 to effectively predict churn in this dataset. 
This could be due to several reasons, such as the presence of more noise, fewer relevant features, or a higher degree of class imbalance that complicates the learning process. 6.1.3 Fixed-Effect vs. Random-Effect Models The comparison between fixed-effect and random-effect models provides further insights. The fixed-effect model, which assumes a common effect size across studies, showed high precision but failed to account for the observed heterogeneity adequately. This model is useful when the studies are assumed to be functionally identical and the only variability is due to within-study sampling error. However, in real-world applications, this assumption is often unrealistic due to genuine differences in datasets and study conditions. In contrast, the random-effect model, which incorporates between-study variability, offered slightly lower mean AUC estimates with wider confidence intervals but provided a more realistic assessment by acknowledging the variability across datasets and models. This model is more appropriate when the datasets are not identical and there is a need to generalize the findings beyond the included studies. The random-effect model’s higher pooled effect size estimate for the dataset analysis (0.86 [0.81, 0.91]) compared to the fixed-effect model (0.85 [0.84, 0.86]) underscores the importance of considering heterogeneity in meta-analyses. This approach offers a more generalizable understanding of model performance, essential for practical applications in varied real-world scenarios. 6.1.4 Discussion on Funnel Plots Analysis The analysis of funnel plots presented in this study underscores several key insights into the behavior of machine learning models and the impact of dataset characteristics on model performance. The findings reveal both the utility of funnel plots in detecting 69 biases and their role in identifying models with optimal performance characteristics. The observed slight asymmetry in the funnel plots for both fixed-effect and randomeffect models by machine learning model suggests a minor bias toward models demonstrating high performance. This asymmetry, albeit small, could indicate a propensity for these models to perform better under specific conditions or with particular types of data, which might not generalize across broader applications. The consistent positioning of models like Catboost and LightGBM at the peak of these plots highlights their ability to achieve high accuracy while maintaining a balance between bias and variance, an essential feature for models intended for practical deployment. Conversely, the variability shown by models such as Decision Tree and KNeighbors could be symptomatic of their lower tolerance to changes in data characteristics or their susceptibility to overfitting. This aspect of model performance is crucial for developers to consider when choosing models for real-world applications, where data can vary significantly from the conditions under which models are trained. The symmetric distribution of points in the funnel plots by dataset, particularly in the fixed-effect model, confirms a balanced and unbiased evaluation of datasets. The high performance noted for datasets like Telco Europa and Telecom suggests that these datasets are well-suited to developing and testing machine learning models, likely due to their quality and representative nature. 
In contrast, the lower performance and higher variability associated with the Membership dataset highlight the challenges that can arise from datasets with specific characteristics or limitations. This observation stresses the importance of careful dataset selection and preparation in achieving reliable model performance. The application of funnel plots extends beyond traditional publication bias detection to a methodological tool for evaluating and selecting machine learning models based on their performance stability and generalizability. By identifying models that not only perform well but also show consistent results across different tests, practitioners can better select algorithms for deployment in varied settings. This approach also offers a 70 systematic way to test model robustness before practical application, potentially reducing the risk of deploying models that perform well in a controlled study but fail in real-world scenarios. 6.1.5 Feature Importance The feature importance analysis across various datasets using Catboost, LightGBM, and Gradient Boosting revealed consistent patterns and highlighted specific variables that are critical in predicting customer churn. Demographic Factors: Features such as Age, Tenure, and Membership Term Years appeared consistently across multiple datasets, underscoring the importance of demographic factors in predicting churn. This suggests that understanding the age distribution and customer loyalty can provide valuable insights into customer retention strategies. Financial Indicators: Financial-related features, including Balance, EstimatedSalary, and Annual Fees, were significant across various datasets. These findings highlight the role of financial stability in customer decision-making, where customers with higher financial stability may be less likely to churn. This indicates the importance of considering customers’ financial health when designing retention strategies. Usage Patterns: Usage-related features, such as Total Trans Ct, Total Revolving Bal, and Total Spend in Months 1 and 2 of 2017, were critical indicators of churn. These results suggest that customers who engage more frequently with the service or product are less likely to churn. Monitoring these patterns can help identify at-risk customers and design interventions to improve customer engagement and retention. Contractual and Behavioral Factors: Features related to contracts and customer behavior changes, such as Contract Month-to-Month and changes in service usage (e.g., changer and changem), were vital in predicting churn. These factors highlight the importance of flexible contract terms and monitoring customer behavior to reduce churn 71 rates. Companies should consider offering flexible contract options and closely monitoring customer behavior to preemptively address potential churn. Incentives and Support: Features like CashbackAmount and Retention Calls (retcalls) pointed to the significance of customer incentives and support in retention strategies. This suggests that companies should implement more personalized incentives and enhance customer support to improve retention. By focusing on these factors, businesses can better align their offerings with customer needs and preferences, thereby reducing churn. Overall, this analysis provides valuable insights into the factors driving customer churn across different industries and datasets. 
It also demonstrates the robustness of the top-performing models (Catboost, LightGBM, and Gradient Boosting) in identifying these critical features. These findings can be used to refine customer retention strategies, focusing on the key drivers of churn identified in this study. 6.1.6 Implications for Practice The insights gained from this meta-analysis have several practical implications. Firstly, the consistent high performance of ensemble methods like Catboost, Lightgbm, and Gradient Boosting suggests that these models should be prioritized in customer churn prediction tasks. Their ability to handle complex data structures and mitigate overfitting makes them well-suited for diverse datasets encountered in real-world applications. Secondly, the significant variability in model performance across different datasets highlights the need for dataset-specific strategies. For instance, datasets like Membership, which pose more challenges, may benefit from additional preprocessing steps such as feature engineering, advanced imputation techniques for missing values, and more sophisticated methods for handling class imbalance. Furthermore, the results underscore the importance of using random-effect models 72 in meta-analyses involving heterogeneous datasets. By accounting for between-study variability, these models provide a more accurate and generalizable estimate of effect sizes, which is crucial for informing practical decisions and ensuring the robustness of predictive models in diverse settings. 6.1.7 Contributions of This Research This research makes several significant contributions to the field of machine learning and customer churn prediction: • Methodological Advancement: By employing an IPD-MA approach, this study integrates raw data from multiple datasets, allowing for a more nuanced and robust comparison of machine learning models. This methodological advancement can be applied to other domains where meta-analysis of machine learning performance is required. • Handling Heterogeneity: The research demonstrates the importance of accounting for heterogeneity in meta-analyses by comparing fixed-effect and random-effect models. This contribution is crucial for developing more reliable and generalizable predictive models in varied real-world scenarios. • Practical Insights: The findings offer practical insights for practitioners in selecting and optimizing machine learning models for customer churn prediction. The detailed analysis of model performance across different datasets provides a valuable reference for tackling similar prediction tasks in industry. • Future Research Directions: The study identifies key areas for future research, such as exploring additional datasets, refining models, and incorporating more advanced machine learning techniques. These directions can guide further advancements in the field and improve predictive accuracy and generalizability. 73 6.1.8 Limitations and Future Research While this study provides valuable insights, it also has several limitations. The reliance on publicly available datasets means that the findings may not fully generalize to proprietary datasets with different characteristics. Additionally, the study focused on a limited set of machine learning models and did not explore the full spectrum of possible algorithms and their hyperparameters. Future research should aim to include a broader range of datasets, especially those from different industries and contexts, to validate the findings further. 
Additionally, exploring more advanced and hybrid machine learning models, as well as automated machine learning techniques, could yield even better performance and more robust insights. 6.2 Conclusion This study has demonstrated the effectiveness of the IPD-MA approach in evaluating the performance of machine learning models for predicting customer churn across multiple datasets. The findings highlight the importance of considering dataset characteristics and model capabilities in such analyses. Key takeaways from this research include: • High-Performing Models: Catboost, Lightgbm, and Gradient Boosting emerged as the top-performing models, consistently achieving high AUCs across various datasets. These models should be prioritized for churn prediction tasks in future applications. • Dataset Challenges: The variability in model performance across datasets underscores the need for tailored approaches. The Internet and Credit Card datasets demonstrated high model performance, whereas the Membership dataset posed significant challenges, highlighting the necessity for dataset-specific strategies. 74 • Meta-Analysis Models: The comparison between fixed-effect and random-effect models emphasized the importance of accounting for heterogeneity in meta-analyses. The random-effect model provided a more comprehensive understanding of model performance by incorporating between-study variability. Overall, this research contributes to the field by providing a robust methodological framework and actionable insights for improving churn prediction models. Future work should focus on exploring additional datasets and refining models to further enhance predictive accuracy and generalizability. By leveraging the findings of this study, practitioners can make informed decisions in selecting and optimizing machine learning models for effective customer churn prediction. 75 Bibliography [1] L. Geiler, S. Affeldt, and M. Nadif, “A survey on machine learning methods for churn prediction,” International Journal of Data Science and Analytics, vol. 14, no. 3, pp. 217–242, 2022. [2] P. Lalwani, M. K. Mishra, J. S. Chadha, and P. Sethi, “Customer churn prediction system: A machine learning approach,” Computing, pp. 1–24, 2022. [3] A. K. Ahmad, A. Jafar, and K. Aljoumaa, “Customer churn prediction in telecom using machine learning in big data platform,” Journal of Big Data, vol. 6, no. 1, pp. 1–24, 2019. [4] I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam, and S. W. Kim, “A churn prediction model using random forest: Analysis of machine learning techniques for churn prediction and factor identification in telecom sector,” IEEE access, vol. 7, pp. 60 134–60 149, 2019. [5] S. A. Qureshi, A. S. Rehman, A. M. Qamar, A. Kamal, and A. Rehman, “Telecommunication subscribers’ churn prediction model using machine learning,” in Eighth international conference on digital information management (ICDIM 2013), IEEE, 2013, pp. 131–136. [6] D. S. Sisodia, S. Vishwakarma, and A. Pujahari, “Evaluation of machine learning models for employee churn prediction,” in 2017 international conference on inventive computing and informatics (icici), IEEE, 2017, pp. 1016–1020. 76 [7] M. Rahman and V. Kumar, “Machine learning based customer churn prediction in banking,” in 2020 4th international conference on electronics, communication and aerospace technology (ICECA), IEEE, 2020, pp. 1196–1201. [8] D. L. Garcı́a, À. Nebot, and A. 
Vellido, “Intelligent data analysis approaches to churn as a business problem: A survey,” Knowledge and Information Systems, vol. 51, no. 3, pp. 719–774, 2017. [9] D. Witten and G. James, An introduction to statistical learning with applications in R. springer publication, 2013. [10] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. ” O’Reilly Media, Inc.”, 2022. [11] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Informatics in Medicine Unlocked, vol. 16, p. 100 203, 2019. [12] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, pp. 123–140, 1996. [13] P. Bühlmann and B. Yu, “Analyzing bagging,” The annals of Statistics, vol. 30, no. 4, pp. 927–961, 2002. [14] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001. [15] Y. Freund, R. E. Schapire, et al., “Experiments with a new boosting algorithm,” in icml, Citeseer, vol. 96, 1996, pp. 148–156. [16] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of machine learning research, vol. 4, no. Nov, pp. 933–969, 2003. [17] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001. [18] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors),” The annals of statistics, vol. 28, no. 2, pp. 337–407, 2000. 77 [19] B. Sun, S. Chen, J. Wang, and H. Chen, “A robust multi-class adaboost algorithm for mislabeled noisy data,” Knowledge-Based Systems, vol. 102, pp. 87–102, 2016. [20] N. Asbai and A. Amrouche, “Boosting scores fusion approach using front-end diversity and adaboost algorithm, for speaker verification,” Computers & Electrical Engineering, vol. 62, pp. 648–662, 2017. [21] J. Haumahu, S. Permana, and Y. Yaddarabullah, “Fake news classification for indonesian news using extreme gradient boosting (xgboost),” in IOP Conference Series: Materials Science and Engineering, IOP Publishing, vol. 1098, 2021, p. 052 081. [22] S. Džeroski and B. Ženko, “Is combining classifiers with stacking better than selecting the best one?” Machine learning, vol. 54, pp. 255–273, 2004. [23] Y. Xiong, M. Ye, and C. Wu, “Cancer classification with a cost-sensitive naive bayes stacking ensemble,” Computational and Mathematical Methods in Medicine, vol. 2021, 2021. [24] A. Elnagar, R. Al-Debsi, and O. Einea, “Arabic text classification using deep learning models,” Information Processing & Management, vol. 57, no. 1, p. 102 121, 2020. [25] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [26] S. Bryan, “Weighting confusion matrices by outcomes and observations,” in 2020. [27] G. V. Glass, “Primary, secondary, and meta-analysis of research,” Educational researcher, vol. 5, no. 10, pp. 3–8, 1976. [28] M. Harrer, P. Cuijpers, T. Furukawa, and D. Ebert, Doing meta-analysis with R: A hands-on guide. Chapman and Hall/CRC, 2021. [29] R. D. Riley, P. C. Lambert, and G. Abo-Zaid, “Meta-analysis of individual participant data: Rationale, conduct, and reporting,” Bmj, vol. 340, 2010. [30] T. D. Pigott and T. D. Pigott, “Including individual participant data in metaanalysis,” Advances in Meta-Analysis, pp. 109–131, 2012. 78 [31] S. H. Ballew, Y. Mok, and K. 
Matsushita, “A practical guide to interpret individual participant data meta-analysis of observational studies,” American Journal of Kidney Diseases, vol. 78, no. 3, pp. 464–467, 2021. [32] J. Chandler, M. Cumpston, T. Li, M. J. Page, and V. Welch, “Cochrane handbook for systematic reviews of interventions,” Hoboken: Wiley, 2019. [33] A. Shabtai, Y. Elovici, L. Rokach, A. Shabtai, Y. Elovici, and L. Rokach, Data leakage detection/prevention solutions. Springer, 2012. [34] V. R. Joseph and A. Vakayil, “Split: An optimal method for data splitting,” Technometrics, vol. 64, no. 2, pp. 166–176, 2022. [35] E. Alshdaifat, D. Alshdaifat, A. Alsarhan, F. Hussein, and S. M. F. S. El-Salhi, “The effect of preprocessing techniques, applied to numeric features, on classification algorithms’ performance,” Data, vol. 6, no. 2, p. 11, 2021. [36] C. Seger, An investigation of categorical variable encoding techniques in machine learning: Binary versus one-hot and feature hashing, 2018. [37] R. DerSimonian and N. Laird, “Meta-analysis in clinical trials,” Controlled clinical trials, vol. 7, no. 3, pp. 177–188, 1986. 79 Appendix A Source Codes The data and code for this project is stored in this GitHub Repository. 80