Bob Gaglardi School of Business and Economics, Thompson Rivers University

MBA Project

Rezoning ICBC Driver Performance to Reflect Localized Risk: A Machine Learning Algorithm

Sakshay Kapur
Supervisor: Dr. Mohammad Mahbobi
August 2025

Table of Contents
Abstract
Introduction
Research Questions
Literature Review
Methodology
Data
Conclusion
Future Scope and Limitations
References
Appendix

List of Tables
Table-1: Descriptive Statistics of Estimated Premiums by Alternative Distribution
Table-2: Premium Descriptive Statistics by Types of Datasets
Table-3: Model Performance Metrics by Premium Type, RMSE
Table-4: Model Performance Metrics by Premium Type, MAE
Table-5: Model Performance Metrics by Premium Type, R²
Table-6: Model Performance Metrics by Premium Type, 90% QAE
Table-7: Model Performance Metrics by Premium Type, AUC
Table-8: Model Performance Metrics by Premium Type, Accuracy
Table-9: Model Performance Metrics by Premium Type, Precision
Table-10: Model Performance Metrics by Premium Type, Recall
Table-11: Model Performance Metrics by Premium Type, F1 Score
Table-12: Threshold values for each ML model
Table-13: Factor impact on Sensitivity Analysis

List of Figures
Figure-1: ICBC Zoning Premium Data Generation Process
Figure-2: Comparison of Estimated Premium Distributions
Figure-3: RMSE by Regression Model and Dataset
Figure-4: R² by Regression Model and Dataset
Figure-5: MAE by Regression Model and Premium Type Dataset
Figure-6: 90% QAE by Regression Model and Dataset
Figure-7: AUC by Regression Model and Premium Type Dataset
Figure-8: Accuracy by Regression Model and Premium Type Database
Figure-9: Precision by Regression Model and Premium Type Database
Figure-10: Recall by Regression Model and Premium Type Database
Figure-11: F1-Score by Regression Model and Premium Type Database
Figure-12: Sensitivity Analysis Visualization
Figure-13: Elbow Method and Silhouette Analysis
Figure-14: BC ICBC Territories by Dominant KMeans_4 Cluster
Figure-15: K-Distance Plot (Sample = 100000)
Figure-16: DBSCAN Noise Territory Cluster (-1)
Figure-17: Territory-Level K-Distance Plot
Figure-18: DBSCAN Updated Territory Cluster
Figure-19: BC ICBC Territories by DBSCAN
Figure-20: Hierarchical Clustering Dendrogram
Figure-21: BC ICBC Territories by Hierarchical Clusters

Abstract
The current insurance premium zoning system in British Columbia, managed by the Insurance Corporation of British Columbia (ICBC), is based on fixed geographic boundaries that often fail to reflect actual crash risk patterns. This static approach can group together regions with significantly different risk profiles, leading to cross-subsidization and misaligned premium structures. This study proposes a data-driven alternative that leverages detailed territory-level crash data to improve both fairness and accuracy in premium setting. Using a comprehensive dataset encompassing crash severity, frequency, and contextual variables, four premium generation methods (structural, bimodal, normal, and uniform) were tested. To uncover regional risk patterns, clustering algorithms (K-Means, DBSCAN, and Hierarchical Clustering) were applied and compared against ICBC's existing territorial zones to identify spatial inconsistencies. In parallel, supervised machine learning models (Random Forest, XGBoost, LightGBM, and CatBoost) were used to predict premiums and evaluated using standard performance metrics. The results show that clustering reliably identifies territorial misalignments, and that only the structurally generated data preserves realistic pricing relationships.
These findings suggest that a machine learning driven rezoning framework can enhance actuarial precision, promote equity, and support more transparent premium allocation across British Columbia.

Introduction
British Columbia's road safety landscape is complex and varied, ranging from the urban core of Metro Vancouver to the remote highways of the North. Setting fair and accurate auto insurance premiums is essential for maintaining public confidence, financial sustainability, and regulatory compliance. In British Columbia, ICBC currently uses a territory-based zoning system to adjust premiums by geographic area. These zones are defined by municipal and regional boundaries, but they may not always reflect actual driving risk as indicated by crash statistics. As urban development and traffic patterns evolve, these static zones risk becoming misaligned with real-world crash data. When areas with differing risk profiles are grouped together, it can lead to cross-subsidization, where drivers in lower-risk regions pay more to offset higher-risk areas. This can reduce the fairness of the pricing system and weaken incentives for road safety improvements.

To help ICBC better understand and price that risk, this study analyzed ICBC's 14 territories using publicly available data. The focus was on zone definitions, premium levels, and underlying factors such as accident counts, violation tickets, vehicle types and features, and regional characteristics. The process began with raw crash reports and concluded with mapped territory clusters and premium prediction models. The project had two main goals: first, to examine how premium levels are currently determined and how they relate to performance indicators, and second, to explore potential re-zoning strategies that could better align insurance pricing with actual crash risk.

An insurance portfolio groups policyholders by shared risk characteristics and sets premiums accordingly. If everyone paid the same rate, low-risk drivers would defect to cheaper alternatives, leaving the insurer with disproportionately high-risk customers. To avoid this, insurers assign risk factors and classify policyholders into tiers, where each tier's premium reflects its aggregate risk (Henckaerts et al., 2018). This stratification underpins our premium classification approach.

This study looks at ways to improve ICBC's current zoning system by using a more data-informed approach to insurance pricing. It uses a detailed crash dataset to simulate different premium outcomes. Four types of Estimated Premiums were calculated: Normal, Structural, Bimodal, and Uniform, each based on different assumptions about how crash risks are spread across regions. These premiums combined base rates, crash severity factors, territory-level risk indicators, and fixed surcharges, and were summarized by territory to show how they vary across the province. The goal was to see if alternative methods could highlight regional differences that might be overlooked under the current zoning setup. To compare the results, the study used bar charts, density plots, and sensitivity tests, including minor changes to key inputs (±10%) to check how stable the premium estimates were. Five machine learning models, namely Linear Regression, Random Forest, XGBoost, LightGBM, and CatBoost, were tested to see how well territory-level data could predict premiums. These models offer a way to measure performance and support the broader aim of creating a fairer and more accurate insurance pricing system.
Since risk does not always follow municipal boundaries, the study also used clustering methods such as K-Means, DBSCAN, and Hierarchical Clustering to find areas that stand out from their surroundings. These techniques helped identify naturally occurring groups of higher or lower risk, which could be useful for thinking about future zoning changes. Research Questions 5 • How do performance parameters such as crash type, number of drivers, and infractions relate to insurance premium zoning? • Can machine learning algorithms be effectively used to analyze and potentially rezone insurance premiums based on risk characteristics? The structure of this paper is as follows. It begins by examining the limitations of ICBC’s current territory-based zoning system for auto insurance pricing. It touches on the motivation behind to explore whether a more data informed approach could lead to fairer and more accurate premium assessments across British Columbia. The next section presents the core dataset used in the study, a detailed collection of crash records enriched with regional indicators. Using this data, four versions of Estimated Premiums: Normal, Structural, Bimodal, and Uniform were generated. Each version reflects a different assumption about how crash risk is distributed and incorporates base rates, severity multipliers, territory level risk factors, and fixed surcharges. These premiums are then aggregated by territory to reveal spatial variation. Following premium simulations, the report evaluates whether these alternative methods expose regional differences that may be hidden under the current zoning system. To support this analysis, visual tools such as bar charts and kernel density plots are used, along with sensitivity testing that applies ±10% changes to key inputs to assess the robustness of the premium estimates. The report explores the predictive power of territory level features using five supervised machine learning models: Linear Regression, Random Forest, XGBoost, LightGBM, and CatBoost. This helps to estimate premiums and supports the broader goal of building a more regionally responsive and actuarially sound pricing system. In the final analytical section, the report then shifts focus to spatial clustering. Recognizing that crash risk may not align neatly with municipal boundaries, it applies three clustering algorithms such as K-Means, DBSCAN, and Hierarchical Clustering to identify areas that behave differently from their neighbors. These techniques help uncover organically emerging high and low risk clusters, offering insights into potential rezoning opportunities. The report concludes by summarizing key findings and discussing how data driven insights could inform future policy decisions around insurance zoning and premium setting. It emphasizes the potential for improved fairness, transparency, and alignment between pricing and actual crash risk. 6 Literature Review Early developments in insurance analytics have predominantly relied on statistical and time series methodologies to model claim behavior and inform premium structures. Generalized Linear Models (GLMs) have been widely adopted to relate claim frequency to driver characteristics such as age and driving experience, with the objective of balancing profitability and market competitiveness (David, 2015). Generalized Additive Models (GAMs) extend the capabilities of GLMs by incorporating non-linear smooth functions, thereby capturing more intricate risk patterns in pricing models. 
For analyzing temporal trends, Autoregressive Integrated Moving Average (ARIMA) models have been employed to examine claim volumes and seasonal fluctuations in historical data, supporting actuarial adjustments to premium rates over time. Artificial Neural Networks (ANNs) represent a more flexible modeling approach, capable of capturing high dimensional interactions among risk factors. These models have demonstrated improved accuracy in predicting pure premiums and have outperformed traditional techniques such as ARIMA in claim-amount forecasting tasks (Selvakumar et al., 2021). Despite these methodological advancements, several critical gaps remain in literature. One major limitation is the insufficient integration of individual performance indicators such as traffic violations, accident counts, number of drivers per vehicle, and driver age into the geospatial zoning frameworks that underpin premium setting. Henckaerts et al. (2018) underscore the challenges of incorporating continuous and spatial risk factors into pricing models but do not explicitly address how personal and behavioral metrics can inform territory-based rate classifications. Similarly, Eriksen and Jones (1972) acknowledge the complexities of fair premium setting in regulated markets, yet their work does not extend to the spatial mapping of performance parameters. Furthermore, current research often overlooks the issue of bias and transparency in machine learning models. Although complex models like ANNs and hybrid ARIMA-ANN frameworks offer strong predictive capabilities, their lack of interpretability complicates insurers’ ability to justify pricing decisions, raising concerns about fairness and regulatory compliance. Another notable gap lies in the limited use of longitudinal data to assess evolving risk profiles over time. Most studies rely on cross-sectional datasets, which fail to capture the dynamic nature of 7 individual risk across policy lifecycles. Incorporating time series analysis into predictive frameworks could significantly enhance the accuracy of indemnity forecasts and premium adjustments. Additionally, while Selvakumar et al. (2021) explored a range of models including linear regression, exponential smoothing, ARIMA, and ANN, they did not investigate more recent deep learning architectures such as convolutional neural networks (CNNs), transformers, or ensemble methods like stacking, boosting, and bagging, which may offer further improvements in predictive performance. The geographic scope of existing research also remains narrow. Many studies, such as Selvakumar et al. (2021), concentrate on individual national markets (e.g., India) and do not incorporate cross-country comparisons that reflect varying regulatory and economic contexts. Contemporary literature further builds on these foundations by applying machine learning techniques directly to pure premium modeling. Kumar et al. (2024) conducted a comparative evaluation of four algorithms of GLM, AGLM, XGBoost, and neural networks, using a French automotive liability dataset. Through 5-fold cross validation, they assessed model performance with metrics such as MAE, RMSE, and decile chart analysis. Their findings revealed that while traditional GLMs remained competitive, XGBoost and neural networks outperformed in predictive accuracy and segmentation of risk across quantiles. 
However, they caution that complex models like neural networks may sacrifice interpretability, which is crucial for regulatory transparency and operational deployment (Kumar et al., 2024). Feature selection, a crucial step in model development, has also seen advancements through sensitivity-based methods. Saranya and Pravin (2021) introduced a variance-based sensitivity analysis (VSA) technique that identifies optimal feature subsets for disease classification tasks. Their approach demonstrated superior accuracy and sensitivity compared to wrapper-based selection methods. Although applied in the healthcare domain, this method offers transferable potential to insurance pricing, particularly in identifying the most impactful predictors in large scale datasets (Saranya & Pravin, 2021). In this study, feature selection serves a practical function during the data preparation phase, following the approach outlined by Saranya and Pravin (2021). As crash statistics are aggregated 8 by territory, variables such as crash frequency, severity, percentage of fatal incidents, and average driver age are emphasized as primary predictors for risk profiling. Meanwhile, variables deemed less impactful or potentially redundant are filtered out through a combination of statistical analysis and domain expertise. This target selection helps reduce noise in the clustering and classification stages, leading to more efficient computation and enhancing the clarity and consistency of the resulting territorial risk groupings. In parallel, the detection of fraudulent claims remains a priority for insurers seeking to reduce financial loss and administrative overhead. Kowshalya and Nandhini (2018) addressed this issue using synthetic datasets and implemented Naïve Bayes, J48, and Random Forest classifiers. Their results showed that Random Forest excelled in classifying fraudulent claims with an accuracy exceeding 99%, while Naïve Bayes performed best in estimating premium percentage classifications. Their work highlights the importance of preprocessing, attribute selection, and algorithm choice in building reliable fraud detection systems, which are foundational for ensuring fair pricing and operational integrity (Kowshalya & Nandhini, 2018). Taken together, these studies underscore a growing shift toward data-driven, machine learning based insurance modeling. Yet, challenges related to data transparency, model interpretability, integration of behavioral metrics, and longitudinal risk tracking remain areas for future exploration. Bridging these gaps will be essential to advancing fair, explainable, and adaptive premium zoning frameworks. Methodology In response to the existing limitations in literature, the present study utilizes a substantially larger dataset comprising over one million anonymized ICBC policy and crash records, enabling more sophisticated feature engineering such as territory level risk profiles and severity multipliers and more rigorous model validation. Crash records are assigned to ICBC’s 14 existing territorial zones, and clustering algorithms including K-Means, DBSCAN, and hierarchical clustering are applied to identify spatial patterns (Henckaerts et al., 2018). These clustering results are then evaluated using a suite of regression models to validate and refine zoning strategies. 
By combining large-scale spatial analysis with detailed performance indicators and longitudinal data, this research advances the methodology for premium classification and territorial rezoning, offering a data-driven framework that improves existing actuarial and statistical approaches. Addressing these gaps through large-scale, bias-aware, longitudinal ML models that marry transparency with advanced algorithms can advance the science of premium zoning and support more equitable, data-driven insurance pricing strategies.

Building upon the foundation laid by previous research, this project seeks to address notable gaps by integrating machine learning with geospatial analysis to refine insurance premium zoning. Although earlier studies such as Kumar et al. (2024) have highlighted the predictive capabilities of models like GLMs, XGBoost, and neural networks for estimating pure premiums, few have explored the combined use of territorial clustering and spatial segmentation alongside individual-level risk factors such as crash severity. Additionally, although sensitivity analysis is an important tool for guiding feature selection, as in Saranya and Pravin's (2021) approach, it has seen limited application in spatial premium modeling. To bridge these gaps, this research introduces a data-driven zoning framework that aggregates crash data across geographic regions, calculates frequency and severity metrics, and assigns risk categories using clustering and classification techniques. These clusters are then mapped to existing ICBC territorial boundaries, with premium multipliers applied based on publicly available base rates. The goal is to evaluate the potential of machine learning driven rezoning to produce fairer, more actuarially sound premium structures. The following section details the methodology used to prepare the data, construct risk profiles, perform clustering, and assess pricing outcomes across territorial zones.

The publicly available ICBC crash and contravention datasets were sourced and merged into a single master table comprising 1,048,575 incident records. During this consolidation, each file underwent an iterative process of cleaning, addressing missing values, standardizing field formats, and removing duplicates before being combined. The resulting dataset includes attributes such as Crash Breakdown, Year, Month, Day of Week, Time Category, Crash Severity, Crash Type, Total Crashes, Total Victims, geographic descriptors (Municipality Name, Region, Territory, Municipality Boundary, Street Full Name, Cross Street Full Name, Latitude, Longitude), and binary flags for animal involvement, cyclist involvement, motorcycle involvement, heavy vehicle involvement, pedestrian involvement, parked vehicle involvement, intersection crashes, and midblock crashes.

To prepare the data for premium zoning analysis, each incident was assigned to its corresponding ICBC territory based on municipality and region identifiers. Crash contraventions were matched to a crash territory by checking "Municipality with Boundary", then "Municipality", then "Region"; records where all three fields were empty were labeled "Unknown". The data were then aggregated by territory to calculate total incident counts, as well as subtotals for fatal crashes, casualty crashes, and property-damage-only crashes. Additional aggregations captured the number of crashes involving pedestrians, cyclists, and motorcycles, and those occurring at intersections. A minimal sketch of this assignment and aggregation step is shown below.
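The following Python sketch illustrates the territory fallback and the territory-level aggregation described above. It is illustrative only: the file name and column names (for example, "Pedestrian Involvement" and "Intersection Crash") are simplified assumptions rather than ICBC's actual field layout.

```python
import pandas as pd

def assign_territory(row: pd.Series) -> str:
    """Fall back from the most specific to the least specific geographic field."""
    for col in ["Municipality with Boundary", "Municipality", "Region"]:
        value = row.get(col)
        if pd.notna(value) and str(value).strip():
            return value
    return "Unknown"

# Hypothetical merged master table of crash and contravention records.
crashes = pd.read_csv("icbc_crashes_merged.csv")
crashes["Territory"] = crashes.apply(assign_territory, axis=1)

# Territory-level counts later used for the frequency and severity risk buckets.
territory_summary = crashes.groupby("Territory").agg(
    total_crashes=("Total Crashes", "sum"),
    fatal_crashes=("Crash Severity", lambda s: (s == "Fatal").sum()),
    casualty_crashes=("Crash Severity", lambda s: (s == "Casualty").sum()),
    pedestrian_crashes=("Pedestrian Involvement", "sum"),
    cyclist_crashes=("Cyclist Involvement", "sum"),
    intersection_crashes=("Intersection Crash", "sum"),
)
territory_summary["fatal_rate"] = (
    territory_summary["fatal_crashes"] / territory_summary["total_crashes"]
)
```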
For each territory, the proportions of fatal, casualty, and intersection crashes were computed relative to total crash volumes. These proportions were translated into categorical risk zones (Low, Medium, High, and Very High) using predefined thresholds for both crash frequency and crash severity (fatality rate). Finally, by combining the frequency and severity classifications, an overall risk rating (ranging from Low Risk to Very High Risk) was assigned to each territory, providing a structured framework for subsequent premium zoning applications. In simple words, the procedure calculates crash severity and crash frequency, which together define the overall risk category.

Once the risk profiles were created, they needed to be linked back to territories: the full crash dataset was reloaded, the territory assignment was repeated, and records were grouped by territory to count crashes, fatalities, and pedestrian and cyclist involvement. Those counts were then converted into percentage buckets (e.g., under 2% fatal = "Low Severity") based on preset thresholds, and a lookup table combining the frequency and severity buckets assigned each territory a final "Risk Category." Because poorly chosen thresholds could push many territories into unexpected categories, a few manual random checks were performed before proceeding.

The consolidated crash dataset, now annotated with risk profiles, was first classified by ICBC territory code to enable territory-specific premium calculations. The Base Premium for each record was computed as the product of a Severity Factor and a Risk Factor. Subsequent variables required by ICBC's Basic Insurance Tariff were then derived (ICBC, 2020, p. 4) as:

Premium = \left[ BRP \times \prod_{i=1}^{6} F_i \right] + LP + UDPP + UDAP    (1)

where:
BRP = Base Rate Premium
F1 to F6 are defined as:
- CDF = Combined Driver Factor
- DDF = Disability Discount Factor
- HVVCF = High Value Vehicle Charge Factor
- ASTF = Advanced Safety Technology Factor
- DF = Distance Factor
- TF = Transition Factor
and LP = Learner Premium, UDPP = Unlisted Driver Protection Premium, and UDAP = Unlisted Driver Accident Premium.

To maintain the confidentiality of ICBC's actual pricing data, this study used four data generation methods (Structural, Normal, Bimodal, and Uniform) to simulate premium values. Since real premium data includes sensitive and proprietary information, these datasets were designed to reflect the general structure and behavior of pricing patterns without disclosing any private or regulated details. This allowed for meaningful analysis using clustering and regression techniques while respecting privacy and confidentiality guidelines. By comparing these variants, the study aimed to isolate whether the observed patterns are driven by real structural signals in the data or are merely artifacts of randomness or noise, ensuring that findings are not overly dependent on a specific statistical distribution.

Within each method, a few assumptions were made to keep things consistent. The Heavy Vehicle Crash Factor (HVVCF) was set using a simple flag: heavy vehicles received a multiplier of 2, while non-heavy ones stayed at 1. Since there was not enough data for the Liability Penalty (LP) and Unlisted Driver Premium Penalty (UDPP), both were set to zero. Other adjustment factors, such as the Driver Disqualification Factor (DDF), At Fault Severity Tuning Factor (ASTF), Driver Factor (DF), and Territory Factor (TF), were all defaulted to 1, meaning no extra adjustments were applied. The premium calculation followed a clear step-by-step process, sketched in the code example below.
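The following is a minimal sketch of how equation (1) and the default assumptions above might be composed in code. The base rate, severity, territory risk, and at-fault probability values are hypothetical placeholders, not ICBC tariff inputs.

```python
from math import prod

def estimate_premium(base_rate, severity_factor, territory_risk,
                     factors, lp=0.0, udpp=0.0, udap=0.0):
    """Equation (1): Premium = [BRP * product(F1..F6)] + LP + UDPP + UDAP,
    where BRP is built from the base rate, severity factor, and territory risk factor."""
    brp = base_rate * severity_factor * territory_risk
    return brp * prod(factors) + lp + udpp + udap

factors = [
    1.2,   # CDF: crash-type multiplier (up to about 1.5 for head-on collisions)
    1.0,   # DDF: defaulted to 1
    2.0,   # HVVCF: heavy vehicle -> 2, otherwise 1
    1.0,   # ASTF: defaulted to 1
    1.0,   # DF: defaulted to 1
    1.0,   # TF: defaulted to 1
]
premium = estimate_premium(
    base_rate=1_000.0,        # hypothetical territory base rate
    severity_factor=1.1,      # hypothetical crash severity multiplier
    territory_risk=1.2,       # hypothetical territory risk multiplier
    factors=factors,
    lp=0.0, udpp=0.0,         # set to zero for lack of data
    udap=0.25 * 1_000,        # probability of an unlisted at-fault driver x 1,000
)
print(round(premium, 2))
```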
First, the Base Premium was calculated by multiplying the base rate, severity factor, and territory risk factor. Then came the Crash Damage Factor (CDF), which varied depending on the crash type: head-on collisions could raise it to a multiplier of 1.5, while smaller incidents had lower values. The Unlisted Driver Attribution Penalty (UDAP) was calculated by taking the probability that an unlisted driver caused the crash and multiplying it by 1,000. Equation (1) pulls together crash severity, territory risk, and vehicle-related details to give a well-rounded premium estimate, which then feeds into the machine learning models for zoning analysis.

To make the premium calculations work, the crash data was combined with the risk profile file created earlier and matched with ICBC territory codes, which were simplified into single-letter labels. For each code, the system looked up a base rate, applied a crash severity multiplier, and added a territory risk multiplier. Fixed surcharges, such as those for heavy vehicles, at-fault drivers, and crash types, were added in, and the final premium values were written out. These completed estimates were then ready for use in the next phase of analysis.

Data
The dataset was compiled from publicly available ICBC crash records across British Columbia, covering a five-year span from 2019 to 2023. The 1,048,575 incident records include reported vehicle collisions aggregated at the territory level, aligned with ICBC's existing premium zoning regions. Each entry contains information such as crash severity, intersection involvement, and other territory-level factors relevant to contraventions. The dataset encompasses a wide range of driving environments, from urban areas such as Vancouver, Surrey, Burnaby, Kelowna, and Victoria to more rural or remote regions such as Peace River North, Cariboo, and the Kootenays. Crash incidents were grouped to help identify temporal patterns, account for seasonal variation, and improve the reliability of model estimates. This territorial crash summary serves as the foundation for the estimated premium calculations, clustering, and spatial analysis carried out in the study.

To enable thorough analysis while maintaining the confidentiality of ICBC's actual pricing data, the study uses four premium generation methods: Uniform, Bimodal, Normal, and Structural. Each method was designed to reflect a different statistical pattern. The Uniform and Bimodal distributions provide baseline scenarios to test model behavior under simplified or randomized conditions. The Normal variant introduces a more realistic distribution, capturing typical variation in pricing without replicating proprietary structures. The Structural method, on the other hand, incorporates crash severity, frequency, and territory-level risk factors to more closely resemble how premiums might be calculated in practice. By comparing model and clustering results across all four approaches, the study assesses which methods best capture meaningful pricing relationships and whether the findings hold up under different data assumptions.

Uniform Data Generation
Using a uniform random selection approach for data generation means that every possible value within a defined range has an equal chance of being selected unless specific weights are manually applied. This method ensures a diverse spread of data points, but it does not necessarily mirror real-world distributions.
Instead of relying on complex statistical shapes like a normal curve or patterns drawn from actual datasets, values for each factor are simply chosen at random within a set range. For instance, the Disability Discount Factor (DDF) might fall anywhere between 0.8 and 1.2, while the Safety Tech Discount Factor (ASTF) ranges from 0.9 to 1.1. The Distance Factor (DF) is allowed to vary from 0.5 to 1.5, and the Transition Factor (TF) from 0.9 to 1.1. For the High Value Vehicle Crash Factor (HVVCF), heavy vehicles are assigned a value between 1.8 and 2.2, while non heavy vehicles fall between 0.8 and 1.2. The Crash Damage Factor (CDF) is adjusted by randomly shifting the original value up or down by as much as 10%, and the Unlisted Driver Attribution Penalty (UDAP) follows the same logic, but with a wider margin of up to 20%. This kind of uniform data generation offers several advantages. First, it provides clarity and control, avoiding any hidden patterns that might be embedded in historical data. Second, the ranges are realistic wide enough to reflect meaningful changes, such as a discount increasing or decreasing by 20%. Third, the process is repeatable, i.e., by using the same random seed, the exact same set of values can be regenerated for consistency. Finally, it allows for fair comparisons. Since all four data generation methods uniform, bimodal, normal, and structural follow the same setup but differ in how they select values, it’s possible to compare their outcomes side by side and understand how the choice of distribution influences premium predictions. Structural data generation 14 Structural equation models reflect a causal relationship, not just a statistical correlation. These models are often used in settings where controlled experiments aren’t possible, and they typically involve simultaneous relationships and/or measurement errors. These errors might come from basic inaccuracies in how something is measured, or from the fact that what we can observe doesn’t exactly match the theoretical concepts we care about (Goldberger, 1972). Importantly, the structural parameters in these models are not the same as the coefficients you may get from simply running regressions on observed data though the model still enforces certain limits on those regression coefficients. This creates subtle problems around identification (figuring out the true relationships), which requires sophisticated statistical techniques to handle properly (Goldberger, 1972). Every factor in the pricing model is derived directly from specific fields in the dataset, with no randomness involved. This approach ensures the system remains fully explainable and produces consistent results every time it runs. The Disability Discount Factor (DDF) is fixed at 1.0, meaning disability-related discounts are not considered in this case. The At Fault Safety Tech Factor (ASTF) is calculated from a Tech Score ranging from 0 to 1. This score is converted into a discount using the formula: 1 minus 0.1 times the Tech Score, and then clamped between 0.9 and 1.0, so vehicles with more advanced technology receive a slightly better discount. The Distance Factor (DF) increases by 5% for each additional crash victim, with a maximum cap of 3.0, reflecting the idea that more victims typically lead to higher costs. The Transition Factor (TF) accounts for the year of the crash, adding 0.5% to the rate for every year after 2018 to capture gradual pricing drift over time. 
For vehicle classification, the High Value Vehicle Crash Factor (HVVCF) is set at 2.0 for heavy vehicles and 1.0 for all others, effectively doubling the cost for heavy vehicles. Crash severity influences the Combined Driver Factor (CDF), which is set at 1.0 for low severity, 1.1 for medium, and 1.2 for high severity crashes. The Unlisted Driver Attribution Penalty (UDAP) is calculated by multiplying the probability of the driver being at fault by $1,000, representing the added risk. The Unlisted Driver Premium Penalty (UDPP) is a flat $50 for crashes involving injury, and $0 for property only damage. Lastly, the Learner Penalty (LP) is a fixed $100 charge if the driver is a learner, and nothing if they are not. This rule-based structure makes the model transparent and easy to audit, with each factor clearly tied to a specific input. Thus, it helps to 15 achieve clarity, repeatability, and real-world logic. Any other factor such as victim count or year can also be pointed out and explained exactly why a premium changed. Appendix A.1 provides similar formulation. Bimodal data generation Instead of just picking numbers randomly, this method assumes there are two types of people or situations for each factor like careful drivers vs. riskier ones. So, two groups (called a “bimodal” mix) were mixed with different averages and spreads, which gives the data more shape and makes it feel more real. Each factor reflects patterns observed in the data, offering a more realistic and nuanced view of driver and vehicle behavior. The Disability Discount Factor (DDF) shows that about 70% of drivers are generally safer, with scores around 0.90, while the remaining 30% are riskier, closer to 1.10. This split suggests that most people drive cautiously, though not everyone fits that mold. The At Fault Safety Tech Factor (ASTF) captures a mix of vehicles, some equipped with advanced safety technology that earns them a slight discount, and others without those features, receiving no bonus. The Distance Factor (DF) is evenly split, with half of the crashes occurring in urban areas where trips are shorter, and the other half in rural regions where trips tend to be longer. Crash year adjustments, represented by the Transition Factor (TF), are mostly neutral, but a few cases receive a small bump to account for pricing drift over time. For the High Value Vehicle Crash Factor (HVVCF), both heavy and non-heavy vehicles are distributed across slightly higher and lower cost ranges, reflecting variability in how vehicle type influences premiums. The Crash Damage Factor (CDF) follows expected patterns for about 70% of cases, based on crash type, while the remaining 30% are flagged as riskier, suggesting more severe or unusual incidents. Finally, the Unlisted Driver Attribution Penalty (UDAP) shows that most drivers incur a standard surcharge, but around 40% face steeper penalties, likely due to higher probabilities of being at fault. Together, these distributions paint a more human picture of risk, one that’s not perfectly uniform but grounded in realistic variation. Because real people aren’t all the same. There’s usually one main group and a smaller different group. This way of generating numbers captures that. It’s more realistic than just random guessing, but still more flexible than using set formulas. It helps models 16 learn patterns while still reflecting real-world variety and formulae can be referred to Appendix A.2. 
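To make these generation rules more concrete, the sketch below illustrates how the rule-based structural factors and a two-component bimodal draw might be implemented, under stated assumptions. The crash field names (e.g., "Tech Score", "Heavy Vehicle Flag"), the mixture standard deviations, and the random seed are illustrative assumptions rather than the exact specification used in this study; the full formulations are given in Appendices A.1 and A.2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for repeatable draws

def structural_factors(crash: pd.Series) -> dict:
    """Rule-based factors derived directly from crash fields, with no randomness."""
    return {
        "DDF": 1.0,                                                   # disability discount not applied
        "ASTF": float(np.clip(1 - 0.1 * crash["Tech Score"], 0.9, 1.0)),
        "DF": min(1 + 0.05 * crash["Total Victims"], 3.0),            # +5% per victim, capped at 3.0
        "TF": 1 + 0.005 * (crash["Year"] - 2018),                     # +0.5% per year after 2018
        "HVVCF": 2.0 if crash["Heavy Vehicle Flag"] else 1.0,
        "CDF": {"Low": 1.0, "Medium": 1.1, "High": 1.2}[crash["Severity Level"]],
        "UDAP": crash["P At Fault"] * 1_000,
        "UDPP": 50.0 if crash["Injury Flag"] else 0.0,
        "LP": 100.0 if crash["Learner Flag"] else 0.0,
    }

def bimodal_sample(n, p_safe=0.7, means=(0.90, 1.10), sds=(0.05, 0.05)):
    """Two-component mixture: a 'safer' majority group and a 'riskier' minority group."""
    safe = rng.random(n) < p_safe
    return np.where(safe,
                    rng.normal(means[0], sds[0], n),
                    rng.normal(means[1], sds[1], n))

# Example: a bimodal Disability Discount Factor with roughly 70% of drivers near 0.90
# and 30% near 1.10, as described above.
ddf_bimodal = bimodal_sample(1_000_000)
```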
Normal Distribution This modeling approach relies on the principle that most values tend to cluster around a normal baseline, with only modest variation. Each factor is centered around 1.0 or its own base value and allowed to fluctuate slightly, following a bell-shaped curve characteristic of the normal distribution. This structure ensures consistency while still capturing realistic differences. For example, the Disability Discount Factor (DDF) hovers around 1.0, with shifts of about ±10% to reflect individual driver differences enough to introduce variation without destabilizing the model. The Safety Tech Discount Factor (ASTF) varies even less, typically within ±5%, since most vehicles today share similar safety technologies. In contrast, the Distance Factor (DF) shows greater variability, up to ±20%, acknowledging that driving habits, especially trip lengths can differ widely across individuals. The Transition Factor (TF) introduces a gentle drift of around ±5%, simulating the passage of time and adding a dynamic element to the model across crash years. For vehicle classification, the HighValue Vehicle Crash Factor (HVVCF) starts at 1.0 for standard vehicles and 2.0 for heavy ones, with a touch of noise added to prevent overly rigid categorization. Crash severity is captured through the Crash Damage Factor (CDF), which is anchored to a baseline tied to crash type and allowed to vary by ±10% to reflect the inherent unpredictability of real-world incidents. Finally, the At-Fault Penalty (UDAP) begins from a fixed base and fluctuates by about 10%, offering a flexible way to represent varying degrees of responsibility in crash scenarios. These controlled variations help the model remain grounded while still allowing for meaningful distinctions across cases. This approach is not only intuitive but also computationally efficient when the design matrix is well-balanced. As Healy and Westmacott (1956) noted, exact least squares computations are straightforward for special design matrices that incorporate balance and orthogonality. However, when data is missing or the design is unbalanced, computations become more complex, often 17 requiring the inversion of large matrices. In such cases, it is natural to fill in missing values with their expected values based on current parameters, then re estimate using a simple least-squares algorithm, iterating until the estimates stabilize. More broadly, it may even be possible to add hypothetical rows to the design matrix rows that never existed in the real world in a way that facilitates least-squares analysis. This idea aligns well with the model’s use of normal distributions to smooth out irregularities and maintain analytical tractability. It makes sense if you imagine each number being shaped by countless subtle influences like driving habits, location, and vehicle type while avoiding extreme outliers by keeping everything within sensible bounds and formulae can be referred to Appendix A.3. Figure-1 illustrates the premium data generation process. Due to the confidentiality issue and lack of access to the ICBC premium data, real crash data are preprocessed and mapped to ICBC territories to estimate base premiums. Factor values are then generated using four distinct distributions as Normal, Bimodal, Uniform, and Structural and combined with fixed loadings to derive the estimated premiums. 
Factor values such as the Disability Discount Factor (DDF), Advanced Safety Tech Factor (ASTF), Distance Factor (DF), Transition Factor (TF), High-Value Vehicle Crash Factor (HVVCF), Crash Damage Factor (CDF), and At Fault Penalty (UDAP) are generated based on the selected data generation methods. The estimated premiums are then computed by applying the generated factors to the base premium and adding fixed loadings for the Learner Premium (LP) and Unlisted Driver Protection Premium (UDPP). The result is a set of estimated premium datasets.

Figure-1: ICBC Zoning Premium Data Generation Process. Source: Authors

Table-1 presents the statistical characteristics of the four premium distributions: Bimodal, Normal, Structural, and Uniform. The Average Premium column shows the original mean values, while the Mean (μ), Standard Deviation (σ), Skewness, and Kurtosis offer deeper insight into each distribution's structure. Mean values are consistent with earlier averages. The Normal and Uniform distributions display the greatest variability (σ = 667 and 660), while the Structural distribution is more tightly clustered (σ = 380), reflecting its design to generate realistic pricing. All distributions show positive skewness, indicating a tendency toward higher premium outliers, with the Normal distribution being the most skewed (1.33). Kurtosis values reveal that the Normal distribution has the sharpest peak and heaviest tails (3.79), whereas the Structural distribution is the flattest (0.22), making it the most stable. These metrics collectively illustrate the unique behavior of each dataset.

Table-1: Descriptive Statistics of Estimated Premiums by Alternative Distribution

Distribution | Average Premium ($) | Mean (μ) ($) | Standard Deviation (σ) | Skewness | Kurtosis
Bimodal      | 2,134.28 | 2,128.43 | 615.52 | 0.73 | 0.72
Normal       | 2,128.15 | 2,127.44 | 667.36 | 1.33 | 3.79
Structural   | 1,888.56 | 1,886.55 | 380.59 | 0.74 | 0.22
Uniform      | 1,975.75 | 1,969.88 | 660.08 | 0.79 | 0.76

Source: Authors' calculations

Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) is a widely used nonparametric technique for estimating the probability density function (PDF) of a continuous variable based on a finite sample. Unlike histogram-based methods, KDE provides a smooth and continuous density curve by assigning a kernel function to each data point and aggregating their contributions across the domain. This approach avoids the abrupt bin edges of histograms and offers greater flexibility in revealing the underlying distributional structure of the data. At its core, KDE evaluates the density at a point by summing the influence of all sample observations x_i, each weighted by a kernel function centered at that observation. The general form is given by equation (1):

f(x) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Kernel}(x, x_i)    (1)

Here, n is the number of data points, and the kernel function defines how each data point contributes to the estimate at location x. The kernel function must satisfy the condition 0 ≤ Kernel(x, x_i) < ∞, and all values of x and x_i must be within a bounded range. In practice, symmetric kernel functions are most commonly used. A symmetric kernel function can be written as equation (2):

\mathrm{Kernel}(x, x_i) = \frac{1}{h} \, K\!\left(\frac{x - x_i}{h}\right)    (2)

where K is the kernel shape (e.g., Gaussian, uniform, triangular), and h is the bandwidth (or smoothing parameter). Substituting this into the original formula gives the standard symmetric KDE form as equation (3):

f(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)    (3)

The bandwidth h plays a critical role in shaping the final estimate.
A smaller bandwidth results in a highly detailed curve with sharp local variations, while a larger bandwidth produces a smoother and more generalized density function that may obscure important local features. Bandwidth selection is therefore crucial and can be determined through methods such as cross-validation or rule-of-thumb estimators like Silverman's rule (Węglarczyk, 2018). Overall, KDE offers a powerful, flexible approach to analyzing the shape of empirical data distributions, making it particularly useful in domains such as insurance, where underlying risk patterns may not follow standard parametric forms.

KDE was plotted to help visualize how premium values are distributed. Each distribution is shown separately, making it easy to compare their shapes. Figure-2 shows four smooth curves, each representing a different premium distribution, helping to understand how premiums vary under different scenarios.

Figure-2: Comparison of Estimated Premium Distributions. Source: Authors' calculations

Similarly, all four distribution datasets were again leveraged: each row was tagged with its distribution name (Bimodal, etc.), the single Estimated Premium column in each dataset was located, and its values were copied into a common Estimated Premium column so that all four could be merged into one DataFrame. Each premium can be adjusted by increasing the crash-type factor (CDF) by 10% to assess how sensitive the premium estimates are to changes in that parameter, as in Figure-2. This adjustment is applied using the ratio (CDF × 1.10) / CDF, which effectively scales the premium based on the increased severity factor. It is important to note that this formula assumes the CDF value is never zero. If CDF were zero, the calculation would result in a division-by-zero error or produce undefined values (NaNs), which could disrupt the analysis. Therefore, ensuring that CDF is always positive and properly defined is essential for maintaining the integrity of the sensitivity testing. The data is grouped by distribution type, which helps compare the different premium distributions.

Premium Values
The key statistics for the Final Premium values have been calculated to provide a comprehensive overview of the data distribution. These include the count, which represents the total number of records; the mean, which indicates the average premium; and the standard deviation, which reflects how much the premium values deviate from the average. Table-2 below further illustrates the spread of the data by presenting the minimum and maximum premium values, along with the 25th, 50th (median), and 75th percentiles. These percentile values offer insight into how the premiums are distributed across the dataset.

Table-2: Premium Descriptive Statistics by Types of Datasets

Distribution | Count   | Mean     | Std    | Min      | 25%      | 50%      | 75%      | Max
Bimodal      | 710,210 | 2,134.27 | 624.92 | 667.72   | 1,676.59 | 2,055.63 | 2,507.66 | 7,064.41
Normal       | 731,342 | 2,128.15 | 667.28 | 611.04   | 1,664.42 | 2,028.34 | 2,470.99 | 9,544.02
Structural   | 804,741 | 1,888.55 | 384.60 | 1,126.39 | 1,628.33 | 1,820.99 | 2,060.41 | 4,604.17
Uniform      | 709,817 | 1,975.75 | 669.59 | 532.70   | 1,481.90 | 1,879.88 | 2,372.43 | 7,530.30

Source: Authors' calculations

After generating four datasets using the bimodal, normal, uniform, and structural approaches, the four labeled sets were merged into a single comprehensive dataset.

Machine Learning Models
Machine learning models were subsequently trained and evaluated using the combined dataset, with their performance assessed through Root Mean Squared Error (RMSE), MAE, R², and 90% QAE metrics.
To explore a range of predictive capabilities, several algorithms were implemented, including Linear Regression, Random Forest, XGBoost, LightGBM, and CatBoost. These models were selected to provide a diverse comparison between traditional and ensemble-based approaches. Among popular machine learning algorithms, Random Forest stands out for its ensemble approach, where multiple decision trees are built using bootstrapped samples and predictions are made by averaging their outputs. Each tree considers a random subset of features at each split, and growth continues until a predefined stopping criterion is met, such as a limit on splits or node size (Zimmerman et al., 2018). Building on the idea of iterative improvement, XGBoost takes a different route by constructing boosted trees sequentially, with each new tree learning from the errors of the previous ones. Its use of regularization and focus on scalability make it a widely adopted choice for large-scale tasks (Karakilic, Hatas, & Pacal, 2025). In a similar vein, LightGBM, introduced by Microsoft in 2017, offers its own innovations, including techniques like gradient-based one-sided sampling and exclusive feature bundling, which help speed up training while preserving accuracy (Ke et al., 2017). Lastly, CatBoost brings unique strengths to the table, especially in handling categorical variables and high-dimensional data. Its consistent performance across diverse applications, from precipitation forecasting to biomass estimation, has made it a reliable option in many domains (Luo et al., 2021; Qian et al., 2021; Samat et al., 2022).

Finally, the compiled dataset was subjected to each model, and results were compared based on RMSE, MAE, R², and 90% QAE to identify the most accurate predictive approach. Pandas and numpy were imported for core data wrangling, matplotlib for visualization, and key components from scikit-learn for splitting the dataset and evaluating performance. For modeling, a solid lineup was used: Linear Regression, Random Forest, XGBoost, LightGBM, and CatBoost, the latter three of which needed to be installed in the base environment. The merged dataset was then loaded. The goal was to work exclusively with complete records to maintain modeling integrity, while recognizing that over-filtering for completeness can introduce bias or shrink the sample size and undermine statistical power.

For the feature set, things were kept clean by taking all numeric columns and then excluding the premium columns from that list; what remains forms the input features. This step matters because it is easy to accidentally include numeric IDs or indices, which could mislead the model if not handled properly. The five models were then instantiated, each mapped to a short name and given a fixed random seed for reproducibility; default hyperparameters were used so the workflow runs quickly. The following steps run a consistent workflow using the same seed for fairness:
- Split the data 80/20 into training and testing sets.
- Fit each model on the training portion.
- Predict on the testing set.
- Calculate RMSE, MAE, R², and 90% QAE (Quantile Absolute Error at the 90th percentile).
- Store the results (model name, target column, RMSE, MAE, R², and 90% QAE).

Results were shaped into a DataFrame and then pivoted for cleaner comparison. As shown in Tables 3, 4, 5, and 6, models form the rows, the different target generation methods form the columns, and the metric values fill the cells. A condensed sketch of this workflow is shown below.
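The workflow just described can be condensed into a short script. The sketch below is illustrative rather than the exact code used in this study: the file name, the Distribution and Estimated Premium column names, and the reliance on default hyperparameters are assumptions consistent with the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

SEED = 42
df = pd.read_csv("merged_premium_datasets.csv").dropna()   # hypothetical merged file

# Numeric columns excluding the premium targets form the input features.
target = "Estimated Premium"
features = [c for c in df.select_dtypes("number").columns if "Premium" not in c]

models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(random_state=SEED),
    "XGB": XGBRegressor(random_state=SEED),
    "LGBM": LGBMRegressor(random_state=SEED),
    "CAT": CatBoostRegressor(random_state=SEED, verbose=0),
}

results = []
for dist, group in df.groupby("Distribution"):             # Bimodal, Normal, Structural, Uniform
    X_train, X_test, y_train, y_test = train_test_split(
        group[features], group[target], test_size=0.2, random_state=SEED
    )
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        abs_err = np.abs(y_test.to_numpy() - pred)
        results.append({
            "model": name,
            "dataset": dist,
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "MAE": float(mean_absolute_error(y_test, pred)),
            "R2": float(r2_score(y_test, pred)),
            "QAE90": float(np.quantile(abs_err, 0.90)),     # 90th percentile of absolute errors
        })

# Pivot so that models are rows and premium-type datasets are columns, as in Tables 3-6.
summary = pd.DataFrame(results).pivot(index="model", columns="dataset", values="RMSE")
print(summary.round(2))
```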
Tables 3, 4, 5, and 6, which describe the model performance metrics by premium type, present a comparison of five machine learning models (CatBoost, LightGBM, Linear Regression, Random Forest, and XGBoost) evaluated across four premium generation methods: Bimodal, Normal, Structural, and Uniform. The models were assessed using four key metrics. The first is the Root Mean Squared Error (RMSE):

RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 }    (4)

where \hat{y}_i is the predicted value and y_i is the actual value, for a sample of n observations y_i (i = 1, 2, ..., n) and n corresponding model predictions \hat{y}_i. The RMSE, as shown in equation (4), has been used as a standard statistical metric to measure model performance in many areas, including meteorology, air quality, and climate research studies (Hodson, 2022). The second accuracy measurement is the Mean Absolute Error (MAE):

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|    (5)

MAE is a useful measure widely used in model evaluation, again defined for a sample of n observations y_i and n corresponding model predictions \hat{y}_i (Hodson, 2022). In this study, the Coefficient of Determination (R²) was also used to measure the goodness of fit of the estimated models:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (6)

where \bar{y} is the mean of the actual values, and R² represents the proportion of variance in the dependent variable explained by the model. This metric is commonly employed in evaluating regression models and is noted for offering greater insight than many alternative measures (Chicco et al., 2021). The next accuracy measurement is the 90th Percentile Absolute Error (90% QAE):

90\% \ QAE = Q_{0.90}\!\left( \left| y_i - \hat{y}_i \right| \right)    (7)

where Q_{0.90} represents the 90th percentile of absolute errors. This metric is a quantile-based dispersion measure, conceptually similar to the Quantile Absolute Deviation (QAD) framework (Akinshin, 2022).

While the machine learning models in this study produce continuous outputs in the form of predicted pure premiums, classification-based metrics were also employed to evaluate their effectiveness in identifying high-risk territories. This dual evaluation framework allows for a comprehensive assessment of both numerical prediction accuracy, using RMSE, MAE, and R², and categorical classification accuracy, which is critical for premium zoning and decision-making. To facilitate classification analysis, continuous predictions were converted into discrete risk categories for multi-class classification, evaluated with AUC, Accuracy, Precision, Recall, and F1 Score. Territories were divided into Low-, Medium-, and High-risk groups by segmenting the actual premium distribution into equal quantile ranges. Predicted premiums were discretized using the same quantile-based boundaries to ensure consistent categorization.

The Receiver Operating Characteristic (ROC) curve is a way to visualize the performance of a binary classifier at various threshold settings. The ROC curve plots the True Positive Rate (also known as Sensitivity or Recall) against the False Positive Rate. The AUC is the area under the ROC curve; it provides a single numerical value that indicates the overall ability to distinguish between the different classes. While AUC = 1 indicates a perfect classifier, AUC = 0.5 shows that the classifier performs no better than random guessing. Other performance measures were also calculated. Accuracy, the most basic metric, represents the proportion of correct predictions out of all predictions and is calculated as (True Positives + True Negatives) / (Total Predictions):
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (9)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. Precision measures the accuracy of positive predictions: it indicates the proportion of correctly predicted positive instances out of all instances predicted as positive, calculated as True Positives / (True Positives + False Positives), as in equation (10) below:

Precision = \frac{TP}{TP + FP}    (10)

Sokolova and Lapalme (2009) describe Recall as the average ability of a classifier to correctly identify class labels, measured separately for each class. In simple words, Recall measures the ability of the model to identify all relevant positive instances: the proportion of correctly predicted positive instances out of all actual positive instances, calculated as True Positives / (True Positives + False Negatives):

Recall = \frac{TP}{TP + FN}    (11)

The last accuracy measurement is the F1-score, as in equation (12) below:

F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}    (12)

As Sharma (2023) describes, the F1 score is the harmonic mean of precision and recall, providing a balanced measure that considers both false positives and false negatives. It is particularly useful for imbalanced datasets where high precision or recall alone might be misleading. Together, these metrics offer a balanced view of prediction accuracy, model fit, and sensitivity to larger errors.

Among the four premium types, the Structural dataset consistently leads to the strongest model performance. It yields lower RMSE and MAE values and significantly higher R² scores across all models, indicating a clearer and more learnable structure. CatBoost performs best under this dataset, achieving an RMSE of 158.31 CAD, an MAE of 131.38 CAD, and an R² of 0.85, suggesting it explains 85% of the variance in premium predictions. Comparable results from XGBoost and LightGBM further reinforce the effectiveness of structure-preserving data. On the other hand, models trained on the Bimodal and Normal distributions show weaker performance, with higher error rates and R² values typically ranging from 0.20 to 0.24. These results imply that these distributions lack the underlying structure needed for accurate prediction. While Random Forest performs reasonably well on Structural data, it tends to lag slightly behind the gradient boosting models overall. The 90% QAE metric adds another layer of insight by highlighting worst-case prediction errors. For example, under the Structural dataset, 90% of CatBoost's absolute errors fall below 248.54 CAD, whereas in the Uniform dataset, this threshold increases sharply to 987.81 CAD, indicating greater volatility and less reliability.

The results suggest that Structural data offers meaningful patterns that all models, especially tree-based ones like CatBoost and XGBoost, can effectively learn. Gradient boosting models consistently outperform Linear Regression and show greater robustness across different data types. Additionally, the 90% QAE metric proves valuable in assessing prediction reliability in higher-risk scenarios, which is particularly relevant for insurance pricing applications.
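To illustrate the classification side of the evaluation, the sketch below converts continuous predictions into Low, Medium, and High risk classes using training-set quantiles (the MC_q1 and MC_q2 thresholds discussed with Table-12) and computes the metrics defined in equations (9) to (12). The variable names (y_train, y_test, pred) carry over from the earlier modeling sketch for a single model and dataset, and the one-vs-rest AUC shown here is a simplified stand-in computed from hard class labels rather than predicted probabilities; both are assumptions rather than the study's exact implementation.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.preprocessing import label_binarize

def to_risk_classes(values, q1, q2):
    """Map continuous premiums to 0 = Low, 1 = Medium, 2 = High using two thresholds."""
    return np.digitize(values, bins=[q1, q2])

# Thresholds (MC_q1, MC_q2) come from the training premiums only.
q1, q2 = np.quantile(y_train, [1 / 3, 2 / 3])
y_true_cls = to_risk_classes(y_test, q1, q2)
y_pred_cls = to_risk_classes(pred, q1, q2)

metrics = {
    "Accuracy": accuracy_score(y_true_cls, y_pred_cls),
    "Precision": precision_score(y_true_cls, y_pred_cls, average="macro"),
    "Recall": recall_score(y_true_cls, y_pred_cls, average="macro"),
    "F1": f1_score(y_true_cls, y_pred_cls, average="macro"),
    # One-vs-rest AUC, treating each risk class as its own binary problem.
    "AUC": roc_auc_score(
        label_binarize(y_true_cls, classes=[0, 1, 2]),
        label_binarize(y_pred_cls, classes=[0, 1, 2]),
        average="macro",
    ),
}
print({name: round(score, 3) for name, score in metrics.items()})
```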
Table-3: Model Performance Metrics by Premium Type, RMSE

Premium Type | CatBoost | LightGBM | Linear Regression | Random Forest | XGBoost
Bimodal | 552.08 | 551.86 | 573.27 | 598.71 | 552.61
Normal | 625.02 | 625.27 | 644.85 | 672.48 | 625.39
Structural | 158.31 | 159.64 | 233.07 | 168.28 | 158.46
Uniform | 620.76 | 620.82 | 637.18 | 672.94 | 621.68

Source: Authors' calculations

Table-4: Model Performance Metrics by Premium Type, MAE

Premium Type | CatBoost | LightGBM | Linear Regression | Random Forest | XGBoost
Bimodal | 438.75 | 438.70 | 453.86 | 472.18 | 439.04
Normal | 456.32 | 456.31 | 471.80 | 495.79 | 456.72
Structural | 131.38 | 132.72 | 184.66 | 137.39 | 131.44
Uniform | 496.23 | 496.38 | 507.58 | 533.58 | 496.90

Source: Authors' calculations

Table-5: Model Performance Metrics by Premium Type, R2

Premium Type | CatBoost | LightGBM | Linear Regression | Random Forest | XGBoost
Bimodal | 0.24 | 0.24 | 0.18 | 0.11 | 0.24
Normal | 0.20 | 0.20 | 0.15 | 0.08 | 0.20
Structural | 0.85 | 0.84 | 0.67 | 0.83 | 0.85
Uniform | 0.18 | 0.18 | 0.14 | 0.04 | 0.18

Note. Values represent accuracy scores across different data distributions.
Source: Authors' calculations

Table-6: Model Performance Metrics by Premium Type, 90% QAE

Premium Type | CatBoost | LightGBM | Linear Regression | Random Forest | XGBoost
Bimodal | 883.10 | 881.92 | 914.82 | 963.86 | 884.00
Normal | 916.33 | 914.72 | 945.50 | 1021.55 | 917.75
Structural | 248.54 | 249.30 | 358.36 | 272.26 | 249.06
Uniform | 987.81 | 988.63 | 1010.06 | 1083.75 | 990.07

Note. Values represent accuracy scores across different data distributions.
Source: Authors' calculations

Table-7: Model Performance Metrics by Premium Type, AUC
Source: Authors' calculations

Table-8: Model Performance Metrics by Premium Type, Accuracy
Source: Authors' calculations

Table-9: Model Performance Metrics by Premium Type, Precision
Source: Authors' calculations

Table-10: Model Performance Metrics by Premium Type, Recall
Source: Authors' calculations

Table-11: Model Performance Metrics by Premium Type, F1 Score
Source: Authors' calculations

Table-5 presents a comparison of model performance across the four dataset types: Structural, Bimodal, Normal, and Uniform. Models trained on Structural data consistently delivered the highest predictive accuracy, with R² values ranging from approximately 0.67 to 0.85. In contrast, models using the Bimodal and Normal datasets showed significantly weaker performance, with R² values between 0.11 and 0.24. Uniform data produced results that were nearly random, with R² values as low as 0.04. These outcomes underscore a key insight: only the Structural dataset effectively captures the underlying pricing logic required for actuarially sound predictions. The other synthetic distributions lack the complexity and realism needed to support meaningful premium modeling.

Tables 6, 7, 8, 9, 10, and 11 reveal clear variations in model performance across the four data generation methods. On the Structural dataset, all models delivered significantly stronger results, with AUC scores ranging from 0.87 to 0.92. LightGBM and XGBoost slightly outperformed the others, each achieving AUCs above 0.68 and balanced precision, recall, and F1-scores near 0.64, indicating their superior ability to distinguish between low-, medium-, and high-risk categories when the data contains strong structural patterns. Linear Regression also performed well, posting the highest recall (0.8016), though its F1-score was marginally lower than those of the tree-based models. In contrast, the Bimodal and Normal datasets offered only moderate classification power, with AUC values falling within the 0.68–0.90 range.
Again, LightGBM and XGBoost led the pack, consistently outperforming Random Forest and Linear Regression with slightly better precision and recall. The top models recorded F1-scores around 0.79–0.80, reflecting a modest but reliable ability to manage false positives and false negatives. The Uniform dataset posed the greatest challenge: AUC values declined further to 0.68–0.69, precision and recall deteriorated across all models, and F1-scores dropped below 0.42, with Linear Regression performing the worst (F1 = 0.346). These results underscore the difficulty of classifying risk when the data lacks inherent structure. Model performance is therefore highly sensitive to the nature of the data distribution. Structural datasets consistently produced the strongest results, enabling LightGBM and XGBoost to excel in risk classification. Meanwhile, the Bimodal and Normal datasets allowed for only moderate separation, and the Uniform dataset resulted in the weakest performance across all metrics. Overall, gradient boosting methods demonstrated greater robustness than Random Forest and Linear Regression, but their effectiveness was closely tied to the presence of meaningful structure in the data.

Table-12: Threshold values for each ML model
Source: Authors' calculations

MC_q1 and MC_q2 in Table-12 represent the two threshold values used to divide predicted premium amounts into three distinct risk categories (low, medium, and high) for multi-class evaluation. These thresholds are derived exclusively from the training data to avoid any influence from the test set. By default, MC_q1 corresponds to the value below which one-third of the training premiums fall, while MC_q2 marks the point below which two-thirds of the premiums fall. Predictions below MC_q1 are classified as low risk, those between MC_q1 and MC_q2 as medium risk, and those above MC_q2 as high risk. This categorization enables the conversion of continuous premium predictions into discrete classes, allowing the evaluation of a model's ability to classify territories by risk level in addition to predicting precise premium values.
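As an illustration of this thresholding step, the sketch below derives MC_q1 and MC_q2 from a set of training premiums, bins both actual and predicted test premiums with the same cut points, and computes the multi-class metrics. All arrays and variable names are hypothetical placeholders assumed for the example, not taken from the study's code.

```python
# Minimal sketch of the Table-12 thresholds: MC_q1 and MC_q2 are the 1/3 and 2/3
# quantiles of the training premiums; test predictions are binned with the same cuts.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def to_risk_class(values, q1, q2):
    """Return 0 = Low, 1 = Medium, 2 = High risk."""
    return np.digitize(values, bins=[q1, q2])

rng = np.random.default_rng(0)
train_premiums = rng.gamma(2.0, 600.0, 5_000)           # placeholder training premiums
y_test = rng.gamma(2.0, 600.0, 1_000)                   # placeholder actual test premiums
y_pred = y_test + rng.normal(0.0, 200.0, y_test.shape)  # placeholder model predictions

mc_q1, mc_q2 = np.quantile(train_premiums, [1 / 3, 2 / 3])  # thresholds from training data only

true_cls = to_risk_class(y_test, mc_q1, mc_q2)
pred_cls = to_risk_class(y_pred, mc_q1, mc_q2)

acc = accuracy_score(true_cls, pred_cls)
prec, rec, f1, _ = precision_recall_fscore_support(true_cls, pred_cls, average="macro")
print(f"Accuracy={acc:.3f}  Precision={prec:.3f}  Recall={rec:.3f}  F1={f1:.3f}")
```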
Figure-3 below presents two bar plots, one for RMSE and one for R². Bar charts convey the comparison quickly: shorter bars in the RMSE plot mean lower error (better), and taller bars in the R² plot mean a stronger fit.

Figure-3: RMSE by Regression Model and Dataset
Source: Authors' calculations

A lower Root Mean Squared Error (RMSE) indicates predictions closer to actual premiums. Across all five algorithms (CatBoost, LightGBM, Linear Regression, Random Forest, and XGBoost), the Structural dataset consistently yielded the lowest RMSE values, ranging from 150 to 230. This demonstrates that structural data generation provides a clear, learnable signal. In contrast, the remaining three datasets produced substantially higher errors (RMSE between 550 and 720), rendering them unsuitable for precise premium estimation. Among these weaker options, the Bimodal dataset achieved the best performance (approximately 580–630), followed by the Normal distribution dataset (approximately 600–660), with the Uniform dataset performing worst (approximately 650–720). These results indicate that only the Structural dataset supports meaningful predictive modeling, while the other approaches fail to deliver sufficiently accurate signals for premium prediction.

Figure-4: R2 by Regression Model and Dataset
Source: Authors' calculations

Regression model explanatory power was assessed using the coefficient of determination (R²), as in Figure-4, where higher values indicate a greater proportion of premium variance captured. Across all five algorithms (CatBoost, LightGBM, Linear Regression, Random Forest, and XGBoost), the Structural dataset consistently achieved R² values between 0.85 and 0.88, explaining roughly 85–88% of the variance in premiums. By contrast, the remaining datasets exhibited minimal explanatory capacity. For Bimodal data, the models achieved R² values between 0.20 and 0.25, meaning they were able to explain only about one fifth of the variation in the premiums. Predictions on Normal and Uniform data were similarly limited, with R² values hovering around 0.15 to 0.22. Notably, when using Random Forest on Uniform data, the R² dropped below zero, signaling that the model performed worse than simply predicting the average premium for all cases, a clear indication of poor predictive power in that scenario. These results confirm that only the Structural data encodes the true pricing relationships, whereas the Bimodal, Normal, and Uniform approaches fail to provide a meaningful signal for the models to learn from.

Figure-5: MAE by Regression Model and Premium Type Dataset
Source: Authors' calculations

Figure-5 presents the Mean Absolute Error (MAE) for the five machine learning models (CatBoost, LightGBM, Linear Regression, Random Forest, and XGBoost) evaluated across the four premium distributions (Bimodal, Normal, Structural, and Uniform). Results show that the Structural dataset generally leads to lower MAE values across all models, suggesting that it contains more consistent patterns that models can learn effectively. For instance, CatBoost and XGBoost report MAEs of around 130–135 CAD for Structural premiums, notably lower than those observed for the other distributions. In comparison, the Bimodal, Normal, and Uniform datasets tend to produce higher MAEs, typically in the range of 440 to 500 CAD, indicating that these distributions may be more difficult to model accurately. Overall, tree-based boosting models such as CatBoost, XGBoost, and LightGBM perform better than Linear Regression and Random Forest across most premium types, with particularly strong results on the Structural dataset. Random Forest shows relatively higher MAE even on Structural data, suggesting it may be less well suited to this specific modeling task. The chart highlights how structured input data can improve model performance and how MAE serves as a useful metric for comparing predictive accuracy across different modeling approaches.

Figure-6: 90% QAE by Regression Model and Dataset
Source: Authors' calculations

Figure-6 illustrates the 90th Percentile of Absolute Error (90% QAE) for the five machine learning models across the four premium generation methods. The y-axis represents the error threshold in Canadian dollars (CAD) below which 90% of each model's absolute prediction errors fall. The Structural dataset consistently results in the lowest 90% QAE values across all models, suggesting that its underlying structure supports more reliable predictions, even in higher-error scenarios. CatBoost and XGBoost maintain 90% QAE values below 250 CAD for Structural premiums, lower than those observed for the other distributions.
In contrast, the Bimodal, Normal, and Uniform datasets tend to produce much higher 90% QAE values, typically ranging from 850 to 1,000 CAD, indicating that models face greater difficulty managing outlier errors in these less structured datasets. Among the models, Random Forest shows the highest 90% QAE values, especially on Uniform and Normal data, where errors exceed 1,000 CAD. This contributes to increased volatility and reduced reliability in tail cases. Gradient boosting models such as CatBoost, XGBoost, and LightGBM generally perform better than Linear Regression and Random Forest in limiting large errors, particularly when trained on structured data. Overall, the 90% QAE metric provides valuable insight into model robustness in settings where large prediction errors can carry significant financial implications. The results suggest that structured data supports more stable predictions and that tree-based ensemble methods, especially CatBoost and XGBoost, are well suited to managing risk in high-error scenarios.

Figure-7: AUC by Regression Model and Premium Type Dataset
Source: Authors' calculations

The AUC results clearly demonstrate that the Structural dataset consistently enabled the highest class separability across all models, with scores exceeding 0.84 and peaking at 0.883 for XGBoost. In contrast, the Normal, Bimodal, and Uniform datasets yielded significantly lower AUCs, clustering between 0.63 and 0.68. These findings suggest that when data possesses strong inherent structure, models can effectively differentiate between risk categories; in less structured distributions, their discriminative power diminishes considerably.

Figure-8: Accuracy by Regression Model and Premium Type Database
Source: Authors' calculations

Accuracy trends mirrored the AUC results. The Structural dataset again delivered superior performance, with all models achieving accuracies above 0.72 and LightGBM reaching a high of 0.741. Conversely, accuracy dropped to approximately 0.43–0.48 for the Normal, Uniform, and Bimodal datasets, indicating limited classification success under weaker data signals. Tree-based models slightly outperformed Linear Regression in these more challenging scenarios, though the margins were modest.

Figure-9: Precision by Regression Model and Premium Type Database
Source: Authors' calculations

Precision scores further reinforce the advantage of structured data. The Structural dataset enabled the highest precision, ranging from 0.748 with Random Forest to 0.784 with Linear Regression. This means that under structured conditions, models were usually correct when flagging high-risk territories. For the other datasets, precision hovered around 0.52–0.55 for the boosting models and dipped to roughly 0.49 for Random Forest, reflecting a higher incidence of false positives.

Figure-10: Recall by Regression Model and Premium Type Database
Source: Authors' calculations

Recall was also strongest with the Structural dataset, reaching 0.740 for both LightGBM and XGBoost. These models successfully identified most true high-risk cases. In contrast, recall dropped to roughly 0.43–0.47 for the Normal, Bimodal, and Uniform datasets, indicating that many true positives were missed when the data lacked structure.

Figure-11: F1-Score by Regression Model and Premium Type Database
Source: Authors' calculations

The F1-score, which harmonizes precision and recall, offers the most comprehensive view of model performance.
The Structural dataset again led the way, with scores between 0.72 and 0.74, reflecting a strong balance between detecting true risks and minimizing false alarms. On the other datasets, F1-scores fell sharply to the 0.41–0.48 range, with Linear Regression performing worst on Uniform data (0.346). These results highlight the critical role of data structure in achieving reliable predictive outcomes.

Sensitivity Analysis

After generating premium datasets using the four distribution methods (Bimodal, Normal, Uniform, and Structural), a sensitivity analysis was carried out to examine how changes in input parameters affect model predictions. The main goal was to understand how small adjustments, particularly to the Combined Driver Factor (CDF), influence final premium estimates. To reflect real-world variability, selected rating factors were modified by ±10%, allowing the study to observe how the model responds to controlled changes. This approach is consistent with practices in optimization research, where slight shifts in cost coefficients help assess model stability (Andersen et al., 2025). As noted by Zheng et al. (2025), sensitivity analysis is a useful tool for identifying which inputs have the greatest impact and for evaluating the reliability of pricing models.

In addition to these perturbation tests, correlation analysis was used to explore linear relationships among features across the four datasets. Numeric variables were filtered and combined into a single DataFrame with distribution labels, and a heatmap created with Seaborn helped visualize these relationships, showing which features tend to move together and offering insight into possible multicollinearity or structural differences. While correlation analysis does not capture non-linear effects or sensitivity directly, it provided helpful context before conducting more targeted perturbation tests.

Overall, sensitivity analysis supports model validation and helps improve transparency in decision making, especially in areas like insurance pricing. By adjusting key parameters, analysts can identify points where the model's behavior changes noticeably, revealing both stable and sensitive regions. Combining sensitivity testing with exploratory tools like correlation mapping offers a more complete view of model performance and reliability under varying conditions.

Figure-12: Sensitivity Analysis Visualization
Source: Authors' calculations

The influence of each component on the overall premium was quantified by examining its coefficient in the premium formula. The coefficients were then ranked to determine which factors exerted the greatest upward or downward pressure on cost.
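The following minimal sketch illustrates the ±10% perturbation described above, applied to the Combined Driver Factor (CDF). The premium function and factor values are simplified, hypothetical placeholders standing in for the study's actual rating structure; the point is only to show how the change in the estimated premium can be recorded for each shock.

```python
# Minimal sensitivity sketch: perturb the CDF by +/-10% and record the premium change.
# The multiplicative premium function and the baseline factor values are illustrative only.
import pandas as pd

def estimated_premium(base_rate, cdf, ddf, tf, astf, udap, udpp):
    """Toy premium: base rate scaled by rating factors, plus flat unlisted-driver add-ons."""
    return base_rate * cdf * ddf * tf * astf + udap + udpp

baseline = dict(base_rate=1100.0, cdf=1.15, ddf=0.90, tf=1.05,
                astf=0.98, udap=120.0, udpp=35.0)
base_premium = estimated_premium(**baseline)

rows = []
for shock in (-0.10, 0.0, 0.10):                        # +/-10% perturbation of the CDF
    params = dict(baseline, cdf=baseline["cdf"] * (1 + shock))
    premium = estimated_premium(**params)
    rows.append({"cdf_shock": shock,
                 "premium": round(premium, 2),
                 "pct_change_vs_baseline": round(premium / base_premium - 1, 4)})

print(pd.DataFrame(rows))
```

A similar loop over the other rating factors (DDF, TF, ASTF, UDAP, UDPP) provides the kind of factor-by-factor comparison that underlies the ranking summarized in Table-13 below, while a Seaborn correlation heatmap of the numeric features supplies the complementary exploratory view described above.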
Table-13: Factor impact on Sensitivity Analysis

Factor | Coefficient | Impact Description
UDAP (Unlisted Driver Accident Premium) | +0.65 | Largest positive effect: adding accident coverage for unlisted drivers raises premiums
Base Rate Premium | +0.46 | Reflects base cost based on vehicle type and location
DDF (Driver Discount Factor) | –0.63 | Largest discount: safe driving or low mileage significantly reduces premiums
CDF (Combined Driver Factor) | +0.38 | Moderate increase: linked to recent violations, tickets, or claims
TF (Transition Factor) | +0.25 | Smaller impact: captures rate increases in high-traffic or busy areas
ASTF (Advanced Safety Tech Factor) | +0.08 | Minor effect: extra safety gear has limited pricing influence
UDPP (Unlisted Driver Protection Premium) | +0.08 | Minor effect: basic add-ons offer minimal premium changes

Source: Authors' calculations

Table-13 reveals a clear hierarchy in terms of pricing sensitivity, with some factors exerting strong upward or downward pressure on premiums, while others have only marginal effects. The Unlisted Driver Accident Premium (UDAP) emerged as the most influential factor, with a coefficient of +0.65. This indicates that adding accident coverage for drivers not listed on the policy significantly increases the premium; it reflects the insurer's heightened risk exposure when covering individuals whose driving history may be unknown or unverified. Next in line is the Base Rate Premium, which showed a coefficient of +0.46. This component captures the foundational cost of coverage, determined primarily by vehicle type and geographic location, and serves as the starting point for premium calculations before additional risk factors are applied.

On the discount side, the Driver Discount Factor (DDF) stood out with a substantial negative coefficient of –0.63. This factor rewards safe driving behavior and low-mileage usage, offering significant reductions in premium for qualifying drivers and underscoring the insurer's incentive to promote responsible driving habits. The Combined Driver Factor (CDF) contributed a moderate increase to premiums, with a coefficient of +0.38. This factor is closely tied to driver history, particularly recent violations, tickets, or claims, and reflects the elevated risk associated with drivers who have demonstrated problematic behavior on the road. The Transition Factor (TF) had a smaller but still notable impact, with a coefficient of +0.25; it captures rate adjustments in high-traffic or densely populated areas, where accident frequency and claim severity tend to be higher. Finally, both the Advanced Safety Tech Factor (ASTF) and the Unlisted Driver Protection Premium (UDPP) showed minimal influence on pricing, each with coefficients around +0.08. These results suggest that while safety enhancements and basic add-ons are considered in pricing, their effect on the overall premium is relatively minor.

The premium structure is therefore most sensitive to unlisted driver coverage (UDAP) and base rate settings, while safe driving behavior offers the most substantial discounts. Other factors, such as traffic density and driver history, play a moderate role, and technological or protection add-ons contribute only marginally to pricing adjustments.

Clustering Approach

Clustering techniques, including K-Means, DBSCAN, and hierarchical clustering, were employed to explore the development of new insurance zones.
This analysis aligns with the second research question, which examines how algorithmic methods can be applied to assess and reconfigure premium zones based on underlying risk factors.

K-Means

K-Means is an unsupervised learning algorithm used to uncover natural groupings in data. It works by initially placing k randomly chosen cluster centers (centroids), assigning each data point to the nearest one, and then recalculating the centroids as the mean of their assigned points. This cycle continues iteratively until the centroids stabilize and no longer move (Masruroh et al., 2023).

To ensure spatial contiguity, the resulting cluster labels needed to be merged with geographic polygons so that a graph-based or region-growing method could be applied. The BC Data Catalogue was checked, but it does not appear to provide an ICBC-territory polygon layer for British Columbia. As a workaround, a general BC polygon layer was obtained from the BC Data Catalogue and the cluster labels were merged back into the GeoDataFrame; a territories GeoJSON file (a map of territories) was needed to create the territory clusters file. Once the BC polygon layer was ready, it was re-aggregated into the 14 ICBC territories and merged to obtain an ICBC-territory BC polygon layer.

K-Means was initially applied, followed by elbow and silhouette plots to visually determine an optimal K. The elbow method analyzes how the number of clusters affects the variance captured within a dataset. It runs K-Means clustering across a range of K values, calculating the inertia (or within-cluster sum of squares) for each configuration. The goal is to identify a point, often resembling an "elbow" in the plot, where adding more clusters yields diminishing returns in explained variance (Kaur & Saini, 2022). The elbow method was chosen because it offers a clear, intuitive way to identify the point at which adding more clusters no longer improves model insight. Beyond this threshold, further segmentation only adds complexity, splitting regions into smaller groups without yielding meaningful gains. By stopping at the optimal number of clusters, it helps maintain a streamlined model while still capturing the most relevant regional crash risk patterns. Silhouette analysis evaluates how effectively data points are grouped by measuring two aspects: how similar a point is to others in its assigned cluster, and how dissimilar it is from points in neighboring clusters. For each number of clusters (K), an average silhouette score is computed across all data points to assess clustering quality (Kumar, Rani, Pippal, & Agrawal, 2025).

Figure-13 presents the results of the elbow method and silhouette analysis.

Figure-13: Elbow Method and Silhouette Analysis
Source: Authors' calculations

The elbow plot (top panel) shows, for each k from 2 through 10, the total inertia, i.e., the sum of squared distances of each point to its cluster center. Lower inertia means points are closer to their cluster centers (tighter clusters). As k increases, inertia always drops, but after a certain point the improvement flattens out. A noticeable bend, or elbow, appears around k = 4; beyond this, adding more clusters only marginally reduces inertia, so four clusters capture most of the benefit without over-splitting. The silhouette analysis (bottom plot) computes, for each k, the average silhouette score, a number between –1 and +1 that measures how well each point fits its own cluster versus the nearest other cluster. A higher silhouette score indicates clearer, more distinct clusters. A minimal sketch of this model selection loop is shown below.
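The sketch assumes a numeric feature table (here a small, randomly generated stand-in for the engineered premium/crash features); whether the loop is run on crash-level rows or territory-level aggregates is an implementation choice, and the names used are illustrative.

```python
# Minimal sketch of the K selection step: fit K-Means for k = 2..10 on standardized
# features, recording inertia (elbow criterion) and the average silhouette score.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
features = pd.DataFrame(                                  # placeholder feature table
    rng.normal(size=(200, 4)),
    columns=["premium_normal", "premium_structural", "premium_bimodal", "premium_uniform"])

X = StandardScaler().fit_transform(features)

results = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results.append({"k": k,
                    "inertia": km.inertia_,                       # elbow plot input
                    "silhouette": silhouette_score(X, km.labels_)})
print(pd.DataFrame(results))

# Refit with the chosen k (k = 4 in this study) and keep the labels for mapping.
features["kmeans_4"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```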
The silhouette score is highest at k = 4, clearly indicating that this is where clusters are most internally cohesive and well separated. Guided by the elbow method and silhouette analysis, the K-Means algorithm groups observations into K clusters according to their similarities, assigning each observation to the cluster that best reflects its patterns (Kumar et al., 2025). Since both plots point to k = 4, four clusters appear to be the sweet spot for the data. After selecting K, the model was refitted on the final dataset, the resulting labels were assigned back to the territory index, and these cluster labels were merged with the geographic polygons.

Figure-14: BC ICBC Territories by Dominant KMeans_4 Cluster
Source: Authors' calculations

Figure-14 is a choropleth of BC's 14 ICBC territories, each shaded by its dominant K-Means cluster (k = 4, although only clusters 1 and 2 emerge as the dominant cluster in any territory). It shows the following:

Cluster 1:
  o Territories F, G, and H ended up in this group:
    ▪ F – Squamish / Whistler Area
    ▪ G – Pemberton Area / Hope Area
    ▪ H – Fraser Valley
  o These three coastal/mountain-corridor zones share similar premium-profile characteristics (higher estimated premiums or different crash patterns), which set them apart in the four-cluster K-Means model.

Cluster 2:
  o Every other territory fell into this second cluster:
    ▪ D (Lower Mainland), E (Maple Ridge / Pitt Meadows), L (Thompson / Okanagan), N (Kootenays), P (Cariboo), R (Prince George), S (Northern Coast), V (Peace River Area), W, X, and Y (Vancouver Island regions), and Z (outside BC).

The key insight from this map is that the southern mountain-corridor territories, specifically regions F, G, and H, exhibit characteristic crash metrics that differ noticeably from both the urban Lower Mainland areas (D and E) and the more rural or coastal zones. These distinctions are significant enough to justify grouping F/G/H into their own cluster. In contrast, the remaining regions across the province display broadly similar patterns in average premiums and crash metrics and have therefore been grouped together into Cluster 2. This visualization effectively highlights where the outlier premium-risk profiles are concentrated, while also illustrating that the rest of the province behaves relatively uniformly within the K-Means clustering framework.
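A minimal sketch of the mapping step follows. The GeoJSON path, its territory-code property, and the label column are assumptions made for illustration; in practice the labels would be the dominant KMeans_4 cluster computed per territory in the previous step (here the F/G/H versus rest-of-province split described above is hard-coded as a placeholder).

```python
# Minimal sketch: join per-territory cluster labels to the ICBC territory polygons
# and render a choropleth. File name, 'territory' property, and label values are
# illustrative placeholders.
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

territories = gpd.read_file("icbc_territories.geojson")       # assumed 14-polygon layer

labels = pd.DataFrame({
    "territory": territories["territory"],                    # assumed territory-code column
    "kmeans_4": 2,                                             # placeholder dominant cluster
})
labels.loc[labels["territory"].isin(["F", "G", "H"]), "kmeans_4"] = 1

mapped = territories.merge(labels, on="territory", how="left")
ax = mapped.plot(column="kmeans_4", categorical=True, legend=True,
                 edgecolor="black", figsize=(8, 8))
ax.set_title("BC ICBC Territories by Dominant KMeans_4 Cluster")
ax.set_axis_off()
plt.show()
```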
Density-Based Spatial Clustering of Applications with Noise

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a robust clustering algorithm, originally proposed by Ester, Kriegel, Sander, and Xu (1996), designed to handle large spatial datasets. It operates by identifying regions of high data density and forming clusters accordingly, while separating sparser areas that may contain anomalies. Through this process, dense clusters of typical data are filtered out, leaving behind a focused set of data points more likely to represent irregular or anomalous behavior, ideal for deeper inspection. The DBSCAN algorithm identifies data points based on density within the feature space and plays a central role in the pruning process: by detecting dense regions and isolating noise, it helps filter out irrelevant or misleading anomaly scores, enhancing the precision of anomaly detection (Firdaus & Suryani, 2024).

In this study, a k-distance plot (sample = 100,000) was produced. The process began by randomly sampling up to 100,000 rows to ensure computational efficiency. For each sampled point, the distance to its fifth nearest neighbor was calculated, as shown in Figure-15. These distances were then sorted and plotted, providing a visual representation of their distribution. A dashed line was added at the 95th percentile to serve as a reference point, helping to identify potential outliers or threshold values for clustering or anomaly detection.

Figure-15: K-Distance Plot (Sample = 100000)
Source: Authors' calculations

To determine an appropriate value for ε in the DBSCAN clustering algorithm, the elbow of the k-distance curve (or, alternatively, the 95th percentile line) was used as a guide, as in Figure-15. The curve, generated using k = 5, showed that distances remained relatively flat until around 0.35, after which they began to rise sharply, indicating a natural inflection point. Based on this observation, ε was set to 0.35, which also corresponds to the 95th percentile of the distance distribution. The minimum number of samples required to form a dense region was set to 5.

The next step was to run DBSCAN per ICBC territory, merge the results into the custom ICBC BC GeoJSON, and plot the choropleth. The initial Python script repeatedly crashed, so it was divided into two parts: the first handled DBSCAN, and the second plotted the BC map. Even then, the first part ran into memory limits, since running DBSCAN on millions of crash-level rows is very demanding and, in the end, only a territory-level cluster map was needed. Based on the structure of the data and the clustering objective, the following approach was adopted. First, the dataset was streamed in manageable chunks to compute the mean values of the four premium features for each of ICBC's 14 territories, resulting in a compact summary with one line per territory. Next, DBSCAN was applied to these 14 mean vectors using the previously determined parameters. This clustering step assigned each territory to a distinct group based on its average premium characteristics. Finally, the resulting clusters were mapped onto the BC GeoJSON file, following the same geographic visualization method used in earlier steps. With this approach, the script never holds all 3.5 million records in memory at once; it processes roughly 500,000 rows at a time, reduces them to 14 territory-level rows, and then runs DBSCAN almost instantly.

On running this, however, it turned out that using ε = 0.35 (picked from the crash-level k-distance plot) on the 14 territory-mean vectors is far too small: every territory came back as "noise" (label –1), as shown in Figure-16. This is because the scale and spacing of the 14 aggregated points are very different from the 3 million+ crash-level points sampled earlier.

Figure-16: DBSCAN Noise Territory Cluster (-1)
Source: Authors' calculations

Therefore, ε needed to be chosen for the territory means themselves. To do that, the following steps were taken:
1. Compute a k-distance plot on the 14×4 table of (normal/structural/bimodal/uniform) means.
2. Choose ε at the "elbow" of that small plot (or at a high percentile).
3. Rerun DBSCAN with the new ε and see which territories group together.

After rerunning this procedure, a new territory-level k-distance plot was produced, as shown in Figure-17 below.

Figure-17: Territory-Level K-Distance Plot
Source: Authors' calculations

The "elbow" in the territory-level plot appears right around the jump from roughly 1.1 to 1.5 (the large jump at the 8th–9th territory). Picking ε = 1.5 therefore seemed reasonable (with min_samples = 5 as before), and DBSCAN was rerun on the 14 mean vectors. A minimal sketch of this territory-level procedure is given below.
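The sketch below illustrates this two-stage, territory-level procedure under stated assumptions: a crash-level CSV (here called crash_level_premiums.csv) with a territory column and four premium columns is streamed in 500,000-row chunks, reduced to 14 mean vectors, inspected with a k-distance curve, and clustered with ε = 1.5 and min_samples = 5. The file and column names are hypothetical.

```python
# Minimal sketch: stream the crash-level file in chunks, reduce to territory means,
# inspect the territory-level k-distance curve, then run DBSCAN with eps = 1.5.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

premium_cols = ["premium_normal", "premium_structural",
                "premium_bimodal", "premium_uniform"]           # assumed column names

# Chunked aggregation: never holds the full multi-million-row file in memory at once.
agg = None
for chunk in pd.read_csv("crash_level_premiums.csv", chunksize=500_000):
    part = chunk.groupby("territory")[premium_cols].agg(["sum", "count"])
    agg = part if agg is None else agg.add(part, fill_value=0)

territory_means = agg.xs("sum", axis=1, level=1) / agg.xs("count", axis=1, level=1)

X = StandardScaler().fit_transform(territory_means)

# Territory-level k-distance curve (k = 5; the query point itself counts as one neighbour).
nn = NearestNeighbors(n_neighbors=5).fit(X)
k_dist = np.sort(nn.kneighbors(X)[0][:, -1])
print("Sorted 5-NN distances:", np.round(k_dist, 2))            # look for the elbow (~1.5)

labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)          # -1 marks noise territories
print(pd.Series(labels, index=territory_means.index, name="dbscan_cluster"))
```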
The results improved this time. From the 14-territory run with ε = 1.5 and min_samples = 5:
  • Cluster 0: 12 of the 14 territories (D, E, H, L, N, P, R, S, V, W, X, Z)
  • Noise (–1): Territories F and G

In simple terms, the DBSCAN algorithm sees F (Squamish/Whistler) and G (Pemberton/Hope) as outliers relative to the rest of BC when looking at the four mean premium values, much as they stood out in K-Means, but here they are treated as noise rather than forming their own cluster. All the other territories fall into a single dense group (Cluster 0).

Figure-18: DBSCAN updated Territory Cluster
Source: Authors' calculations

To visualize the DBSCAN results, the 14 territory mean vectors, each assigned a DBSCAN cluster label, were merged with the simplified set of 14 ICBC territorial polygons. This combined dataset was then used to generate a choropleth map, illustrating in Figure-19 the spatial distribution of clusters across British Columbia.

Figure-19: BC ICBC Territories by DBSCAN
Source: Authors' calculations

The DBSCAN algorithm therefore treats F (Squamish/Whistler) and G (Pemberton/Hope) as outliers relative to the rest of BC, labelling them as noise, while the remaining territories form a single cluster in the same pattern as before.

Hierarchical Clustering

Hierarchical clustering is an unsupervised machine learning method, often used for segmentation, that involves constructing a dendrogram: a branching diagram that reveals how data points are related. This visual structure emerges from hierarchical clustering and helps uncover natural groupings within the dataset (Roux, 2018). As Pramono, Surjandari, and Laoh (2019) describe, agglomerative hierarchical clustering can be applied to the prepared dataset to evaluate its results alongside those from K-Means. Using Ward's minimum variance method, which combines Ward's linkage with Euclidean distance, the algorithm constructs clusters by iteratively merging the groups that result in the smallest increase in overall variance. This approach ensures that each resulting cluster maintains high internal consistency, with its members being closely related.

Figure-20: Hierarchical Clustering Dendrogram
Source: Authors' calculations

Figure-20 shows that the dendrogram has two clear jumps:
1. The largest jump occurs at a linkage distance of about 9; cutting here yields 2 clusters:
   o Cluster A: G, F, H, S, R
   o Cluster B: D, E, L, N, P, V, W, X, Z
2. The next jump occurs at about 5; cutting here yields 3 clusters:
   o Cluster 1: G, F, H, S
   o Cluster 2: R, D, V, W, Z
   o Cluster 3: E, L, P, N, X

K = 3 was selected because it separates out a middle group within the large interior/north-coast block. Ward's agglomerative clustering was then run on the territory-level means to create three clusters; the resulting labels were joined back to the custom ICBC BC GeoJSON, and a clean choropleth was rendered with territory codes placed at their centroids.

Figure-21: BC ICBC Territories by Hierarchical Clusters
Source: Authors' calculations

As Figure-21 illustrates, among the clustering methods evaluated, hierarchical clustering has yielded the most favorable results to date. While the alternative techniques offered varying levels of performance, hierarchical clustering consistently demonstrated superior effectiveness and interpretability in this analysis, as it was able to partition the data into three meaningful clusters. A minimal sketch of this step appears below.
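The sketch assumes a 14-territory mean-premium table like the one used for DBSCAN; a random placeholder is generated here so the snippet runs on its own, and the territory codes and column names are illustrative. It draws the dendrogram with Ward's linkage and cuts the tree at three clusters.

```python
# Minimal sketch: Ward's agglomerative clustering on standardized territory means,
# a dendrogram for inspection, and a cut into K = 3 clusters.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
territory_means = pd.DataFrame(                                 # placeholder 14 x 4 summary
    rng.normal(size=(14, 4)),
    columns=["premium_normal", "premium_structural", "premium_bimodal", "premium_uniform"],
    index=list("DEFGHLNPRSVWXZ"))                               # illustrative territory codes

X = StandardScaler().fit_transform(territory_means)

Z = linkage(X, method="ward", metric="euclidean")               # Ward's minimum variance

plt.figure(figsize=(8, 4))
dendrogram(Z, labels=territory_means.index.to_list())
plt.title("Hierarchical Clustering Dendrogram (territory means)")
plt.tight_layout()
plt.show()

territory_means["hier_cluster"] = fcluster(Z, t=3, criterion="maxclust")  # cut at K = 3
print(territory_means["hier_cluster"])
```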
Conclusion

The central idea of this paper was to examine whether machine learning algorithms could be used to enhance insurance premium zoning by leveraging regional crash data and reconstructing risk-based pricing schemes. Through the generation of estimated premiums using four data distributions, combined with clustering techniques, regression analysis, and sensitivity analysis, this study explored patterns in geographic risk and assessed how stable premium predictions remain when inputs are varied. The results suggest that using structurally generated, risk-informed data can lead to more accurate and reliable pricing outcomes, offering support for updating traditional territory-based insurance zoning with more data-driven methods.

The findings of this study carry several implications for ICBC's current premium policy approach, suggesting opportunities for refinement and modernization. At present, ICBC relies on fixed geographic boundaries and standardized rating tables that may not fully capture the diversity of risk profiles across British Columbia. The clustering analysis reveals that certain regions, such as Squamish/Whistler and Pemberton/Hope, exhibit crash severity and premium patterns that diverge from provincial norms. This suggests that the existing territorial zoning may benefit from re-evaluation, with the potential to redraw boundaries consistent with the results of such machine learning algorithms so that premiums more accurately reflect localized risk.

The sensitivity analysis further indicates that some rating factors, particularly the Combined Driver Factor (CDF) and the Unlisted Driver Accident Premium (UDAP), have a disproportionate influence on premium outcomes. These findings imply that ICBC could achieve greater precision and fairness by focusing its calibration efforts on these high-impact variables. Conversely, factors that showed minimal influence, such as ASTF and UDPP, might be simplified or removed, reducing complexity in the rating structure without significantly affecting pricing accuracy.

Another implication arises from the identification of outlier territories through DBSCAN clustering. These regions do not conform to broader patterns and may not be well served by a uniform pricing approach. ICBC could consider developing tailored strategies for these areas, such as customized renewal programs or targeted communications, to better address their distinct risk characteristics and avoid adverse selection.

The study also highlights the potential value of machine learning in premium setting. The strong performance of gradient boosted tree models suggests that nonlinear interactions among risk factors can offer predictive insights beyond traditional factor tables. This points to the possibility of piloting an enhanced ML pricing engine that updates regularly with new claims data, allowing ICBC to respond more dynamically to changes in driving behavior, infrastructure, and emerging risks.

Finally, the study underscores the importance of transparency in building public trust. By sharing the rationale behind clustering methods, sensitivity results, and model performance, ICBC could help policyholders better understand how their premiums are determined. This could improve confidence in the system and clarify which factors are within a driver's control versus those shaped by external conditions. ICBC's current premium policy approach could therefore be strengthened by adopting more granular, data-driven methods. These changes have the potential to improve fairness, responsiveness, and transparency, while better aligning premiums with actual risk across the province.
In conclusion, a robust end-to-end pipeline was developed to understand and model contraventions and to evaluate the zones across British Columbia. From cleaning crash records and assigning territories to estimating premiums using four approaches, a foundation for data-driven insurance modeling was built. Our clustering methods revealed consistent geographic patterns, particularly highlighting territories F and G as outliers. Sensitivity analysis and feature correlations deepened our understanding, while benchmarking the regression models confirmed that tree-based techniques, especially XGBoost, are best suited to these engineered features. Hierarchical clustering produced the best results, with three clusters.

Future Scope and Limitations

The implementation of advanced machine learning models presents a promising direction for improving ICBC's current premium-setting scheme. These models, particularly gradient boosted tree algorithms such as XGBoost, LightGBM, and CatBoost, offer the ability to capture complex, non-linear relationships among crash severity, territorial risk, vehicle type, and other variables. This enables more accurate risk stratification within existing zones, allowing for personalized pure premiums that reduce cross-subsidization between low- and high-risk drivers. Clustering techniques, including K-Means and hierarchical clustering, reveal opportunities to redraw territorial boundaries based on actual claims experience. This could lead to more homogeneous risk pools and enhance the accuracy of base rates. Additionally, DBSCAN analysis identifies outlier regions that do not conform to broader patterns. These territories could be treated with bespoke rating strategies or targeted mitigation programs, rather than being forced into generalized tables.

The sensitivity analysis highlights key rating factors, such as the Combined Driver Factor (CDF) and the Unlisted Driver Accident Premium (UDAP), that have the greatest impact on premium outcomes. Focusing calibration efforts on these high-leverage variables could improve pricing responsiveness while simplifying less influential components of the rating structure. A key advantage of the machine learning approach is its capacity for continuous model updating. Unlike static tables revised annually, the models can be retrained on fresh claims data at regular intervals, allowing ICBC to respond more dynamically to emerging trends such as changes in driving behavior, infrastructure developments, or economic shifts. Additionally, tree-based models offer transparency through feature importance and partial dependence analysis, helping ICBC explain rate decisions to policyholders and regulators.

However, several limitations must be acknowledged. The current models rely on aggregated territorial data, which limits spatial resolution and may obscure localized risk patterns. Future iterations could benefit from finer geographic granularity, rigorous threshold validation, and the inclusion of contextual variables such as weather conditions, time-of-day effects, and traffic density. The absence of driver-specific crash data also necessitated certain methodological compromises, constraining the ability to model individual-level risk with precision. Operationally, integrating machine learning into ICBC's pricing workflow will require investment in infrastructure, data governance, and staff training. Regulatory considerations around algorithmic fairness and transparency must also be addressed to ensure compliance and public trust.
Moreover, the effectiveness of these models depends on the quality and timeliness of input data, underscoring the importance of robust data pipelines and validation protocols. Despite these challenges, the approach outlined in this study establishes a replicable framework for data-driven risk pricing and road safety evaluation across diverse geographic contexts. With continued refinement and careful implementation, it can help achieve a pathway towards a more adaptive, equitable, and transparent premium-setting system.

References

1. Akinshin, A. (2022). Quantile absolute deviation: A generalization of the median absolute deviation. arXiv. https://arxiv.org/pdf/2208.13459
2. Andersen, K. A., Boomsma, T. K., Efkes, B., & Forget, N. (2025). Sensitivity analysis of the cost coefficients in multiobjective integer linear optimization. Management Science, 71(2), 1120–1137. https://doi.org/10.1287/mnsc.2021.01406
3. Chicco, D., Warrens, M. J., & Jurman, G. (2021). The coefficient of determination R² is more informative than SMAPE, MAE, MAPE, MSE, and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623. https://doi.org/10.7717/peerj-cs.623
4. David, M. (2015). Auto insurance premium calculation using generalized linear models. Procedia Economics and Finance, 20, 147–156.
5. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010
6. Firdaus, G. M., & Suryani, V. (2024). Enhanced pruning process with DBSCAN for attack detection. In 2024 1st International Conference on Cyber Security and Computing (CyberComp) (pp. 113–118). IEEE. https://doi.org/10.1109/CyberComp60759.2024.10913747
7. Insurance Corporation of British Columbia. (2020). Basic tariff. https://www.icbc.com/assets/pa/6PyY5DEcoIT7z2ujjlwb3d/basic-tariff.pdf
8. Goldberger, A. S. (1972). Structural equation methods in the social sciences. Econometrica, 40(6), 979–1001. https://doi.org/10.2307/1913851
9. Haji Mohammad, F. (2023). Insurance premium calculation using machine learning methodologies (Master's thesis, Carleton University). Carleton University Institutional Repository.
10. Healy, M., & Westmacott, M. (1956). Missing values in experiments analysed on automatic computers. Applied Statistics, 5(3), 203–206.
11. Henckaerts, R., Antonio, K., Clijsters, M., & Verbelen, R. (2018). A data driven binning strategy for the construction of insurance tariff classes. Scandinavian Actuarial Journal, 2018(8), 681–705. https://doi.org/10.1080/03461238.2018.1429300
12. Hodson, T. O. (2022). Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not. Geoscientific Model Development, 15(14), 5481–5487. https://doi.org/10.5194/gmd-15-5481-2022
13. Karakilic, M., Hatas, H., & Pacal, I. (2025). Open-circuit fault detection in T-type MLI using XGBoost: A machine learning-based approach. In 2025 7th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (ICHORA) (pp. 1–6). IEEE. https://doi.org/10.1109/ICHORA65333.2025.11017205
14. Kaur, B., & Saini, J. R. (2022). A strategy to identify loyalty using elbow curve method for customer segmentation. In 2022 IEEE Pune Section International Conference (PuneCon) (pp. 1–7). IEEE. https://doi.org/10.1109/PuneCon55413.2022.10014742
15. Ke, G. L., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems (Vol. 30, pp. 3149–3157). Curran Associates, Inc.
16. Kowshalya, G., & Nandhini, M. (2018). Predicting fraudulent claims in automobile insurance. In 2018 2nd International Conference on Inventive Communication and Computational Technologies (ICICCT) (pp. 1338–1343). IEEE. https://ieeexplore.ieee.org/document/8473034
17. Kumar, R., Upadhyay, S., Rakhra, M., Mrsic, L., Prashar, D., & Khan, A. A. (2024). A machine learning and ratemaking evaluation of four auto insurance pure premium modeling algorithms. In 2024 28th International Computer Science and Engineering Conference (ICSEC) (pp. 1–8). IEEE. https://doi.org/10.1109/ICSEC62781.2024.10770731
18. Kumar, S., Rani, R., Pippal, S. K., & Agrawal, R. (2025). Customer segmentation in e-commerce: K-means vs hierarchical clustering. Telkomnika, 23(1), 119–128. https://doi.org/10.12928/TELKOMNIKA.v23i1.26384
19. Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A better measure than accuracy in comparing learning algorithms. In Proceedings of the 16th Canadian Conference on Artificial Intelligence (pp. 329–341). https://doi.org/10.1007/3-540-44886-1_25
20. Luo, M., Zhang, R., Yan, X., Ma, J., Huang, J., & Chen, X. (2021). Combination of feature selection and CatBoost for prediction: The first application to the estimation of aboveground biomass. Forests, 12(2), 216. https://www.mdpi.com/1999-4907/12/2/216
21. Masruroh, S. U., Fadilah, A. T., Hulliyah, K., Ramadhan, A. F., Putri, R. A., & Irfan, M. A. (2023). Implementation of the K-means clustering algorithm for targeting ads. Case study: IBM Watson Analytics car insurance customer data. In 2023 11th International Conference on Cyber and IT Service Management (CITSM) (pp. 1–3). IEEE. https://doi.org/10.1109/CITSM60085.2023.10455449
22. Pramono, P. P., Surjandari, I., & Laoh, E. (2019). Estimating customer segmentation based on customer lifetime value using two-stage clustering method. In 2019 16th International Conference on Service Systems and Service Management (ICSSSM) (pp. 1–5). IEEE. https://doi.org/10.1109/ICSSSM.2019.8887704
23. Roux, M. (2018). A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification, 35(3), 345–366. https://doi.org/10.1007/s00357-018-92599
24. Qian, Q., Jia, X., Lin, H., & Zhang, R. (2021). Seasonal forecast of nonmonsoonal winter precipitation over the Eurasian continent using machine-learning models. Journal of Climate, 34(17), 7113–7129. https://doi.org/10.1175/JCLI-D-21-0113.1
25. Samat, A., Li, E., Du, P., Liu, S., Miao, Z., & Zhang, W. (2022). CatBoost for RS image classification with pseudo label support from neighbor patches-based clustering. IEEE Geoscience and Remote Sensing Letters, 19, Article 8004105. https://doi.org/10.1109/LGRS.2020.3038771
26. Saranya, G., & Pravin, A. (2021). An efficient feature selection approach using sensitivity analysis for machine learning based heart disease classification. In 2021 11th International Conference on Communication Systems and Network Technologies (CSNT) (pp. 539–542). IEEE. https://doi.org/10.1109/CSNT51715.2021.9509673
27. Selvakumar, V., Satpathi, D. K., Kumar, P. T. V. P., & Haragopal, V. V. (2021). Predictive modeling of insurance claims using machine learning approach for different types of motor vehicles. Universal Journal of Accounting and Finance, 9(1), 1–14. https://doi.org/10.13189/ujaf.2021.090101
28. Sharma, N. (2023, June 6). Understanding and applying F1 score: AI evaluation essentials with hands-on coding example. Arize AI. https://arize.com/blog-course/f1-score/
29. Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. https://doi.org/10.1016/j.ipm.2009.03.002
30. Węglarczyk, S. (2018). Kernel density estimation and its application. ITM Web of Conferences, 18, 00037. https://doi.org/10.1051/itmconf/20182300037
31. Zheng, B., Shui, Q., Luo, Z., Hu, P., Yang, Y., Lei, J., & Yin, G. (2025). Optimization of engine piston performance based on multi-method coupling: Sensitivity analysis, response surface model, and application of genetic algorithm. Materials, 18(13), 3043. https://doi.org/10.3390/ma18133043
32. Zimmerman, N., Presto, A. A., Kumar, S. P. N., Gu, J., Hauryliuk, A., Robinson, E. S., Robinson, A. L., & Subramanian, R. (2018). A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring. Atmospheric Measurement Techniques, 11(1), 291–313. https://doi.org/10.5194/amt-11291-2018

Appendix

A.1.
A.2.
A.3.