A Fraud Detection Model for Car Insurance Claims

By Michael Hirschberger

September 19, 2024

Fraud Detection in Auto Insurance Claims

Introduction

This analysis explores trends in car insurance claim fraud. A machine learning model was built to classify new claims as fraudulent or not. The dataset consists of 15,420 claims from January 1994 to December 1996, with various features describing each claim, including the date and location of the incident, driver information, car information, and policy information. There is also a column, FraudFound_P, which indicates whether or not the claim was fraudulent.

This model would enable insurance companies to target those claims that are most likely to be fraudulent. By eliminating claims unlikely to be fraudulent, human insurance claims specialists only have to examine those claims that have a high probability of fraud, thus making the determination of fraud in insurance claims a more time- and cost-effective process.
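
The triage idea described above can be sketched as follows. This is a minimal illustration rather than the author's code: the scores are made-up numbers, and `select_for_review` and its `review_fraction` parameter are hypothetical names for "the share of claims that specialists have capacity to examine".

```python
import numpy as np

def select_for_review(fraud_scores, review_fraction=0.2):
    """Return indices of the highest-scoring claims, riskiest first.

    fraud_scores: predicted fraud probabilities, one per claim.
    review_fraction: hypothetical share of claims specialists can review.
    """
    scores = np.asarray(fraud_scores, dtype=float)
    n_review = max(1, int(np.ceil(review_fraction * len(scores))))
    # argsort is ascending, so reverse it to rank the riskiest claims first
    ranked = np.argsort(scores)[::-1]
    return ranked[:n_review]

# Made-up scores for ten claims, for illustration only
scores = [0.02, 0.91, 0.10, 0.05, 0.67, 0.01, 0.03, 0.30, 0.08, 0.04]
to_review = select_for_review(scores, review_fraction=0.2)
print(to_review)  # indices of the two highest-scoring claims
```

In practice the review fraction would be set by staffing and by how many missed fraud cases are acceptable, which is why ranking quality (ROC AUC and PR AUC) matters more here than a single accuracy number.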

Evaluation Metric	Logistic Regression	Random Forest	CatBoost	XGBoost Classifier
ROC AUC	0.80	0.85	0.97	0.97
PR AUC	0.16	0.28	0.67	0.69

Logistic Regression and Random Forest performed worse on both metrics, while CatBoost and XGBoost Classifier performed markedly better, with the largest gap in PR AUC.
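
For readers unfamiliar with the two metrics, here is a small sketch of how they can be computed with scikit-learn. The labels and scores are made up for illustration, and `average_precision_score` is only one common summary of the PR curve; the source does not say how its PR AUC was calculated. A useful reference point is that a classifier with no skill has a ROC AUC of about 0.5 but a PR AUC close to the share of positives, so PR AUC is the harsher metric when fraud is rare.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up labels (1 = fraud) and model scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.80, 0.25, 0.70, 0.35])

roc_auc = roc_auc_score(y_true, y_score)           # ranking quality over all pairs
pr_auc = average_precision_score(y_true, y_score)  # summary of the PR curve
baseline = y_true.mean()                           # PR AUC of a no-skill model

print(f"ROC AUC: {roc_auc:.4f}")
print(f"PR AUC:  {pr_auc:.4f} (no-skill baseline {baseline:.2f})")
```

On this toy data the ROC AUC looks respectable while the PR AUC is noticeably lower, the same pattern seen for Logistic Regression and Random Forest in the table.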

You can find a copy of my Python scripts for this project in this GitHub repository.

Exploratory Data Analysis

To begin, exploratory data analysis was performed to identify relationships in the data and to assess how individual features relate to the incidence of fraud. Three relationships in particular stood out:

Data Cleansing

Next, data cleansing was performed to prepare the data for modeling. To run the logistic regression and random forest models, categorical data needed to be converted to numerical data. This was achieved using the scikit-learn preprocessing classes OneHotEncoder and OrdinalEncoder.
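
A minimal sketch of this encoding step is shown below. It is not the author's script: the toy rows, the `VehiclePriceBand` column, and its category order are assumptions made for illustration, and combining the two encoders in a `ColumnTransformer` is one convenient way to apply them, not necessarily the one used in the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy rows; column names and categories are illustrative assumptions
claims = pd.DataFrame({
    "Fault": ["Policy Holder", "Third Party", "Policy Holder"],
    "VehiclePriceBand": ["low", "high", "medium"],
})

encoder = ColumnTransformer(
    transformers=[
        # Nominal feature: one binary column per category
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Fault"]),
        # Ordered feature: integer codes that follow the stated order
        ("ordinal",
         OrdinalEncoder(categories=[["low", "medium", "high"]]),
         ["VehiclePriceBand"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = encoder.fit_transform(claims)
print(X)  # two one-hot columns for Fault, then the ordinal code
```

One-hot encoding suits categories with no natural order, while ordinal encoding keeps a single column for categories that do have one.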

Modeling

Logistic Regression

The first model fitted was Logistic Regression. The best-performing ROC and PR curves for this model were as follows:

The ROC AUC of 0.80 indicates decent performance, but there is still room for improvement. The PR AUC of 0.156 is low, indicating that the model cannot reach useful recall on fraudulent claims without a large loss of precision.

The feature importance bar chart shown above indicates that Fault_ThirdParty and Fault_PolicyHolder were the two features that were the largest predictors of fraud in the model.
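
One common way to obtain such a ranking from a logistic regression is to compare coefficient magnitudes. The sketch below uses synthetic, imbalanced data as a stand-in for the claims dataset, so the feature names and numbers are placeholders; standardizing the inputs first is an assumption made here so that coefficients are comparable, and the source does not say how its importances were computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with a rare positive class (not the real claims data)
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3,
    weights=[0.94], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# With standardized inputs, |coefficient| is a rough importance measure
coefs = model.named_steps["logisticregression"].coef_[0]
ranking = np.argsort(np.abs(coefs))[::-1]
for idx in ranking[:3]:
    print(f"feature_{idx}: coefficient {coefs[idx]:+.3f}")
```

The sign of each coefficient also shows the direction of the effect, which tree-based importance scores do not provide.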

Random Forest

The Random Forest model was fitted next. The results were as follows:

The ROC AUC of 0.85 indicates strong performance, but there is still room for improvement. The PR AUC of 0.269 shows a better precision-recall trade-off than Logistic Regression. Overall, Random Forest performed better than Logistic Regression in this analysis.

In contrast to Logistic Regression, PolicyNumber was seen to be the most important feature when using the Random Forest model.
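
A result like this is worth cross-checking. Impurity-based importances in tree ensembles are known to favor numeric columns with many distinct values, so if PolicyNumber is an identifier rather than a genuine risk factor (an assumption; the source does not say), its ranking may be overstated. The sketch below uses synthetic data with a hypothetical, uninformative `row_id` column and compares the forest's built-in importances with permutation importance measured on held-out data. It illustrates the check only and says nothing about the actual dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real claims data)
X, y = make_classification(
    n_samples=1500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9], random_state=0,
)
# Append a hypothetical row identifier that carries no information about y
rng = np.random.default_rng(0)
row_id = rng.permutation(len(y)).reshape(-1, 1)
X = np.hstack([X, row_id])
names = [f"feature_{i}" for i in range(5)] + ["row_id"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Built-in importances are computed from the training data only
impurity = forest.feature_importances_
# Permutation importance: drop in held-out score when a column is shuffled
perm = permutation_importance(
    forest, X_test, y_test,
    scoring="average_precision", n_repeats=5, random_state=0,
)

for name, imp, p in zip(names, impurity, perm.importances_mean):
    print(f"{name:>10}: impurity={imp:.3f}  permutation={p:+.3f}")
```

If a column scores high on impurity importance but near zero on permutation importance, it is a candidate for removal before retraining.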

CatBoost

CatBoost is a gradient boosting framework that handles categorical features natively. The results for the best-performing model were as follows:

These results showed a significant improvement over the first two models, both in the ROC curve and the PR curve.

As with Random Forest, PolicyNumber was the most important feature when running the model using CatBoost.

XGBoost Classifier

XGBoost showed results similar to CatBoost. However, the most important features for XGBoost were Fault_ThirdParty and Fault_PolicyHolder:

About O'Fallon Labs

At O'Fallon Labs, we help recent graduates and professionals get started and thrive in their Data Science careers via 1:1 mentoring and more.

Michael Hirschberger

Michael is an experienced engineer and an expert Data Analyst. He is currently working as an engineer for NYCDEP in New York City. At O'Fallon Labs, he worked on Data Science projects through 1-on-1 mentoring sessions.