A Fraud Detection Model for Car Insurance Claims

By Michael Hirschberger

September 19, 2024


Fraud Detection in Auto Insurance Claims

Introduction

This analysis explores trends in car insurance claim fraud. A machine learning model was created to predict whether new claims are fraudulent. The dataset consists of 15,420 claims filed from January 1994 to December 1996, with various features about each claim, including the date and location of the incident, driver information, car information, and policy information. There is also a column, FraudFound_P, which indicates whether or not the claim was fraudulent.

Such a model would enable insurance companies to target the claims that are most likely to be fraudulent. By screening out claims that are unlikely to be fraudulent, human claims specialists only need to examine those with a high probability of fraud, making the determination of fraud a more time- and cost-effective process.

Evaluation Metric   Logistic Regression   Random Forest   CatBoost   XGBoost Classifier
ROC AUC             0.80                  0.85            0.97       0.97
PR AUC              0.16                  0.28            0.67       0.69

These results show weaker performance from Logistic Regression and Random Forest, and considerably better performance from CatBoost and the XGBoost Classifier.

You can find a copy of my Python scripts for this project in this GitHub repository.

Exploratory Data Analysis

To begin, exploratory data analysis was performed to identify relationships in the data and to ascertain correlations between features and the incidence of fraud. Three features in particular showed a clear relationship with fraud:

[Figures: exploratory charts of the three relationships]
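As a rough illustration of this step, the sketch below computes the fraud rate per category for a handful of features using pandas. The file name and the column names other than FraudFound_P are assumptions about the dataset's schema, not taken from the original analysis.

```python
import pandas as pd

# Load the claims data; the file name is an assumption, adjust it to your copy of the dataset
claims = pd.read_csv("carclaims.csv")

def fraud_rate_by(feature: str) -> pd.Series:
    """Share of fraudulent claims (FraudFound_P == 1) within each category of `feature`."""
    return (claims.groupby(feature)["FraudFound_P"]
                  .mean()
                  .sort_values(ascending=False))

# The column names below are assumptions about the schema, shown only as examples
for col in ["Fault", "VehicleCategory", "BasePolicy"]:
    if col in claims.columns:
        print(fraud_rate_by(col), end="\n\n")
```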

Data Cleansing

Next, data cleansing was performed to prepare the data for modeling. In order to run the Logistic Regression and Random Forest models, categorical data needed to be converted to numerical data. This was achieved using the scikit-learn preprocessing classes OneHotEncoder and OrdinalEncoder.
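A minimal sketch of this step is shown below, continuing from the loading snippet above. Which columns are treated as nominal versus ordinal is an assumption made for illustration; the author's exact mapping is not specified here.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Assumed split of categorical columns: nominal ones are one-hot encoded,
# ordered band-like ones are ordinal encoded. Any other categorical columns
# in the data should be added to one of these lists.
nominal_cols = ["Fault", "VehicleCategory", "BasePolicy"]
ordinal_cols = ["AgeOfPolicyHolder", "VehiclePrice"]

encoder = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
        ("ordinal", OrdinalEncoder(), ordinal_cols),
    ],
    remainder="passthrough",  # keep the remaining (numeric) columns as-is
)

X = claims.drop(columns=["FraudFound_P"])
y = claims["FraudFound_P"]
X_encoded = encoder.fit_transform(X)
```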

Modeling

Logistic Regression

The first model trained was Logistic Regression. The best performing ROC and PR curves for this model were as follows:

[Figures: ROC and PR curves for the Logistic Regression model]

The ROC AUC of 0.80 indicates decent model performance, but there is still room for improvement. The PR AUC of 0.156 indicates poor precision-recall performance, reflecting how rare fraudulent claims are in the data.
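The sketch below shows how such a model and its two metrics might be produced with scikit-learn, reusing X_encoded and y from the previous snippet; the hyperparameters are illustrative rather than the tuned values behind the numbers above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" is one reasonable choice for a rare positive class
logreg = LogisticRegression(max_iter=1000, class_weight="balanced")
logreg.fit(X_train, y_train)

probs = logreg.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))
print("PR AUC :", average_precision_score(y_test, probs))  # average precision ~ PR AUC
```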

[Figure: feature importance bar chart for the Logistic Regression model]

The feature importance bar chart shown above indicates that Fault_ThirdParty and Fault_PolicyHolder were the two largest predictors of fraud in the model.
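One common way to build such a chart is to rank the absolute values of the fitted coefficients against the encoded feature names, as sketched below (reusing encoder and logreg from the earlier snippets); this is an illustration, not necessarily how the original chart was produced.

```python
import numpy as np
import pandas as pd

# Coefficient magnitudes as a rough proxy for feature importance
feature_names = encoder.get_feature_names_out()
coef_importance = pd.Series(np.abs(logreg.coef_[0]), index=feature_names)
print(coef_importance.sort_values(ascending=False).head(10))
```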

Random Forest

The Random Forest model was trained next. The results were as follows:

[Figures: ROC and PR curves for the Random Forest model]

The ROC AUC of 0.85 indicates strong performance, but there is still room for improvement. The PR AUC of 0.269 shows an improvement in precision-recall performance over Logistic Regression. Overall, Random Forest performed better than Logistic Regression in this analysis.

[Figure: feature importance bar chart for the Random Forest model]

In contrast to Logistic Regression, PolicyNumber was seen to be the most important feature when using the Random Forest model.
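A comparable Random Forest sketch is shown below, again with illustrative hyperparameters and reusing the train/test split from the Logistic Regression snippet; feature_importances_ is where a column such as PolicyNumber would surface at the top.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
import pandas as pd

rf = RandomForestClassifier(
    n_estimators=500,          # illustrative settings, not the tuned values
    class_weight="balanced",
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)

rf_probs = rf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, rf_probs))
print("PR AUC :", average_precision_score(y_test, rf_probs))

# Impurity-based feature importances
rf_importance = pd.Series(rf.feature_importances_, index=encoder.get_feature_names_out())
print(rf_importance.sort_values(ascending=False).head(10))
```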

CatBoost

CatBoost is a gradient boosting framework that handles categorical features natively, without requiring one-hot or ordinal encoding. The results for the best performing model were as follows:

[Figures: ROC and PR curves for the CatBoost model]

These results showed a significant improvement over the first two models, in both the ROC curve and the PR curve.

[Figure: feature importance bar chart for the CatBoost model]

As with Random Forest, PolicyNumber was the most important feature when running the model with CatBoost.
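Because CatBoost accepts raw categorical columns, a sketch can work directly on the unencoded frame; the settings below are illustrative rather than the tuned values from this analysis, and cat_features is simply every string-typed column.

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Split the *unencoded* frame; CatBoost handles the categorical columns itself
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
cat_cols = X.select_dtypes(include="object").columns.tolist()

cb = CatBoostClassifier(
    iterations=1000,        # illustrative settings, not the tuned values
    learning_rate=0.05,
    eval_metric="AUC",
    random_seed=42,
    verbose=False,
)
cb.fit(X_train_raw, y_train_raw, cat_features=cat_cols,
       eval_set=(X_test_raw, y_test_raw))

cb_probs = cb.predict_proba(X_test_raw)[:, 1]
print("ROC AUC:", roc_auc_score(y_test_raw, cb_probs))
print("PR AUC :", average_precision_score(y_test_raw, cb_probs))
```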

XGBoost Classifier

XGBoost showed results similar to CatBoost. However, the most important features for XGBoost were Fault_ThirdParty and Fault_PolicyHolder:

[Figures: ROC curve, PR curve, and feature importance for the XGBoost model]
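A sketch along the same lines is shown below, using the one-hot/ordinal-encoded matrix from the earlier snippets (which is why importances appear under names such as Fault_ThirdParty); scale_pos_weight is one common way to account for the class imbalance, and the other settings are illustrative rather than the tuned values.

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

# Up-weight the rare fraudulent class; reuses X_train/X_test from the earlier snippets
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb = XGBClassifier(
    n_estimators=500,       # illustrative settings, not the tuned values
    learning_rate=0.05,
    max_depth=6,
    scale_pos_weight=pos_weight,
    eval_metric="aucpr",
)
xgb.fit(X_train, y_train)

xgb_probs = xgb.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, xgb_probs))
print("PR AUC :", average_precision_score(y_test, xgb_probs))
```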

About O'Fallon Labs

At O'Fallon Labs, we help recent graduates and professionals get started and thrive in their Data Science careers via 1:1 mentoring and more.



Michael Hirschberger

Michael is an experienced engineer and an expert Data Analyst. He is currently working as an engineer for NYCDEP in New York City. At O'Fallon Labs, he worked on Data Science projects through 1-on-1 mentoring sessions.
