Fraud Detection in Healthcare: The Latest Data Science Project by O'Labs Mentees (Public Dataset)
By Pradyumna Lanka

December 3, 2023

Fraud Detection in Healthcare Claims (at Provider Level)

Before you start reading, please note that the full GitHub repository for this project includes the Python code and Jupyter notebooks. Below is an overview and summary of the project.

1. The Business Problem

Health care fraud is a huge problem in the United States. As the FBI website notes, health care fraud is not a victimless crime, and it causes tens of billions of dollars in losses each year. It can raise health insurance premiums, expose you to unnecessary medical procedures, and increase taxes. In this project, I predict potentially fraudulent providers based on the providers' claims. The dataset for this analysis is from Kaggle. It contains claims and beneficiary information for many patients and providers.

2. Exploratory Data Analysis (EDA)

As expected, the data is heavily imbalanced: only 506 of the 5410 providers are flagged as potentially fraudulent, so providers who have not committed fraud outnumber those who have by a factor of more than 9. From the beneficiaries dataset, it is clear that most beneficiaries were covered for the entire 12 months of the year and that a large percentage of them have chronic conditions. Furthermore, we can look at the procedure codes and diagnosis codes for the inpatient and outpatient claims and extract the 100 most common codes as features to improve our prediction of fraudulent providers. A word cloud of the diagnosis codes for outpatient claims is shown below.
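The imbalance quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two counts given in the text; the variable names are ours.

```python
# Provider counts quoted in the text above.
total_providers = 5410
flagged_providers = 506

unflagged_providers = total_providers - flagged_providers
fraud_share = flagged_providers / total_providers
imbalance_ratio = unflagged_providers / flagged_providers

print(f"Unflagged providers: {unflagged_providers}")
print(f"Share flagged as potentially fraudulent: {fraud_share:.1%}")
print(f"Unflagged to flagged ratio: {imbalance_ratio:.1f} to 1")
```

With fewer than one provider in ten flagged, accuracy alone is a weak yardstick, which is why precision, recall, and F1 matter in the evaluation section.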

[Figure: word cloud of the diagnosis codes for outpatient claims]

An important feature for predicting fraudulent providers could be the number of days of stay at the hospital for inpatient claims, which is shown in the figure below.

[Figure: days of stay at the hospital for inpatient claims]
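A days-of-stay feature like this could be derived from admission and discharge dates. The sketch below uses a toy table; the column names (`AdmissionDt`, `DischargeDt`, `Provider`) and the choice to count a same-day discharge as 0 days are assumptions for illustration, not necessarily what the project notebooks do.

```python
import pandas as pd

# Toy inpatient claims; column names are assumptions for illustration.
inpatient = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3"],
    "Provider": ["P1", "P1", "P2"],
    "AdmissionDt": ["2009-01-01", "2009-02-10", "2009-03-05"],
    "DischargeDt": ["2009-01-04", "2009-02-10", "2009-03-12"],
})

# Parse the date strings so they can be subtracted.
for col in ["AdmissionDt", "DischargeDt"]:
    inpatient[col] = pd.to_datetime(inpatient[col])

# Days between admission and discharge; a same-day discharge counts as 0 here.
inpatient["DaysOfStay"] = (
    inpatient["DischargeDt"] - inpatient["AdmissionDt"]
).dt.days

# Average stay per provider, one candidate provider-level feature.
mean_stay_by_provider = inpatient.groupby("Provider")["DaysOfStay"].mean()
print(mean_stay_by_provider)
```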

3. Data Cleaning and Preparation

Some of the variables are categorical but are stored as strings, so we convert them to categorical variables. Many of the variables describing patient conditions are also converted to Booleans so that they can be used as features for the machine learning algorithms. Additionally, we can use the inpatient and outpatient claim processing times as features. We extract the top 100 diagnosis codes and procedure codes for inpatient and outpatient claims from the claims data. Since we are given beneficiary and claims data but are interested in classifying providers, we first join the claims data with the beneficiary data and then group the joined data by provider. We then use the aggregated per-provider information as features for our machine learning classification models.
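The join-then-aggregate step described above might look roughly like the following pandas sketch. The tables are toy data, and every column name, the 1/2 coding of the chronic-condition flag, and the chosen aggregates are assumptions for illustration. The sketch keeps the top 2 diagnosis codes where the project keeps the top 100.

```python
import pandas as pd

# Toy claims and beneficiary tables; all column names are assumptions.
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "BeneID": ["B1", "B2", "B1", "B3", "B3"],
    "Provider": ["P1", "P1", "P2", "P2", "P2"],
    "ClaimAmount": [100.0, 250.0, 80.0, 40.0, 60.0],
    "DiagnosisCode": ["4019", "2724", "4019", "2724", "25000"],
})
beneficiaries = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3"],
    "ChronicCond_Diabetes": [1, 2, 1],  # assumed coding: 1 = yes, 2 = no
})

# Convert the assumed 1/2 coding to a Boolean feature.
beneficiaries["HasDiabetes"] = beneficiaries["ChronicCond_Diabetes"].eq(1)

# Indicator columns for the most common diagnosis codes (top 2 in this toy case).
top_codes = claims["DiagnosisCode"].value_counts().head(2).index
indicator_cols = []
for code in top_codes:
    col = f"diag_{code}"
    claims[col] = claims["DiagnosisCode"].eq(code)
    indicator_cols.append(col)

# Join claims with beneficiary data, then aggregate to one row per provider.
joined = claims.merge(beneficiaries, on="BeneID", how="left")
provider_features = joined.groupby("Provider").agg(
    n_claims=("ClaimID", "count"),
    n_beneficiaries=("BeneID", "nunique"),
    mean_claim_amount=("ClaimAmount", "mean"),
    share_diabetes=("HasDiabetes", "mean"),
    **{f"share_{col}": (col, "mean") for col in indicator_cols},
)
print(provider_features)
```

Each row of `provider_features` is then one training example, to be matched with the provider's fraud label.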

4. Modeling & Optimization

We then split the data into training and test sets and fit the following machine learning classifiers on the training features:

1. Linear SVM
2. Logistic regression
3. Random Forest
4. XGBoost
5. CatBoost
6. K-nearest neighbors

Furthermore, we use either grid search or randomized search, along with cross-validation, to choose the optimal set of hyperparameters for each of the above classifiers. We then save the model with the optimal set of hyperparameters.
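As a rough illustration of this workflow, the sketch below runs a stratified train/test split followed by a cross-validated grid search with scikit-learn. It uses synthetic data and logistic regression only. The parameter grid, the F1 scoring choice, the fold count, and the temporary file path are assumptions, not the settings used in the project.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the provider-level feature table (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Stratify so that both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",  # assumed metric; accuracy is misleading on imbalanced data
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("Best parameters:", search.best_params_)

# Save the tuned model; the path is a throwaway temporary location.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(best_model, model_path)
```

The same pattern applies to the other classifiers by swapping the estimator and the grid; `RandomizedSearchCV` is the usual alternative when the grid is too large to search exhaustively.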

5. Evaluation

For each classifier, we evaluated the best model (the one with the optimal set of hyperparameters) on the hold-out test data. The ROC curves, which plot the true-positive rate against the false-positive rate over a range of thresholds, and the precision-recall curves are shown for all the classifiers. Furthermore, the classifiers are evaluated on several metrics, including accuracy, recall, precision, F1 score, and the area under the ROC curve (AUC).

[Figure: ROC curves for all classifiers]

[Figure: precision-recall curves for all classifiers]

Classifier            F1 Score   Precision   Recall   Accuracy   AUC
SVM                   0.56       0.86        0.42     0.94       0.87
Logistic Regression   0.44       0.29        0.90     0.80       0.91
Random Forest         0.58       0.83        0.44     0.94       0.96
XGBoost               0.60       0.70        0.52     0.94       0.95
CatBoost              0.70       0.83        0.60     0.95       0.96
KNN                   0.56       0.94        0.40     0.94       0.95

From the above performance metrics, CatBoost has the highest F1 score (0.70) and accuracy (0.95) and ties Random Forest for the highest AUC (0.96), so it provides the best balance between precision and recall. Hence this is the model we would deploy.
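The F1 score is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below does this with the values copied from the table; small differences are expected because the published precision and recall are already rounded to two decimals.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "SVM": (0.86, 0.42, 0.56),
    "Logistic Regression": (0.29, 0.90, 0.44),
    "Random Forest": (0.83, 0.44, 0.58),
    "XGBoost": (0.70, 0.52, 0.60),
    "CatBoost": (0.83, 0.60, 0.70),
    "KNN": (0.94, 0.40, 0.56),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from(precision, recall)
    print(f"{name:20s} computed F1 = {f1:.3f}, reported = {f1_reported:.2f}")
```

The harmonic mean punishes lopsided models: logistic regression has the best recall (0.90) but the lowest F1 because its precision is only 0.29.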

6. Deployment

Since the trained CatBoost model has several hundred features, we would want to deploy a model with fewer features. The plot below shows the feature importance for the top features.

[Figure: CatBoost feature importance for the top features]

From the plot, it is clear that the top 5 features have very high scores. Hence we select the 5 most informative features from CatBoost based on feature importance and train a new model on those 5 features, which we deploy on Streamlit. The distribution of values for the top 5 features is shown below.

[Figure: distribution of values for the top 5 features]

The ROC curve and the precision-recall curve for the CatBoost classifier with just the top 5 features are shown below.

[Figure: ROC and precision-recall curves for the CatBoost classifier trained on the top 5 features]
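The select-then-retrain idea can be sketched as follows. To keep the example dependent only on scikit-learn, it swaps in `GradientBoostingClassifier` for CatBoost and runs on synthetic data, so the numbers it prints say nothing about the project's results. Only the pattern (rank features by importance, keep the top 5, refit, compare) is the point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 50 features, of which only 5 carry signal.
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on all features (stand-in for the full CatBoost model).
full_model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Column indices of the 5 features with the largest importance scores.
top5 = np.argsort(full_model.feature_importances_)[::-1][:5]

# Refit a smaller model on just those 5 columns.
small_model = GradientBoostingClassifier(random_state=0).fit(
    X_train[:, top5], y_train
)

auc_full = roc_auc_score(y_test, full_model.predict_proba(X_test)[:, 1])
auc_small = roc_auc_score(y_test, small_model.predict_proba(X_test[:, top5])[:, 1])
print(f"AUC with all features: {auc_full:.3f}")
print(f"AUC with top 5 features: {auc_small:.3f}")
```

Comparing the two AUC values on the hold-out set is a quick way to see how much predictive power the smaller, easier-to-deploy model gives up.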

The deployed model is hosted on Streamlit at Provider-fraud-detection, where the user can enter a provider's features and get a prediction of whether the provider is potentially fraudulent. A screenshot of the app is displayed below.

[Figure: screenshot of the Streamlit app]


About O'Fallon Labs

At O'Fallon Labs, we help recent graduates and professionals get started and thrive in their Data Science careers via 1:1 mentoring and more.


Pradyumna Lanka

Pradyumna is a postdoctoral researcher at Northeastern University with a PhD in Psychology from the University of California, Merced. He is a recent graduate mentee of the O'Fallon Labs one-on-one Data Science mentoring program.
