Walk-through Of Patient No-show Supervised Machine Learning Classification With XGBoost In R

Overview

This is a project I’ve been working on for some time to help improve the missed opportunity rate (no-show rate) at all medical centers. It demonstrates how to extract datasets from an SQL server and load them directly into an R environment. It also demonstrates the entire machine learning process, from engineering new features, tuning and training the model, and finally measuring the model’s performance. I would like to share my results and methodology as a guide to help others starting their project or to help others improve upon my results.

Results

I’ve attained a true positive rate of 0.9026 (90.2%) and a true negative rate of 0.9220 (92.2%) when setting the threshold to 0.22. Alternatively, I attained a true positive rate of 0.8220 (82.2%), a true negative rate of 0.9567 (95.6%), and a Kappa score of 0.7506 when setting the threshold to 0.34.

Features Used

Clinic No-show Rate – Patient’s no-show rate by clinic.
Clinic cancellation rate – Patient’s cancellation rate by clinic.
New Patient to Clinic – Has patient visited clinic in past 24 months.
Department No-show Rate – Patient’s no-show rate by department.
Department Cancellation Rate – Patient’s cancellation rate by department.
New Patient to Department – Has patient visited department in past 24 months.
Appointment Lead Time – Number of days patient is waiting for appointment.
Days Since Last Appointment – Number of days since patient’s previous appointment.
Consecutive No-shows By Patient – Number of most recent consecutive no-shows patient has accrued.
Consecutive No-shows By Clinic – Number of most recent consecutive no-shows patient has accrued by each clinic.
Appointment Sum Per Day – Number of appointments patient has on the same day of predicted appointment.
Appointment Sum – Patient’s appointment sum in past 12 months.
Consult Sum – Patient’s consult sum in past 12 months.
Admission Sum – Number of patient admissions in past 12 months.
Department – Appointment’s department.
Length of Appointment – Appointment’s duration in minutes.
Appointment’s Month – Appointment’s month (Jan, Feb, etc.).
Appointment’s Weekday – Appointment’s Weekday (Mon, Tues, etc.).
Appointment’s Hour – Appointment’s hour (8am, 9am, etc.).
Consult Lead Time – Number of days elapsed since consult requested.
Gender – Patient’s gender.
Age – Patient’s age.
Marital Status – Married or single.
Race – Patient’s race.
Address – Patient’s address (longitude & latitude).
Homeless – Patient diagnosed as homeless.
Substance Abuse – Patient diagnosed as a substance abuser.
Low Income – Patient diagnosed as having low income.

Loading Required Packages

The first step is to install and load the following R packages which are required to perform each step in this project. Package descriptions were sourced from their respective reference manual which can be found at https://cran.r-project.org/web/packages/.

Building The Datasets With SQL

All database table and column names have been given aliases for security reasons. In this next step, we will gather a period of two years of historical appointment information as well as patient demographic information from VHA’s Corporate Data Warehouse. We will connect R directly to Microsoft SQL Server via an ODBC connection using the RODBC package. We will use Structured Query Language (SQL) to pull the information from 11 tables. We will set three variables; start.date, end.date, and station, which allows us to reuse the same dates and facility number across each query.

We will store each table in its own data frame and combine them later because R provides us more control joining datasets with the dplyr package instead of SQL. Also, it makes writing SQL statements much easier when writing many small queries instead of one large complex and slow SQL query.

Preprocessing And Cleaning Data

Now that the raw datasets are stored in separate data frames, the next step is to clean and combine them. This step also requires a lot of domain specific knowledge. For example, we will remove all deceased patients because they will no longer be scheduling appointments and will negatively affect the model if left in. Also, we will remove patients flagged as test patients in the database. We will also remove clinics whose no-show performance is not monitored. Next, we’ll remove inpatient appointments and appointments with data entry errors. Lastly, even though appointments canceled by the facility after the appointment date/time count toward the no-show rate, we’ll remove those because they are out of the patient’s control.

We dropped 173,361 rows that would have otherwise negatively affected our model.

Next, we’ll combine the homeless, subabuse, and lowincome data frames to our working data frame d. We’ll remove duplicate patient ID records by calling subset(x, !duplicated(PatientID)) on each data frame because each patient may have multiple entries, but we only need one instance for each unique patient. Then we assign them to their matching patient IDs in our working data frame.

To read original article and R code – click here