This is a project I've been working on for some time to help reduce the missed opportunity rate (no-show rate) at medical centers. It demonstrates how to extract datasets from a SQL Server database and load them directly into an R environment. It also walks through the entire machine learning process, from engineering new features to tuning and training the model to measuring the model's performance. I would like to share my results and methodology as a guide for others starting a similar project, or as a baseline for others to improve upon.
I attained a true positive rate of 0.9026 (90.3%) and a true negative rate of 0.9220 (92.2%) when setting the classification threshold to 0.22. Alternatively, I attained a true positive rate of 0.8220 (82.2%), a true negative rate of 0.9567 (95.7%), and a Kappa score of 0.7506 when setting the threshold to 0.34.
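To make the threshold idea concrete, here is a minimal sketch of how a probability threshold converts model scores into predicted labels, and how the true positive and true negative rates fall out of that choice. The probabilities and labels below are made up purely for illustration; they are not the project's data.

```r
# Toy scores: 1 = no-show, 0 = kept the appointment (illustrative values only).
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob   <- c(0.90, 0.40, 0.25, 0.10, 0.30, 0.05, 0.70, 0.15)

# Any score at or above the threshold is predicted as a no-show.
threshold <- 0.22
pred <- as.integer(prob >= threshold)

# True positive rate (sensitivity) and true negative rate (specificity).
tpr <- sum(pred == 1 & actual == 1) / sum(actual == 1)
tnr <- sum(pred == 0 & actual == 0) / sum(actual == 0)
```

Lowering the threshold catches more true no-shows (higher TPR) at the cost of flagging more patients who would have shown up (lower TNR), which is the trade-off between the two threshold settings reported above.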
The first step is to install and load the following R packages, which are required to perform each step in this project. Package descriptions were sourced from their respective reference manuals, which can be found at https://cran.r-project.org/web/packages/.
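A common pattern for this step is sketched below: install any packages that are missing, then load them all. RODBC and dplyr are named later in this article; any other package names you add to the vector (for example, a modeling package such as caret) are assumptions about the original package list, not part of it.

```r
# Sketch: install missing packages once, then load everything.
# Only RODBC and dplyr are confirmed by the article; extend pkgs as needed.
pkgs <- c("RODBC", "dplyr")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)

# character.only = TRUE lets library() take the package name as a string.
invisible(lapply(pkgs, library, character.only = TRUE))
```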
All database table and column names have been given aliases for security reasons. In this next step, we will gather two years of historical appointment information, along with patient demographic information, from VHA's Corporate Data Warehouse. We will connect R directly to Microsoft SQL Server via an ODBC connection using the RODBC package, and we will use Structured Query Language (SQL) to pull the information from 11 tables. We will set three variables (start.date, end.date, and station) that allow us to reuse the same dates and facility number across each query.
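The connection-and-query pattern looks roughly like the sketch below. The server name, database name, station number, and the aliased table and column names are all placeholders, since the real names are withheld for security; only the RODBC calls themselves (odbcDriverConnect, sqlQuery, odbcClose) are the actual API.

```r
library(RODBC)

# Reusable query parameters (placeholder values, not the real facility).
start.date <- "2015-10-01"
end.date   <- "2017-09-30"
station    <- "999"

# Connection string details (server and database names) are assumptions.
con <- odbcDriverConnect(
  "driver={SQL Server};server=ServerName;database=CDW;trusted_connection=true"
)

# One of the 11 small queries; table/column aliases are hypothetical.
appts <- sqlQuery(con, paste0(
  "SELECT PatientID, AppointmentDateTime, ClinicID ",
  "FROM Appt.Appointment ",
  "WHERE AppointmentDateTime BETWEEN '", start.date, "' AND '", end.date, "' ",
  "AND Sta3n = ", station
))

odbcClose(con)
```

Defining start.date, end.date, and station once means every one of the 11 queries can be rebuilt with paste0() against the same window and facility.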
We will store each table in its own data frame and combine them later, because the dplyr package gives us more control over joining datasets in R than SQL does. It is also much easier to write many small SQL queries than one large, complex, and slow one.
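The combining step amounts to a series of dplyr joins. Here is a minimal sketch with two toy stand-ins for the per-table data frames; the column names are hypothetical, but the left_join pattern mirrors a SQL LEFT JOIN done in R.

```r
library(dplyr)

# Toy stand-ins for two of the per-table data frames (real columns differ).
appts <- data.frame(PatientID = c(1, 2, 3),
                    ClinicID  = c(10, 20, 10))
patients <- data.frame(PatientID = c(1, 2, 3),
                       Age       = c(65, 42, 58))

# Keep every appointment row and attach the matching demographic columns.
d <- left_join(appts, patients, by = "PatientID")
```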
Now that the raw datasets are stored in separate data frames, the next step is to clean and combine them. This step also requires a lot of domain-specific knowledge. For example, we will remove all deceased patients, because they will no longer be scheduling appointments and would negatively affect the model if left in. We will also remove patients flagged as test patients in the database, as well as clinics whose no-show performance is not monitored. Next, we'll remove inpatient appointments and appointments with data entry errors. Lastly, even though appointments canceled by the facility after the appointment date/time count toward the no-show rate, we'll remove those because they are outside the patient's control.
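These cleaning rules reduce to a chain of dplyr filters. The sketch below uses a toy working data frame with hypothetical flag columns; the real column names and the clinic/error rules would differ, but the pattern is the same.

```r
library(dplyr)

# Toy working data frame; the flag column names are hypothetical.
d <- data.frame(PatientID   = 1:5,
                Deceased    = c(0, 1, 0, 0, 0),
                TestPatient = c(0, 0, 1, 0, 0),
                Inpatient   = c(0, 0, 0, 1, 0))

# Drop rows that would bias the model, mirroring the cleaning rules above.
d <- d %>%
  filter(Deceased == 0,
         TestPatient == 0,
         Inpatient == 0)
```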
We dropped 173,361 rows that would have otherwise negatively affected our model.
Next, we'll combine the homeless, subabuse, and lowincome data frames with our working data frame d. Because each patient may have multiple entries but we only need one instance per unique patient, we'll first remove duplicate patient ID records by calling subset(x, !duplicated(PatientID)) on each data frame. Then we'll flag the matching patient IDs in our working data frame.
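The deduplicate-then-flag step can be sketched as follows. The data frame names come from the text, but the contents and the Homeless flag column are made up for illustration.

```r
# Toy versions of the working and flag data frames (contents are made up).
d        <- data.frame(PatientID = c(1, 2, 3))
homeless <- data.frame(PatientID = c(2, 2))   # patient listed twice

# Keep one row per patient, as described in the text.
homeless <- subset(homeless, !duplicated(PatientID))

# Mark matching patients in the working data frame with a 0/1 flag.
d$Homeless <- as.integer(d$PatientID %in% homeless$PatientID)
```

The same pattern repeats for the subabuse and lowincome data frames, each producing one indicator column on d.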
To read the original article and R code, click here.