Building a COVID-19 Vulnerability Index

Summary: Since COVID-19 is occupying most of our thoughts these days, it seems appropriate to highlight where AI/ML is making a contribution to getting us out of our homes and back to work.

Since COVID-19 is occupying most of our thoughts these days, it seems appropriate to highlight where AI/ML is making a contribution to getting us out of our homes and back to work. I could find only three examples from the last few weeks.

Image Analysis of Lung CT Scans

Alibaba Group’s research and innovation institute, DAMO Academy, built an image analysis model that uses lung CT scans of hospitalized patients in respiratory distress to classify the disease as either COVID-19 or some other type, typically pneumonia from the flu. The reported accuracy was 96%.

Unfortunately, the patients were already in advanced stages of the disease, and CT scans are expensive and not available to everyone. On the AI/ML scale it’s a win, but not one that’s likely to accelerate the solution to our larger social and medical problem.

Predicting the Most Vulnerable Prior to Testing

COVID-19 is so new that there simply isn’t much real data to analyze. What we believe we see from anecdotal data is that it’s 6X to 10X more serious for some subsets of the population than for others. A week ago we thought that meant just older people, but recently we learned that the risk for some millennials is also quite high. Test kits are in short supply, and the false positive rates of some early kits mean that any sample drawn from testing data would be biased, so the problem for modeling is significant.

There is a bright spot, however. Over the last week two different groups have claimed to have developed fairly straightforward predictive models that can estimate, before testing, the likelihood that an individual will require hospitalization.

On March 19 the WSJ ran a short article about Persivia Inc., a Marlborough, Mass., software company, rolling out an AI/ML model that can predict which patients likely have COVID-19 before they are tested. The model is being integrated into Persivia’s existing healthcare platform called CareSpace.

It uses both structured data and unstructured text (processed with NLP) from the patient’s health record to identify those most likely to have COVID-19, even in the absence of some of the telltale symptoms, during the admissions screening process and before testing. No other detail about the model itself was revealed.

All of which makes the paper uploaded to arXiv.org on March 16, “Building a COVID-19 Vulnerability Index”, all the more interesting, since the authors reveal in some detail both their features and their model-building approach. The authors are all members of ClosedLoop.ai which, like Persivia, provides an AI-based healthcare platform to hospitals and other providers.

Both groups start from the assumption that the groups at high risk of hospitalization and potential respiratory complications are likely to be similar for COVID-19 and for other disease sources such as the flu. The Wuhan experience “suggests that the risk of death increases with age, and is also higher for those who have diabetes, disease, blood clotting problems, or have shown signs of sepsis”.

While age is no longer considered the sole risk factor, a medical history of certain underlying conditions, such as heart disease or those listed above, is highly correlated with severe outcomes and can be identified in the subjects’ healthcare records using the standard CCSR codes. Unfortunately, a simple check for prior disease or admissions isn’t a good predictor.

“More than 55% of Medicare beneficiaries meet at least one of the risk criteria listed by the CDC. People with the same chronic condition don’t have the same risk, and simple rules can fail to capture complex factors like frailty which makes people more vulnerable to severe infections.”

Lacking large-scale data that identifies COVID-19 patients, the ClosedLoop authors used a surrogate set of targets selected for “near-term risk of severe complications from respiratory infections (e.g. pneumonia, influenza). Specifically, 4 categories of diagnoses were chosen from the Clinical Classification Software Refined (CCSR) classification system:

RSP002 – Pneumonia (except that caused by tuberculosis)
RSP003 – Influenza
RSP005 – Acute bronchitis
RSP006 – Other specified upper respiratory infections”
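
The surrogate target described above could be derived from diagnosis records roughly as follows. This is a minimal sketch, not the authors' code: the record layout (`patient_id`, `ccsr_code`, `days_from_index`), the function name, the placeholder code `XXX000` and the 90-day window are all assumptions made for illustration. Only the four CCSR category codes come from the quoted passage.

```python
from typing import Dict, Iterable, NamedTuple

# The four CCSR categories quoted above.
TARGET_CCSR_CODES = {"RSP002", "RSP003", "RSP005", "RSP006"}


class Diagnosis(NamedTuple):
    """One diagnosis record (hypothetical layout, for illustration only)."""
    patient_id: str
    ccsr_code: str
    days_from_index: int  # days between the prediction date and the diagnosis


def label_patients(records: Iterable[Diagnosis], window_days: int = 90) -> Dict[str, int]:
    """Return {patient_id: 1 or 0}; 1 if any target diagnosis falls in the window.

    window_days is an assumed value; the article does not state the horizon.
    """
    labels: Dict[str, int] = {}
    for rec in records:
        in_window = 0 <= rec.days_from_index <= window_days
        hit = in_window and rec.ccsr_code in TARGET_CCSR_CODES
        # Keep a positive label once set; otherwise record the patient as 0.
        labels[rec.patient_id] = max(labels.get(rec.patient_id, 0), int(hit))
    return labels


records = [
    Diagnosis("p1", "RSP002", 12),   # pneumonia inside the window: positive
    Diagnosis("p2", "XXX000", 30),   # placeholder non-target category: negative
    Diagnosis("p3", "RSP003", 200),  # influenza, but outside the window: negative
]
labels = label_patients(records)
```

A binary label of this kind is what lets a standard classifier be trained on historical records even though no COVID-19 outcomes are available.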

From here the modeling process was fairly straightforward: the authors acquired an anonymized medical database that yielded a training and test set of approximately 1.9 million subjects. They defined 559 features, of which about 37 proved particularly predictive.

What sets the ClosedLoop effort apart, however, is that while they ultimately built a high-accuracy model using XGBoost, they also wanted to provide a model that could be used by those without access to sophisticated AI/ML platforms or knowledge.

To accomplish that goal they built a simple logistic regression model with greatly simplified input features, which any user with access to Excel could implement directly. Both the tree model and the regression model are open source on GitHub, as is a list of all 559 features originally considered. The results of both are quite useful in their accuracy.
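
To see why a model like this fits in a spreadsheet, note that scoring a logistic regression is just a weighted sum passed through the logistic function. The sketch below uses made-up feature names, weights and an intercept purely for illustration; the actual coefficients are in the authors' open-source release and are not reproduced here.

```python
import math

# Hypothetical coefficients for illustration only; NOT the published weights.
INTERCEPT = -3.0
WEIGHTS = {
    "age_75_plus": 0.8,
    "prior_pneumonia": 1.1,
    "diabetes": 0.4,
}


def risk_score(features: dict) -> float:
    """Logistic regression score: sigmoid(intercept + sum(weight * feature value))."""
    z = INTERCEPT + sum(w * features.get(name, 0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


low = risk_score({})  # no risk factors present
high = risk_score({"age_75_plus": 1, "prior_pneumonia": 1, "diabetes": 1})
```

In Excel the same calculation would be a SUMPRODUCT of the weight and feature columns plus the intercept, passed through 1/(1+EXP(-z)).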

Model Type                         ROC AUC
Logistic Regression                0.731
XGBoost Diagnosis History + Age    0.810
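
For readers less familiar with the metric in the table: ROC AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 is chance and 1.0 is a perfect ranking. The sketch below computes it directly from that definition on toy data. The labels and scores are invented and have nothing to do with the paper's results.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.4, 0.6, 0.2]
auc = roc_auc(toy_labels, toy_scores)  # 5 of the 6 positive/negative pairs are ordered correctly
```

The double loop is quadratic and is written for clarity only; on a data set of millions of subjects a rank-based implementation from a standard library would be the practical choice.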

Since the regression model is easy to interpret, here are the 25 scored features.

Let’s hope there are many more examples in the coming weeks where AI/ML can help resolve this extraordinary situation. Meanwhile, be safe, wash your hands and maintain your social distance.

Other articles by Bill Vorhies

About the author: Bill is Contributing Editor for Data Science Central. Bill is also President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001. His articles have been read more than 2.1 million times.
