
Training an AI Doctor 

By Tyler Schnoebelen, August 17, 2016

Some of the earliest applications of artificial intelligence in healthcare were in diagnosis—it was a major push in expert systems, for example, where you aim to build up a knowledge base that lets software be as good as a human clinician. Expert systems hit their peak in the late 1980s, but required a lot of knowledge to be encoded by people who had lots of other things to do. Hardware was also a problem for AI in the 1980s.

The promise of AI in diagnostics is that you can help people in locations where there aren’t enough doctors. Computers are not as creative as human pattern matchers, but that fact also means they can be more consistent than people. In addition to access and affordability, then, there’s the possibility that AI doctors could actually promote better outcomes than the ones with stethoscopes around their necks.

But how do you send a computer to medical school? And where do they go for their Continuing Medical Education credits?

Star Trek: Voyager’s Doctor is an artificially intelligent physician–is he coming soon?

Diagnosing pregnancy problems

Let’s start with an example of how statistical models could come to conclusions earlier than clinicians. Preeclampsia is a leading cause of death among pregnant women in the Western world and the main cause of fetal complications. 15% of first-time pregnancies involve women who have high blood pressure, and half of those end up with preeclampsia. The only treatment is to deliver the baby, even if that means a premature birth.

The problem is telling whether a patient has preeclampsia (or is developing it) or just has high blood pressure. If it’s not actually preeclampsia, you want to start anti-hypertensive treatment rather than deliver early. An example of the promise of personalized statistical healthcare is Velikova and Lucas (2014). If their models hold up beyond their small sample size, they’d have a system that diagnoses preeclampsia a median of 4 weeks earlier than human clinicians.
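To make the shape of such a system concrete, here is a minimal sketch in Python. It is not the Velikova and Lucas model (their work uses richer probabilistic models and real clinical data); the features (mean arterial pressure, proteinuria, gestational week), the synthetic labels, and the logistic regression are all stand-ins for illustration.

    # Minimal sketch: flag likely preeclampsia from a few per-visit features.
    # Features, labels, and model are illustrative, not the published approach.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    rng = np.random.default_rng(0)
    n = 500

    # Hypothetical features: mean arterial pressure (mmHg), proteinuria (g/24h),
    # gestational week at the visit.
    map_mmhg = rng.normal(95, 12, n)
    proteinuria = rng.gamma(1.5, 0.2, n)
    gest_week = rng.uniform(20, 40, n)
    X = np.column_stack([map_mmhg, proteinuria, gest_week])

    # Synthetic label: risk rises with pressure and proteinuria (made up).
    risk = 0.08 * (map_mmhg - 95) + 2.0 * proteinuria - 1.0
    y = (risk + rng.normal(0, 1, n) > 1.0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))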

In work like this, choosing the data carefully is important, since it’s easy to accidentally lump patients into the non-preeclampsia group simply because no preeclampsia was ever recorded for them. There are ways to be robust to noisy training data, but being able to say “the examples in Training Category A really belong there and the examples in Training Category B really belong there” is best. Similarly, teachers avoid peppering lectures to human medical students with errors, falsehoods, and noise.
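One concrete way to avoid that lumping is to trust only records where the diagnosis was explicitly confirmed or explicitly ruled out. The sketch below assumes a hypothetical patient table with a preeclampsia_dx column; the column name and values are made up for illustration.

    # Conservative labeling rule over a hypothetical patient table: keep only
    # records with an explicit diagnosis outcome, drop the ambiguous middle.
    import pandas as pd

    records = pd.DataFrame({
        "patient_id": [1, 2, 3, 4],
        "preeclampsia_dx": ["confirmed", "ruled_out", None, "confirmed"],
    })

    def to_label(dx):
        if dx == "confirmed":
            return 1
        if dx == "ruled_out":
            return 0
        return None  # never assessed: don't assume "non-preeclampsia"

    records["label"] = records["preeclampsia_dx"].map(to_label)
    training_set = records.dropna(subset=["label"])
    print(training_set)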

Getting agreement

A basic rule of thumb is that if you can’t get human beings to agree on what to call something, you’re going to have a hard time using machine learning to do it automatically. So an important part of the design of any machine learning project is piloting the project with people.
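A pilot also gives you something you can measure: agreement between annotators on the same items. Here is a small sketch using scikit-learn’s Cohen’s kappa; the labels are toy examples standing in for real pilot annotations.

    # Quick pilot check: how well do two annotators agree on the same items?
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["nucleus", "background", "background", "nucleus", "background", "background"]
    annotator_b = ["nucleus", "background", "nucleus", "nucleus", "background", "background"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # about 0.67 here; low values mean rethink the task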

Let’s take a look at a healthcare project done by researchers at Beth Israel using CrowdFlower—you can read all the details in their published paper here.

For a variety of conditions, pathologists need to identify sections of images that are problematic. This can help with diagnosis on its own, and it can also be used as training data for machine learning. In the image below, the crowd is being asked to draw circles around cell nuclei.

[Image: the annotation task, with contributors drawing circles around cell nuclei]

The following table compares research fellows and the crowd to expert pathologists. The takeaway is that the crowd can be pretty good at this task. Or rather, they are when the task is well designed.

[Table: research fellows and the crowd compared to expert pathologists]
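How do you turn circled nuclei into numbers like the ones in that table? One common scheme (not necessarily the paper’s exact protocol) is to count a crowd mark as a true positive when it lands within some distance of an unmatched expert mark, then compute precision and recall from the matches. The radius and coordinates below are made up.

    # Score point annotations against an expert "gold" set: a crowd mark is a
    # hit if it lands within `radius` pixels of an unmatched expert mark.
    import math

    def score_points(crowd, expert, radius=15.0):
        unmatched = list(expert)
        tp = 0
        for cx, cy in crowd:
            hit = None
            for i, (ex, ey) in enumerate(unmatched):
                if math.hypot(cx - ex, cy - ey) <= radius:
                    hit = i
                    break
            if hit is not None:
                unmatched.pop(hit)
                tp += 1
        precision = tp / len(crowd) if crowd else 0.0
        recall = tp / len(expert) if expert else 0.0
        return precision, recall

    crowd_marks = [(100, 102), (200, 198), (400, 400)]
    expert_marks = [(98, 100), (205, 200), (300, 300)]
    print(score_points(crowd_marks, expert_marks))  # roughly (0.67, 0.67)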

Designing a task for humans gets you data that you can train models from. Statistical models are accurate to the extent that the training data behind them is accurate and represents the problem space. One of the main ways to mess up a machine learning/data science problem is to make it really hard for humans to give you good classifications for your training data. Maybe the categories are ambiguous, or the descriptions are unclear or contradictory. Or maybe you’re just asking too much of someone’s brain at once.

Computers don’t get tired when you tell them to find nuclei in huge images, but humans do. Giving humans images that are 400×400 pixels rather than 800×800 pixels increases precision by 2.4 times and recall by 3.0 times. You basically can’t get good results if you use enormous images unless, perhaps, you have really motivated experts. Even then, it’s worth pointing out that pathologists themselves tend to have only moderate inter-annotator agreement, especially when their search space is an exhaustingly big image.
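One practical consequence is to cut big slides into small tiles before anyone sees them. The sketch below uses a synthetic NumPy array as a stand-in for a slide; real images would be loaded from files, and you would probably want overlapping tiles so nuclei sitting on tile borders aren’t lost.

    # Split one big image array into 400x400 tiles so each task stays small.
    import numpy as np

    def tile_image(img, tile=400):
        h, w = img.shape[:2]
        tiles = []
        for top in range(0, h, tile):
            for left in range(0, w, tile):
                tiles.append(((top, left), img[top:top + tile, left:left + tile]))
        return tiles

    big_image = np.zeros((1600, 2400, 3), dtype=np.uint8)  # stand-in for a slide
    tiles = tile_image(big_image)
    print(len(tiles), tiles[0][1].shape)  # 24 tiles, each (400, 400, 3)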

Measuring accuracy

You can also use deep neural networks to do this kind of detection, as you see in this work on mitosis detection (cells dividing). The approach in that paper gets an automatic system with a precision of 0.88 and a recall of 0.70. But again, you need to give the system some meaningful examples to learn from.
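The general shape of that approach is a classifier that looks at a small patch around each pixel (or candidate location) and predicts mitosis versus non-mitosis. The PyTorch sketch below is untrained and is not the architecture from the cited paper; the patch size and layer sizes are arbitrary.

    # Minimal patch classifier: small RGB patch in, mitosis/non-mitosis logits out.
    import torch
    import torch.nn as nn

    class PatchClassifier(nn.Module):
        def __init__(self, patch_size=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * (patch_size // 4) ** 2, 2),  # logits: non-mitosis, mitosis
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = PatchClassifier()
    patches = torch.randn(8, 3, 64, 64)  # a batch of 64x64 RGB patches
    print(model(patches).shape)          # torch.Size([8, 2])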

The training data was developed by expert pathologists who annotated big images. The data included 66,000 mitosis pixels and 151,000,000 non-mitosis pixels. In other words, 0.04% of the pixels show mitosis while the vast majority of the pixels are just “background” pixels that are trivial to get right.

What you’ve really got to figure out is how to distinguish mitosis from non-mitotic nuclei and other things that are plausibly confusable. If you were searching for zebras, horses might confuse you, but it’s unlikely that ducks or ambrosia salad would be befuddling. You have to measure accuracy carefully: guessing non-mitosis every time would make you 99.96% correct, and clinically useless or even dangerous.
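It’s worth doing the arithmetic on that baseline. With the class balance from the training data above, a model that never predicts mitosis looks nearly perfect by accuracy and catches nothing by recall:

    # The "always guess non-mitosis" baseline, using the counts from the text:
    # 66,000 mitosis pixels vs. 151,000,000 background pixels.
    n_mitosis = 66_000
    n_background = 151_000_000

    accuracy = n_background / (n_mitosis + n_background)  # every pixel called background
    recall = 0 / n_mitosis                                # it never finds a single mitosis
    print(f"accuracy: {accuracy:.4%}, recall: {recall:.0%}")
    # accuracy: 99.9563%, recall: 0%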

In other words, part of training someone is evaluating them—you don’t certify healthcare workers by giving them points for spelling their names correctly. Nor do you give them a series of problems entirely designed to stump House.

Simple interventions can mean a lot

Doctors in Duluth know that when someone has been traveling in the tropics, their headache could mean something quite different than it would for most of their patients. Most general practitioners or ER doctors there will fall back on the fundamental skill of asking for help, and that’s where research and referrals come into play.

An “AI doctor” doesn’t have to know everything about everything, and thinking of these systems as assistants is more realistic. In most parts of the world, relatively minor interventions can make a difference. Along these lines, consider chatbots.

It is fairly straightforward to build chatbots that send reminders, ask (and record) basic information, and answer common questions. For example, the US government tracks veterans who are receiving chemotherapy and recovering at home. Particular inputs to an automated system can trigger human support systems to get more involved (there’s a sketch of that kind of escalation rule at the end of this post). This can be a useful way of generating training data, but that training data is a byproduct of the real goal, which is to help individuals, caregivers, and healthcare providers give the best care possible. AI doctors are unlikely to replace human doctors anytime soon–neither the technology nor society is ready for that. But trained on relevant data and assessed with meaningful measures, AI doctors will help us extend care.
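As a closing footnote on how simple these interventions can be, here is a minimal sketch of that kind of escalation rule. The symptom keywords and pain threshold are made up for illustration, not clinical guidance; a real deployment would sit on top of a proper triage protocol.

    # A check-in bot records routine answers and flags a human when an answer
    # crosses a line. Keywords and thresholds are illustrative only.
    URGENT_SYMPTOMS = {"chest pain", "uncontrolled bleeding", "fever over 101"}

    def handle_checkin(answers):
        """answers: dict like {"pain_level": 7, "symptoms": ["nausea"]}."""
        if answers.get("pain_level", 0) >= 8:
            return "escalate: page the on-call nurse"
        if URGENT_SYMPTOMS & {s.lower() for s in answers.get("symptoms", [])}:
            return "escalate: page the on-call nurse"
        return "log answers and send tomorrow's reminder"

    print(handle_checkin({"pain_level": 4, "symptoms": ["nausea"]}))  # routine
    print(handle_checkin({"pain_level": 9, "symptoms": []}))          # escalate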

***