Can an Algorithm Diagnose Pneumonia?

Stanford researchers claim they can detect the lung infection more accurately than an experienced radiologist. Some radiologists aren’t so sure.

Stanford radiologist Matthew Lungren, left, meets with graduate students Jeremy Irvin and Pranav Rajpurkar to discuss the results of detections made by the algorithm. L.A. Cicero

Pneumonia puts a million adult Americans in the hospital each year and kills 50,000. If a doctor suspects a patient has pneumonia, he or she will generally order a chest X-ray. These X-rays must be interpreted by a doctor, of course. But now, Stanford researchers have developed an algorithm they say can diagnose pneumonia on X-rays better than experienced radiologists.

“The advantage an algorithm has is that it can learn from hundreds of thousands of chest X-rays and their corresponding diagnoses from other experts,” says Pranav Rajpurkar, a graduate student in the Stanford Machine Learning Group, who co-led the research. “When do radiologists ever get a chance to learn from hundreds of thousands of other radiologists' diagnoses and find patterns in the images leading to those diagnoses?”

The algorithm, called CheXNet, can also diagnose 13 other medical conditions, including emphysema and pneumothorax (air trapped between the lung and chest wall). The team built the algorithm using a public dataset from the National Institutes of Health (NIH), which contained more than 100,000 chest X-ray images labeled with 14 possible conditions. The dataset was released along with an initial diagnosis algorithm, which NIH encouraged other researchers to advance.

Rajpurkar and his fellow Machine Learning Group members decided to take on the challenge. The researchers had four Stanford radiologists mark possible indications of pneumonia on 420 of the images. Using this data, within a week they created an algorithm that could accurately diagnose 10 conditions. Within a month the algorithm could outperform previous algorithms at diagnosing all 14 conditions. At that point, CheXNet's diagnoses agreed with the radiologists' majority opinion more often than any individual radiologist's diagnoses did.
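To make the comparison with a majority opinion concrete, here is a toy sketch in Python. It is not the study's evaluation protocol: the labels are made up, the three-reader panel is an assumption chosen to avoid ties, and the function names `majority` and `agreement_rate` are hypothetical.

```python
def majority(votes):
    """Return 1 if more than half of the 0/1 votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))


def agreement_rate(predictions, panel_votes):
    """Fraction of images where a prediction matches the panel's majority label."""
    matches = sum(p == majority(v) for p, v in zip(predictions, panel_votes))
    return matches / len(predictions)


# Made-up labels for five images: 1 = pneumonia, 0 = no pneumonia.
# Each inner list holds the votes of a hypothetical three-reader panel.
panel_votes = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
algorithm = [1, 0, 1, 0, 1]       # illustrative algorithm output
single_reader = [1, 0, 0, 1, 1]   # illustrative lone radiologist

algorithm_rate = agreement_rate(algorithm, panel_votes)
reader_rate = agreement_rate(single_reader, panel_votes)
print(f"algorithm: {algorithm_rate:.2f}, single reader: {reader_rate:.2f}")
```

In this invented example the algorithm matches the panel on every image while the lone reader matches on three of five. The numbers say nothing about CheXNet itself; they only show what "agreeing with the majority more often" means.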

The research was published this month on the scientific preprint website arXiv.

Other diagnostic algorithms have made the news recently. Canadian and Italian teams have both developed algorithms for diagnosing Alzheimer's disease from brain scans. The distribution of the plaques in the brain that characterize the disease is too subtle for the naked eye, but the researchers say AI technology can detect abnormal patterns. Rajpurkar and his fellow researchers at Stanford's Machine Learning Group have also developed an algorithm for diagnosing heart arrhythmias, analyzing hours of data from wearable heart monitors. Other pneumonia algorithms have been developed from the NIH data, but the Stanford one is so far the most accurate.

CheXNet could be especially helpful in places where people don’t have easy access to experienced radiologists, the team says. It could also be useful as a sort of triage, identifying which cases likely need emergency attention and which do not. The team also developed a tool that produces a map of potential pneumonia indicators on X-rays, giving a handy visual guide for doctors.

While the team is optimistic about CheXNet’s diagnostic abilities, they’re cautious about its limits.

“AI is a powerful tool, but it takes years of experience and many tough hours to intuit how to wield it, and it's just as hard to determine where we can use it for most positive impact,” Rajpurkar says. 

While there are a number of deep learning algorithms in development, none have yet gone through the rigorous testing and approval process necessary for use on real patients. 

Paul Chang, a radiology professor and vice chairman of the department of radiology at the University of Chicago, sounds a skeptical note about CheXNet and similar deep learning programs. Physicians already use algorithms to aid in diagnosis of any number of conditions, Chang says. These algorithms rely on a preformed model of what the condition looks like: cancers are larger and spikier than benign masses, for example. Deep learning programs, in contrast, are meant to figure out what features are significant on their own, by crunching enormous amounts of data. But this also means that they can take the wrong cues. Chang gives the example of a deep learning algorithm that learned to distinguish various types of X-rays: hands, feet, mammograms. But researchers discovered that the program had simply learned to recognize mammograms by the fact that the main image was on the side of the film rather than in the center. (Because breasts are attached to the chest wall, they appear on the edge of the film in a mammogram image; hands or feet, in contrast, appear in the center of the X-ray.) The algorithm wasn’t learning anything significant about breasts, just about their position on the screen.

“This is very early times,” says Chang, who points out that the CheXNet results have not been peer-reviewed. “Deep learning has great potential, but we in medicine and in radiology tend to be early in the hype cycle, but it takes us longer to adopt. We will learn how to appropriately consume it.”
