Study reveals AI models that analyse medical images can be biased

Research has found that AI models capable of predicting a patient’s race, gender, and age from medical images may take demographic shortcuts when making medical diagnoses.

AI models often play a role in medical diagnoses, especially when analysing medical images such as X-rays.

However, studies have found that these models don’t always perform well across all demographic groups, usually faring worse on women and people of colour.

In 2022, MIT researchers reported that AI models can make accurate predictions about a patient’s race from their chest X-rays — something that even the most skilled radiologists cannot do.

That research team has now found that the models that are most accurate at making demographic predictions also show the biggest ‘fairness gaps’ — that is, discrepancies in their ability to accurately diagnose images of people of different races or genders.

The findings suggest that these models may be using ‘demographic shortcuts’ when making their diagnostic evaluations, which can lead to less accurate results for certain groups.
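To make that relationship concrete, one way to check for such a pattern is to compare, model by model, how well each one predicts a demographic attribute against the size of its diagnostic fairness gap. The sketch below is purely illustrative: the numbers are placeholders rather than the study’s results, and the rank correlation is just one simple way to express the trend.

```python
# Illustrative only: placeholder values, not results from the MIT study.
# Each position corresponds to one hypothetical model.
from scipy.stats import spearmanr

demographic_auroc = [0.72, 0.81, 0.88, 0.93]  # how well each model predicts, e.g., race
diagnostic_gap    = [0.02, 0.05, 0.04, 0.09]  # each model's fairness gap in diagnosis

rho, p = spearmanr(demographic_auroc, diagnostic_gap)
print(f"Spearman correlation: rho={rho:.2f}, p={p:.3f}")
```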

Removing bias from AI models

As of May 2024, the FDA has approved 882 AI-enabled medical devices, 671 of which are designed for radiology.

“Many popular machine learning models have superhuman demographic prediction capacity — radiologists cannot detect self-reported race from a chest X-ray,” explained Marzyeh Ghassemi, the study’s senior author.

“These are models that are good at predicting disease but, during training, are learning to predict other things that may not be desirable.”

In this study, the researchers explored why these models don’t work as well for certain groups. In particular, they wanted to determine whether the models were using demographic shortcuts that made their predictions from medical images less accurate for some groups.

Using publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center (BIDMC) in Boston, the researchers trained AI models to predict whether patients had one of three different medical conditions: fluid buildup in the lungs, collapsed lung, or enlargement of the heart.

Then, they tested the models on X-rays that were held out from the training data.

Overall, the models performed well, but most of them displayed “fairness gaps” — that is, discrepancies between accuracy rates for men and women, and for white and Black patients.
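As an illustration of what measuring such a gap can look like in practice, the sketch below computes a per-group AUROC on a held-out test set and reports the largest difference between subgroups. The dataframe and its column names (‘label’, ‘score’, ‘sex’, ‘race’) are assumptions for the example, not the study’s actual code or data schema.

```python
# Minimal sketch of a per-group evaluation; assumes each subgroup contains
# both positive and negative cases so AUROC is defined.
import pandas as pd
from sklearn.metrics import roc_auc_score

def fairness_gap(df: pd.DataFrame, group_col: str) -> float:
    """Largest difference in AUROC between any two subgroups."""
    aucs = {
        group: roc_auc_score(sub["label"], sub["score"])
        for group, sub in df.groupby(group_col)
    }
    return max(aucs.values()) - min(aucs.values())

# df holds held-out test predictions: true labels, model scores, demographics.
# print(fairness_gap(df, "sex"), fairness_gap(df, "race"))
```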

Debiased models are not always fairer

To close these gaps, the researchers then retrained the models using debiasing techniques. However, those approaches only worked when the models were tested on data from the same types of patients that they were trained on.

When the researchers tested the ‘debiased’ models, which had been trained on BIDMC data, on patients from five other hospital datasets, they found that the models’ overall accuracy remained high, but some of them exhibited large fairness gaps.
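The sketch below shows what such an out-of-distribution check might look like: take one model trained and ‘debiased’ on BIDMC data, score test sets from several external hospitals, and report the fairness gap on each. The site names and the loading helper are hypothetical stand-ins, and fairness_gap is the helper from the earlier sketch.

```python
# Hypothetical external-site evaluation loop; load_scored_test_set is a
# stand-in for whatever returns labels, model scores, and demographics.
from sklearn.metrics import roc_auc_score

external_sites = ["site_a", "site_b", "site_c", "site_d", "site_e"]

for site in external_sites:
    df = load_scored_test_set(site)  # hypothetical loader
    print(
        site,
        f"overall AUROC={roc_auc_score(df['label'], df['score']):.3f}",
        f"sex gap={fairness_gap(df, 'sex'):.3f}",
        f"race gap={fairness_gap(df, 'race'):.3f}",
    )
```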

This is worrying because, in many cases, hospitals use models developed using data from other hospitals, especially when an off-the-shelf model is purchased.

Ghassemi explained: “We found that even state-of-the-art AI models which are optimally performant in data similar to their training sets are not optimal in novel settings.”

“Unfortunately, this is actually how a model is likely to be deployed. Most models are trained and validated with data from one hospital, or one source, and then deployed widely.”
