NEW YORK – More and more, artificial intelligence- and machine learning (AI/ML)-enabled medical devices are being adopted to assess disease risk and to diagnose and manage chronic and acute illness. By gleaning complex patterns from vast datasets, these approaches promise to augment clinician and technician expertise, bringing high-quality, personalized care to places like tiny mountain towns and cash-strapped urban clinics.
But even as AI/ML promises to deliver a utopia of health equity, racial bias permeates medical research and healthcare, and the datasets fed to algorithms are no exception. Experts caution that, without continuous vigilance from clinicians and patients, AI/ML-based devices are likely to "learn" to amplify lurking biases and exacerbate existing inequities rather than alleviate them.
Latrice Landry, a translational genomicist at the University of Pennsylvania's Center for Applied Genomics, said that this is especially true "as we try to implement precision medicine in an evidence-based way" using AI/ML approaches.
Along with a growing number of physicians and scientists across specialties and disciplines, Landry — who is also a fellow at the National Institutes of Health's Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity (AIM-AHEAD) program — is scrutinizing AI/ML-enabled healthcare datasets and algorithms, hoping to purge them of bias.
"The problem is big," she said, and the urgency to inspect and correct biases is mounting alongside the astronomical acceleration of AI/ML-based healthcare.
Douglas Flora, executive medical director of oncology services at St. Elizabeth Healthcare in Northern Kentucky, said that even the "super experts" in generative AI can only date their expertise to the tool's invention seven years ago.
Flora, who is also the founder and editor-in-chief of a new journal called AI in Precision Oncology, believes that this technology can bring more equity in care. Considering his own experience as a patient battling kidney cancer, he is also hopeful that AI/ML-enabled medicine will save clinicians time on routine tasks so that they can instead invest their energy into creating more compassionate institutions.
Melissa Wong, the director of informatics and artificial intelligence strategies for the Ob/Gyn department at Cedars-Sinai Medical Center in Los Angeles, said that more stakeholders must also step up to the plate.
As a clinician with bioinformatics expertise who also develops AI/ML-enabled algorithms, Wong volunteered to craft her department's approach, in part because she saw that clinicians were often the last to the table in informatics discussions.
While "there's tremendous potential to do good" using AI approaches in medicine, she also warned that if clinicians do not help root out AI/ML algorithm bias, "the mistakes of our past are going to become the recommendations of our future."
This is because biases impacting diagnosis and care can creep in at any stage of healthcare interaction and in all phases of the so-called Total Product Lifecycle of medical device development, such that the AI/ML approaches used in assay design are as vulnerable to bias as those used in data interpretation.
Genomics-based diagnostic and prognostic testing — including pharmacogenomics, polygenic risk scores, epigenomics, and precision medicine — may be particularly vulnerable to AI/ML racial bias due to its foundation on unrepresentative databases.
Within in vitro diagnostics, AI/ML is primarily being used in digital pathology, whose tools are increasingly being adopted despite concerns about cost, reimbursement, and the potential for biases in training data to perpetuate inequities.
In all these applications, experts say equitable outcomes require a diversity of patients to be included, from the earliest stages of development through postmarket assessments and beyond.
Specific harms
Bias causes inaccuracies in medical devices that skew patient outcomes. According to a recent UK report on the subject, myriad tiny misalignments can also accumulate and amplify each other.
The report highlights the case of Yvonne, a 60-year-old Black British woman of Caribbean heritage with COVID.
After a delay caused by falsely high pulse oximeter readings, she was finally given oxygen. The mask was too large to function well on a small female face, and she ultimately required ventilation, but the ventilator's default settings were calibrated for men and had to be adjusted lest the device permanently damage her lung tissue. Later, a so-called racial correction commonly applied in estimating glomerular filtration rate resulted in her kidney injury going unnoticed for days, ultimately requiring a prolonged stay in the intensive care unit.
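The "racial correction" at issue is a coefficient built into widely used kidney-function equations. As one illustration (the UK report does not say which equation was applied in Yvonne's case), the 2009 CKD-EPI creatinine formula, in broad use before race-free versions began replacing it in 2021, takes roughly this form:

$$\mathrm{eGFR} = 141 \times \min\!\left(\tfrac{\mathrm{SCr}}{\kappa}, 1\right)^{\alpha} \times \max\!\left(\tfrac{\mathrm{SCr}}{\kappa}, 1\right)^{-1.209} \times 0.993^{\mathrm{Age}} \times 1.018\ [\text{if female}] \times 1.159\ [\text{if Black}]$$

where SCr is serum creatinine and κ and α are sex-specific constants. The fixed 1.159 multiplier inflates the estimate for patients recorded as Black, so a genuinely deteriorating filtration rate can still read as adequate.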
Although the UK panel concluded that biases in medical devices were by and large unintentional — or even sometimes well-intentioned but misguided — this "unintentionality" means the problems need to be addressed holistically, across the entire healthcare ecosystem.
Similarly, bias inherent in pharmacogenomic testing negatively impacts care for some patients.
The state of Hawaii was awarded $916 million this year in a lawsuit that asserted the makers of Plavix (clopidogrel) didn't do enough to get the word out that *2/*3 variants of CYP2C19 rendered the drug ineffective.
The variant is thought to be more prevalent among people with Asian ancestry, and for a fourth-generation fisherman named Rudy, this bias led to his untimely death. As described in an editorial in Personalized Medicine, despite diligently taking clopidogrel every day for three months after undergoing angioplasty and stent surgery, Rudy died of a heart attack on the way to the hospital. His autopsy revealed complete arterial blockage at the stent site, and post-mortem genetic testing uncovered his *2/*3 variant status.
A clinician's role
Despite Flora's techno-optimism, he is also vigilant about bias. He is currently spearheading the adoption of AI/ML-based imaging interpretation at St. Elizabeth's to bring high-quality lung cancer screening to patients in remote parts of Appalachia.
"We are responsible for this tool, just like we're responsible for a stethoscope or a CT scanner or an MRI," he said in an interview.
Tools augment a clinician's decision-making, but they should not make the decisions. "It's really critical that human beings remain at the wheel and that, at every step of this process, we review every piece of work," he said.
Flora said he started his journal to drive peer-reviewed trials and research into AI/ML-enabled medicine, "so that oncologists and decision makers at hospitals and practices could trust that this had been vetted, versus [relying on] a vendor sales pitch from industry."
Vigilance means asking developers, "which patient population did you survey?" he said.
Although it seems like a reasonable question, in practice AI/ML training data is often siloed intellectual property. Flora asks authors to describe the makeup of training sets in AI in Precision Oncology manuscript submissions, but a lot of teams are not yet ready or willing to share, he said.
"As we've gotten more people talking about ethics and regulation, and as organizations like the American Society of Clinical Oncology and the American Medical Association have weighed in on AI/ML, we're starting to establish at least some loose guardrails for what's appropriate and what's not appropriate," he said.
Clinicians also need to hold themselves accountable for understanding the potential for bias, Flora said.
And, in the case of AI/ML-based patient support tools at least, nondiscrimination is now also mandated under the Affordable Care Act.
Specifically, providers covered under the ACA must make reasonable efforts to find out whether the AI/ML tools they use incorporate protected features like race or sex, and they must exercise due diligence when acquiring and using AI/ML-enabled tools to mitigate the risk of discrimination.
Built on bias
As noted in a report from the National Human Genome Research Institute in May, lack of diversity in genomics research can lead to disparities that have both scientific and clinical consequences.
For instance, in addition to disparate access to microsatellite instability genomic testing, differential biomarker expression can also cause disparities in colorectal cancer care for people with African ancestry.
In another example, genetic tests to determine the risk of developing a type of heart disease called hypertrophic cardiomyopathy misclassify benign mutations as pathogenic for patients with African or unspecified ancestry, according to one study. The authors developed a model and found that "the inclusion of even small numbers of Black Americans in control cohorts probably would have prevented these misclassifications."
Landry and her colleagues have also shown that hypertrophic cardiomyopathy genetic tests have lower clinical yields for underrepresented minorities.
Bias lurking in commonly used medical devices also harms patients and will potentially harm them further if it is incorporated into AI/ML training data.
For example, a racial bias in finger-clip pulse oximeter readings was first reported in 1990 but was ignored for decades.
These devices were developed and calibrated on light skin and can give falsely high readings for people with darker skin, as was the case for Yvonne in the UK. According to a study by one California health system, oximeter inequity resulted in an average delay of nearly five hours before Black patients with COVID got necessary oxygen treatment.
Lack of diversity also affects the imaging datasets used to train some dermatology and oncology AI/ML tools to diagnose conditions such as Lyme disease and skin cancer. Melanoma may be rarer in darker skin, but the American Academy of Dermatology points out that the skin cancer death rate among African Americans is disproportionately high, in part due to late diagnoses.
In public dermatology datasets, ethnicity information is available for only approximately 1 percent of the images, according to a study in the Lancet. This known bias likely skews AI/ML training, and it seems to persist even in more recent efforts: among one recent sample of images from nearly 800 patients, 57 were Asian, 38 were Hispanic, and only seven were African American.
Another group recently reported that it developed a way to create AI-generated images of darker skin as a means of improving the diversity of the databases used to train AI/ML-enabled models.
Setting the standard
Even going through regulatory review doesn't necessarily eliminate all bias.
A handheld AI/ML-based melanoma screening device called DermaSensor was cleared in January, and the manufacturers claimed to have used a diverse training dataset of lesion images from company-sponsored and independent investigator clinical studies.
But, other than a single patient who self-identified as Black, all the clinical trial participants were white, with 93 percent of the 394 patients deemed to be a Fitzpatrick skin type of III or less. The US Food and Drug Administration's authorization letter stipulated postmarket clinical validation in patients "from demographic groups representative of the US population," explicitly including populations with the darker Fitzpatrick skin phototypes IV, V, and VI.
The FDA cleared an AI/ML-enabled cardiology device in June called the Tempus ECG-AF that helps identify patients at increased risk of atrial fibrillation or flutter. Among approximately 450,000 patients in the training and model tuning datasets, 96 percent were white, according to the authorization letter, while the racial distribution of the 4,017 patients in the clinical study population was 82 percent white, 11 percent Black, 3 percent Asian, and 5 percent designated as other or unknown.
"When we have limited diversity in training, we obtain other independent datasets for validation that contain sufficient diversity and include underrepresented populations," Brandon Fornwalt, senior VP of cardiology at Tempus, noted in an email, adding the company worked with the FDA to align on this approach to "represent the demographic diversity of the intended use population."
For the ECG-AF device, Fornwalt said Tempus required a longitudinal dataset and chose one from a community hospital system in Pennsylvania that had electronic health records for patients dating back 30 years, with the additional advantage of including rural residents who are often underrepresented.
Tempus' sequencing services have also allowed it to create "one of the most diverse datasets in oncology" for model building and training, Fornwalt said. Separately, Tempus researchers recently described a method to impute race from tumor samples as another potential way to increase dataset diversity.
Within digital pathology, Paige was granted de novo marketing authorization by the FDA in 2021 for an AI/ML-enabled prostate cancer test.
The authorization letter shows that roughly 82 percent of the data used to build the model came from white patients, with 8 percent from Black patients and 3 percent from patients with East Asian or Indian subcontinent ancestry. The clinical trial had similar proportions, and in the category designated Native Hawaiian and Pacific Islander, no patient data was used to build the model and only one patient was included in the trial.
These proportions are actually significantly better than those of a typical prostate cancer clinical trial, which on average is 96 percent white, according to a 2020 study. But they contrast with the fact that Black men have a 65 percent higher risk of developing prostate cancer and more than double the risk of dying from it compared to white men, according to a report from the American Association for Cancer Research. Indeed, Black men in the US had the highest overall age-adjusted prostate cancer mortality in the world in 2019, a fact at least partially related to being underrepresented in clinical trials.
Still, efforts to correct for the historic lack of diversity in genomics, pharmacogenomics, and personalized medicine datasets are yielding positive results for breast cancer, polygenic risk scores, and other conditions and variants. There have even been recent calls for the NIH, or the FDA itself, to curate representative and diverse datasets specifically for AI/ML training.
And, an executive order from the White House last October created a Department of Health and Human Services task force mandated to incorporate equity principles in the AI-enabled technologies used in health.
In the order, President Joe Biden also highlighted NIH's AIM-AHEAD program and admonished HHS to accelerate grants to this consortium and showcase it in underserved communities.
From inhuman to nonhuman
"Of all the forms of inequality, injustice in health is the most shocking and the most inhuman because it often results in physical death," Martin Luther King Jr. said in 1966. According to a contemporary newspaper report, King was urging creative nonviolence against a "conspiracy of inaction" at the American Medical Association, which he said enabled racial bias in medicine to persist. Almost 60 years later, the bias continues to persist, and a nonhuman intelligence is being brought to bear.
But simply deleting race information from data used to train AI/ML devices doesn't always remove bias.
For example, an AI/ML-enabled clinical tool was developed by Optum and implemented for 200 million US patients. Meant to determine who was at highest health risk and thus eligible for referrals to specialty services, the tool had been trained on records that excluded race but included each patient's lifetime cost of treatment.
Because the US health system spends more on white patients on average, cost turned out to be a proxy for race. Researchers calculated that, absent this bias, 47 percent of Black patients should have been cleared to get extra care, while in practice it was offered to approximately 18 percent.
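The mechanism is straightforward to reproduce. The Python sketch below is not the Optum algorithm; it is a toy illustration that assumes a stylized population in which Black and white patients are equally sick but the system historically spends about 30 percent less on Black patients, then shows how a "race-blind" model trained to predict cost under-refers equally sick Black patients.

```python
# Illustrative sketch only -- not the actual Optum model. All numbers and
# group labels are hypothetical, chosen to show how cost becomes a proxy for race.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

group = rng.integers(0, 2, n)                     # 0 = white, 1 = Black (hypothetical labels)
need = rng.poisson(3.0, n)                        # true illness burden, same distribution for both groups
spend_rate = np.where(group == 1, 700.0, 1000.0)  # historical spending gap per unit of need

prior_cost = need * spend_rate + rng.normal(0, 400, n)   # model feature (race itself is excluded)
future_cost = need * spend_rate + rng.normal(0, 400, n)  # training label: next year's cost

# "Race-blind" model: predict future cost from prior cost with a simple regression.
slope, intercept = np.polyfit(prior_cost, future_cost, 1)
risk_score = slope * prior_cost + intercept

# Refer the top 3% of risk scores for extra care (a cutoff similar in spirit to the deployed tool's).
referred = risk_score >= np.quantile(risk_score, 0.97)

# Compare groups among the sickest 3% of patients by true need.
sickest = need >= np.quantile(need, 0.97)
for g, name in [(0, "white"), (1, "Black")]:
    mask = (group == g) & sickest
    print(f"{name}: referral rate among sickest patients = {referred[mask].mean():.1%}, "
          f"mean illness burden of those referred = {need[(group == g) & referred].mean():.2f}")
```

In this setup, equally sick Black patients are referred far less often, and the Black patients who do cross the threshold are sicker on average than their white counterparts, which mirrors what the researchers reported for the deployed tool.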
This tool potentially impacted healthcare for millions of Black people, but scrutinizing it was only possible because researchers had rare access to the underlying datasets.
In the case of the Cancer Genome Atlas, researchers have also found that de-identified digital pathology images had enough visual clues for AI/ML models to group them by lab. The AI/ML then extrapolated from site-specific information to race, leading to bias.
Similarly, AI/ML has been found to detect features in de-identified chest X-rays that no human can see, allowing it to accurately predict patients' race.
Paved with good intentions
Including information on race is also sometimes helpful when developing AI/ML-enabled approaches.
For example, one group found that factors related to race impact the amount of pain a patient experiences from knee osteoarthritis. Reports of pain among Black patients had historically been more likely to be dismissed if the radiographs didn't show severe degeneration. But AI/ML trained on a racially and socioeconomically diverse dataset of images was able to "see" something in the X-rays and more accurately predict patients' own pain experiences.
On the other hand, diagnostic and practice algorithms that adjust or "correct" their outputs based on a patient's race can also be dangerous, according to a New England Journal of Medicine article.
Authors identified 13 clinical tools across eight medical subfields in which race was "hidden in plain sight" in risk prediction algorithms, and suggested these may perpetuate and amplify race-based health inequities.
For example, an obstetrician treating a woman who previously delivered a baby via cesarean section would likely use a vaginal birth after cesarean, or VBAC, calculator to estimate her likelihood of success. This algorithm, however, originally included race and ethnicity as factors that lowered the predicted likelihood of success for Black and Hispanic women, resulting in biased treatment.
Wong and her colleagues are using AI/ML approaches to improve upon the VBAC calculator. Since the unbiased ground truth on risk and race is not known, her team trained the AI/ML on data from the top 30 percent of physicians who have the most successful outcomes across all races.
"It is the equivalent of an angel sitting on your shoulder, or an experienced friend encouraging you to keep going," Wong said.
Racial disparities in diagnosing and managing preeclampsia and severe bleeding contribute to a threefold higher risk of maternal death for Black women. Here too, the team has developed AI/ML-based approaches to eliminate differences in prescribing aspirin for preeclampsia risk and to use real-time data to gauge bleeding risk during delivery.
Unfortunately, improving equity in one aspect of care doesn't necessarily help with other areas.
For example, while blood-based testing promises to catch preeclampsia cases earlier, differential access — related to capital expense and batch processing that favors using these tests at large medical centers — could impact postmarket assessment, adding to bias.
"My excitement around AI is really being able to rapidly democratize care," Wong said. "My fear is that it will do so in a disparate way, where the haves just get more."
This is why, to Landry, scrutinizing AI/ML for bias has no endpoint, and there should be no expectations of a single, one-time fix. Instead, monitoring and evaluation programs must be put in place, and maintained in perpetuity.
With commercial AI/ML-enabled products, however, Wong suggested proprietary data might make this particularly challenging. "The VBAC calculator in its original form was racist, but at least you could tell it was racist," she said, while "algorithms have the potential to be racist under the hood, in ways that we can't unpack."
Including the patients
Considering patient desires from the very beginning of AI/ML development is critical to overcoming centuries of imbalanced data collection.
Patients are seldom heard or engaged, the authors of the UK review of racial bias in healthcare said, and prior efforts "have often been tokenistic and have not supported people to participate in a co-design process through the device lifecycle."
Wong said including diverse patient stakeholders can correct for the "research tourism" of the past and potentially heal the "earned medical mistrust."
Centers at academic institutions like Tulane University and the Morehouse School of Medicine, among others, are now attempting to involve diverse communities in all stages of AI/ML development using community-based participatory research, or CBPR.
But authentically working with patient-stakeholders from the outset — as has also been advocated by the Agency for Healthcare Research and Quality (AHRQ) and the National Institute on Minority Health and Health Disparities — requires acknowledging that participating in research is work.
In CBPR, patients are compensated fairly, credited with authorship, and shown respect by being entrusted to present data publicly.
All this takes time, money, and a mindset that values the community over the individual.
The work is not without reward, however, as Wong said patients' insights are often contrary to established beliefs among researchers.
For example, where clinicians tend to focus on increasing education and access, patients see paths to equity in things like building awareness and trust.
Similarly, attendees of AIM-AHEAD listening sessions advocated for storytelling as a powerful tool to communicate the impact of AI/ML in promoting health equity.
CBPR offers communities a meaningful and valued seat at the table in this long-term endeavor, and this will in turn reduce bias in AI/ML-enabled devices.

Though it may seem daunting, science has a long history of achieving previously inconceivable feats of complexity and ambition, Landry said.
"We sequenced the human genome — of course we can engage communities in research and AI development," she said.