Artificial intelligence promises to revolutionize healthcare by improving diagnostics, streamlining administrative processes, and personalizing treatment plans. A 2024 Harvard Medical School article, "The Benefits of the Latest AI Technologies for Patients and Clinicians," highlighted several advantages of AI in healthcare: its ability to help clinicians better interpret imaging results, its potential to assist healthcare organizations in improving quality and safety, and its capacity to aid in the diagnosis and treatment of rare diseases. However, beneath this promise lies the possibility that AI can create new healthcare disparities and worsen existing ones through algorithmic bias.
According to Boston University research titled "AI In Healthcare: Counteracting Algorithmic Bias," algorithmic bias can be defined as "inequality of algorithmic outcomes between two groups of different morally relevant reference classes such as gender, race, or ethnicity. Algorithmic bias occurs when the outcome of the algorithm's decision-making treats one group better or worse without good cause." This definition captures the central ethical concern with deploying potentially biased AI in healthcare.
When AI systems make biased recommendations, they can directly impact patient care, leading to misdiagnosis, inappropriate treatments, or denied access to necessary interventions for marginalized populations. What makes this troubling is that these biases often operate invisibly, masked by the perceived objectivity of technology.
As healthcare increasingly relies on AI to inform critical decisions, understanding and addressing algorithmic bias becomes not just a technical challenge but an ethical imperative. Today, we explore real-world examples of healthcare AI bias and outline approaches to creating more equitable AI systems that serve all patients fairly.
A widely used commercial algorithm affecting millions of patients demonstrated racial bias in identifying patients for high-risk care management programs. The algorithm, developed by Optum and used by U.S. insurers and hospitals, was designed to predict healthcare costs rather than actual illness severity. Since less money has historically been spent on Black patients with similar conditions, the algorithm systematically underestimated their care needs.
As reported in a 2019 Science study led by Dr. Ziad Obermeyer of the University of California, Berkeley, the software regularly recommended healthier white patients for healthcare risk management programs ahead of sicker Black patients simply because those white patients were projected to be more costly. At the hospital studied, Black patients cost $1,800 less per year than white patients with the same number of chronic illnesses—a pattern observed across the United States.
The Boston University research highlighted this case study: "Obermeyer studied the accuracy of a healthcare algorithm that excluded social category classifiers... [They] found that since social category information, such as race, was excluded from the dataset the algorithm was trained and deployed on, the algorithm unintentionally used healthcare cost as a proxy variable for race. Black patients systemically have lower healthcare costs because there is unequal access to care for black and white patients and less money is spent on treatment of black patients."
When researchers recalibrated the algorithm using direct measures of health instead of costs, "The racial bias nearly disappeared, and the percentage of Black patients identified for additional care increased from 17.7% to 46.5%." This correction shows how directly the choice of prediction target shapes who receives additional care.
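To make the proxy-label issue concrete, here is a minimal sketch in Python using synthetic data. It is not the Optum model and uses no real coefficients; it simply shows how ranking patients by predicted cost rather than by a direct measure of health need changes who gets flagged for care management when spending is systematically lower for one group at the same level of illness.

```python
# Minimal sketch (not the Optum model): how the choice of prediction target
# changes who gets flagged for care management. All numbers are synthetic
# and illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: two groups with identical distributions of chronic
# conditions (true health need).
race = rng.choice(["Black", "white"], size=n)
conditions = rng.poisson(lam=2.0, size=n)

# Illustrative assumption echoing the study's finding: at the same number of
# chronic conditions, less money is spent on Black patients.
base_cost = 3_000 * conditions + rng.normal(0, 1_000, size=n)
observed_cost = np.where(race == "Black", base_cost - 1_800, base_cost)

def share_black_in_top(score, top_frac=0.03):
    """Share of Black patients among the top `top_frac` highest-scoring patients."""
    cutoff = np.quantile(score, 1 - top_frac)
    flagged = score >= cutoff
    return (race[flagged] == "Black").mean()

# "Cost as label" versus "health need as label" (condition count plus noise,
# standing in for an imperfect but unbiased predictor).
print("Flagged via predicted cost:", round(share_black_in_top(observed_cost), 3))
print("Flagged via health need:   ", round(share_black_in_top(conditions + rng.normal(0, 0.5, n)), 3))
```

Even in this toy setup, the cost-based ranking flags fewer Black patients than the need-based ranking, despite both groups being equally sick by construction.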
While Optum called the findings "misleading," arguing that hospitals should supplement their cost algorithm with socioeconomic data and physician expertise, the study reveals how algorithms may cause disparities when cost is used as a proxy for medical need. As Princeton researcher Ruha Benjamin noted, such systems risk creating a "New Jim Code" that can "hide, speed and deepen racial discrimination behind a veneer of technical neutrality."
Research published in The Lancet Digital Health revealed a serious data problem in AI systems being developed for skin cancer diagnosis. Dr. David Wen and colleagues from the University of Oxford examined 21 open-access datasets used to train AI algorithms for skin cancer detection and found severe underrepresentation of darker skin tones.
Of 106,950 total images across these datasets, only 2,436 had skin type recorded. Among these, just 10 images were from people with brown skin and only one was from an individual with dark brown or black skin. Even more concerning, of the 1,585 images that contained ethnicity-related data, "No images were from individuals with an African, African-Caribbean or South Asian background," the researchers reported.
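A metadata audit of this kind can be scripted. The sketch below is a hypothetical illustration, assuming each image record carries an optional Fitzpatrick skin-type field (the field names and sample records are invented, not taken from the Oxford datasets); it reports how many images have skin type recorded and flags missing coverage of darker skin types.

```python
# Minimal sketch of a training-set audit, assuming each image record carries
# optional Fitzpatrick skin-type metadata (field names are hypothetical).
from collections import Counter

records = [
    {"image_id": "img_001", "fitzpatrick_type": "II"},
    {"image_id": "img_002", "fitzpatrick_type": "II"},
    {"image_id": "img_003", "fitzpatrick_type": None},   # skin type not recorded
    {"image_id": "img_004", "fitzpatrick_type": "V"},
    # ... in practice, every record in the dataset
]

def audit_skin_type_coverage(records):
    """Report how many images record a skin type and how those labels are distributed."""
    labelled = [r["fitzpatrick_type"] for r in records if r["fitzpatrick_type"]]
    counts = Counter(labelled)
    print(f"{len(labelled)} of {len(records)} images have skin type recorded")
    for skin_type in ["I", "II", "III", "IV", "V", "VI"]:
        print(f"  Fitzpatrick {skin_type}: {counts.get(skin_type, 0)}")
    missing = [t for t in ["V", "VI"] if counts.get(t, 0) == 0]
    if missing:
        print(f"  WARNING: no images for darker skin types {missing}")

audit_skin_type_coverage(records)
```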
This lack of diversity in training data creates a risk that AI diagnostic tools will be less accurate for people with darker skin. As Dr. Wen explained: "You could have a situation where the regulatory authorities say that because this algorithm has only been trained on images in fair-skinned people, you're only allowed to use it for fair-skinned individuals, and therefore that could lead to certain populations being excluded from algorithms that are approved for clinical use."
The alternative scenario is equally problematic: if these biased algorithms are approved for use across all populations, they "may not perform as accurately on populations who don't have that many images involved in training." This could result in misdiagnosis leading to "avoidable surgery, missing treatable cancers and causing unnecessary anxiety," particularly for patients with darker skin.
Professor Charlotte Proby, a dermatology expert at the University of Dundee, emphasized that the "failure to train AI tools using images from darker skin types may impact their reliability for assessment of skin lesions in skin of colour," with potentially wide-ranging implications for healthcare equity.
The challenge with these systems reflects a concern raised in "Bias in medical AI," which states that "AI models trained on potentially biased labels may perpetuate and amplify not only differential misclassifications and substandard care practices based on these social factors, but also the original cognitive biases in its own predictions and recommendations."
Pharmacogenetic algorithms for warfarin dosing provide another example of healthcare AI bias with direct clinical implications. A study titled "Poor Warfarin Dose Prediction with Pharmacogenetic Algorithms that Exclude Genotypes Important for African Americans" showed that widely used pharmacogenetic dosing algorithms perform poorly in African Americans because they fail to account for critical genetic variations.
The researchers found that the algorithms used in clinical trials such as the Clarification of Optimal Anticoagulation through Genetics (COAG) trial miscalculated appropriate doses for African American patients.
The consequences of this algorithmic bias were evident in clinical trials. While the European Pharmacogenetics of Anticoagulation Therapy (EU-PACT) trial showed benefits from genetic dosing in its homogeneous European population, the COAG trial with its more diverse population found that "African Americans, who comprised approximately one-third of the COAG trial population, did worse with pharmacogenetic dosing, with a higher likelihood of supratherapeutic INR values with pharmacogenetic versus clinically based dosing."
When the researchers adjusted the algorithms to account for genetic variants that are common in, or specifically influence warfarin response in, people of African ancestry, dose predictions for African American patients improved substantially. This demonstrates how important it is to include diverse populations in algorithm development, particularly for potentially life-saving treatments like warfarin.
The researchers concluded, "Our data indicates that, when dosing warfarin based on genotype, it is important to account for variants that are either common or specifically influence warfarin response in African Americans and that not doing so can lead to significant overdosing in a large portion of the African American population."
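The mechanism is easy to see in a toy model. The sketch below is purely illustrative: the coefficients are invented, and it is not the COAG, IWPC, or any clinical dosing algorithm. It shows how a linear pharmacogenetic model on the square root of the weekly dose overestimates the dose for a carrier of a reduced-function allele (variants reported as important in African Americans include CYP2C9*5, *6, *8, *11 and rs12777823) whenever that term is left out of the model.

```python
# Illustrative sketch only: the coefficients below are made up for demonstration
# and are NOT a clinical dosing algorithm. It shows how omitting variants common
# in people of African ancestry inflates the predicted dose for carriers.

def weekly_dose_mg(age_decades, vkorc1_a_alleles, cyp2c9_star2_star3,
                   african_specific_variants=0, include_african_variants=True):
    """Toy linear model on sqrt(weekly dose), in the style of pharmacogenetic algorithms."""
    sqrt_dose = 8.0                      # baseline (illustrative)
    sqrt_dose -= 0.3 * age_decades       # older patients need less warfarin
    sqrt_dose -= 0.9 * vkorc1_a_alleles  # VKORC1 -1639 A alleles reduce dose
    sqrt_dose -= 1.0 * cyp2c9_star2_star3
    if include_african_variants:
        # Reduced-function alleles common in African-ancestry populations
        sqrt_dose -= 0.8 * african_specific_variants
    return round(sqrt_dose ** 2, 1)

# Same hypothetical patient, carrying one copy of a reduced-function allele:
patient = dict(age_decades=6, vkorc1_a_alleles=1, cyp2c9_star2_star3=0,
               african_specific_variants=1)
print("Dose ignoring the variant: ",
      weekly_dose_mg(**patient, include_african_variants=False), "mg/week")
print("Dose accounting for it:    ",
      weekly_dose_mg(**patient, include_african_variants=True), "mg/week")
```

Leaving the variant term out produces a higher predicted dose for the same patient, which is exactly the overdosing pattern the researchers describe.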
A 2024 UK government-commissioned review titled "Equity in Medical Devices: Independent Review" found that minority ethnic people, women, and people from deprived communities are at risk of poorer healthcare because of biases within medical tools and devices.
The review confirmed concerns that pulse oximeters overestimate the amount of oxygen in the blood of people with dark skin. While there was no evidence of this affecting care in the NHS, studies in the US have shown such biases leading to delayed diagnosis and treatment, worse organ function, and higher mortality in Black patients.
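Auditing a device for this kind of bias is conceptually straightforward when paired readings are available. The sketch below is a hypothetical illustration, not an NHS or regulatory protocol: it assumes paired pulse-oximeter (SpO2) and arterial blood gas (SaO2) readings and applies one commonly used definition of occult hypoxemia, an SaO2 below 88% despite an SpO2 reading of 92% or higher.

```python
# Minimal sketch of a device-bias audit, assuming paired readings of pulse
# oximetry (SpO2) and arterial blood gas saturation (SaO2) are available.
from dataclasses import dataclass

@dataclass
class PairedReading:
    group: str   # self-reported race/ethnicity or skin-tone category
    spo2: float  # pulse oximeter reading (%)
    sao2: float  # arterial blood gas saturation (%)

def occult_hypoxemia_rate(readings, group):
    """Rate at which the oximeter reads 'safe' (>= 92%) while true SaO2 is < 88%."""
    subset = [r for r in readings if r.group == group]
    missed = [r for r in subset if r.spo2 >= 92 and r.sao2 < 88]
    return len(missed) / len(subset) if subset else float("nan")

# Hypothetical paired readings; a real audit would use thousands of them.
readings = [
    PairedReading("Black", 94, 86), PairedReading("Black", 95, 91),
    PairedReading("white", 94, 93), PairedReading("white", 93, 92),
]
for g in ("Black", "white"):
    print(g, "occult hypoxemia rate:", occult_hypoxemia_rate(readings, g))
```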
The UK report also highlighted concerns about AI-based medical devices:
The report noted problems with polygenic risk scores used to assess individual disease risk based on genetic factors. "Major genetic datasets that polygenic risk scores use are overwhelmingly on people of European ancestry, which means that they may not be applicable to people of other ancestries," according to Professor Enitan Carrol of the University of Liverpool.
Even attempts to correct biases can create new problems. The report highlighted how race-based corrections applied to spirometer measurements (devices used to assess lung function and diagnose respiratory conditions) have themselves been found to contain biases.
Addressing algorithmic bias in healthcare requires coordinated effort across multiple dimensions, spanning technical solutions, policy frameworks, clinical practice, and open science. Based on research published by the NIH, the following strategies represent the most promising paths forward:
Technical approaches to mitigating algorithmic bias must be implemented throughout the AI development lifecycle, from data collection and curation through model training and validation to post-deployment monitoring; a minimal example of one such check, a subgroup error audit, is sketched below.
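The sketch compares false negative rates across demographic groups and warns when the gap exceeds a tolerance. The function names, threshold, and toy data are illustrative assumptions, not part of the NIH recommendations.

```python
# Minimal sketch of a post-hoc subgroup audit: compare error rates across
# demographic groups before deployment. Names, tolerance, and data are
# illustrative, not a standard.
import numpy as np

def false_negative_rate(y_true, y_pred):
    """Share of truly high-risk patients the model failed to flag."""
    positives = y_true == 1
    return np.mean(y_pred[positives] == 0) if positives.any() else float("nan")

def subgroup_audit(y_true, y_pred, groups, tolerance=0.05):
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        rates[g] = false_negative_rate(y_true[mask], y_pred[mask])
        print(f"{g}: false negative rate = {rates[g]:.2f}")
    gap = max(rates.values()) - min(rates.values())
    if gap > tolerance:
        print(f"WARNING: subgroup gap of {gap:.2f} exceeds tolerance of {tolerance}")

# Toy example with made-up labels and predictions.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1])
groups = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
subgroup_audit(y_true, y_pred, groups)
```

The same pattern extends to other metrics (false positive rate, calibration error) depending on which harm matters most for the clinical use case.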
AI regulation in healthcare is evolving, and emerging frameworks must build fairness considerations in alongside safety and efficacy requirements.
Technical solutions alone are insufficient without clinical integration; clinicians need to understand an algorithm's limitations and retain meaningful oversight of its recommendations.
The NIH research article also strongly advocates for open science practices, such as transparency about training data and methods, as a way to address bias.
As the NIH research article concludes: "In order for new technologies to be inclusive, they need to be accurate and representative of the needs of diverse populations." By implementing these multifaceted strategies, healthcare can work toward AI systems that serve all patients equitably, regardless of race, gender, socioeconomic status, or other characteristics.
AI systems can become biased if trained on unrepresentative or incomplete data, reflecting existing healthcare disparities.
Algorithmic bias is often less visible and harder to detect than human bias because it is embedded in automated decision-making systems.
Social factors, like income and race, can indirectly influence AI outcomes if used as proxies in models without appropriate safeguards.
The problem extends beyond the cases above: some AI tools used for mental health screening have been shown to misclassify symptoms based on cultural or linguistic differences.
Preventing these harms requires using diverse training data, conducting regular bias audits, and integrating clinician oversight into AI workflows.