GPT-4 Can Perpetuate Racial and Gender Bias in Health Care

Author: Brigham and Women's Hospital
Published: 18 Dec 2023 - Updated: 7 Jun 2026
Publication Details: Peer-Reviewed, Observational Study

Synopsis: This research, a peer-reviewed observational study from Brigham and Women's Hospital published in The Lancet Digital Health, examined whether the large language model GPT-4 encodes racial and gender biases when used to support clinical decision making. Investigators tested the model across four roles - generating patient vignettes, building differential diagnoses, creating treatment plans, and assessing subjective patient traits - and documented measurable disparities tied to a patient's race, ethnicity, or gender. The findings carry weight because they come from a founding member of the Mass General Brigham system and rely on established medical education tools, making them useful for clinicians, patients, seniors, and people with disabilities who increasingly encounter AI-assisted care and need to understand where these systems can fall short.*

Topic Definition: Algorithmic Bias in Health Care

Algorithmic bias in health care refers to the systematic and repeatable errors that arise when an automated system, such as a large language model or other machine learning tool, produces outcomes that unfairly differ across groups defined by race, ethnicity, gender, or other characteristics. These biases often stem from patterns embedded in the data used to train the model, which can reflect and then amplify existing inequities in medical research and clinical practice. In a clinical decision support setting, such bias can shape diagnoses, treatment recommendations, and judgments about a patient's character or pain, making careful evaluation essential before these systems are relied upon in patient care.

Introduction

"Assessing the Potential of GPT-4 to Perpetuate Racial and Gender Biases in Health Care: A Model Evaluation Study" - The Lancet Digital Health.

Large language models (LLMs) like ChatGPT and GPT-4 could assist clinical practice by automating administrative tasks, drafting clinical notes, communicating with patients, and even supporting clinical decision making. However, preliminary studies suggest the models can encode and perpetuate social biases that could adversely affect historically marginalized groups. A new study by investigators from Brigham and Women's Hospital, a founding member of the Mass General Brigham healthcare system, evaluated the tendency of GPT-4 to encode and exhibit racial and gender biases in four clinical decision support roles. Their results are published in The Lancet Digital Health.

Main Content

"While most of the focus is on using LLMs for documentation or administrative tasks, there is also excitement about the potential to use LLMs to support clinical decision making," said corresponding author Emily Alsentzer, PhD, a postdoctoral researcher in the Division of General Internal Medicine at Brigham and Women's Hospital. "We wanted to systematically assess whether GPT-4 encodes racial and gender biases that impact its ability to support clinical decision making."

Testing

Alsentzer and colleagues tested four applications of GPT-4 using the Azure OpenAI platform. First, they prompted GPT-4 to generate patient vignettes that can be used in medical education. Next, they tested GPT-4's ability to correctly develop a differential diagnosis and a treatment plan for 19 different patient cases from NEJM Healer, a medical education tool that presents challenging clinical cases to medical trainees. Finally, they assessed how GPT-4 makes inferences about a patient's clinical presentation using eight case vignettes that were originally generated to measure implicit bias. For each application, the authors assessed whether GPT-4's outputs were biased by race or gender.

For the medical education task, the researchers constructed ten prompts that required GPT-4 to generate a patient presentation for a supplied diagnosis. They ran each prompt 100 times and found that GPT-4 exaggerated known differences in disease prevalence by demographic group.
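One way to quantify this kind of exaggeration is to count how often a demographic group appears across repeated generations and compare that share with a reference prevalence. The sketch below is a minimal illustration in Python, not the authors' analysis: the labels, the `reference_share` value, and the helper name `binomial_upper_tail` are all assumptions made for the example. Only the run count of 100 follows the article's description.

```python
from collections import Counter
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# HYPOTHETICAL labels extracted from 100 generated vignettes for one prompt.
# The 70/30 split is a placeholder, not a result from the study.
labels = ["group A"] * 70 + ["group B"] * 30

# HYPOTHETICAL reference: assumed real-world share of group A among patients
# with the supplied diagnosis. Replace with an epidemiological estimate.
reference_share = 0.40

counts = Counter(labels)
n_runs = len(labels)
observed_share = counts["group A"] / n_runs

# A ratio above 1 means the group appears more often than the reference predicts.
over_representation = observed_share / reference_share

# One-sided exact test: how likely is a count this high if the model
# sampled demographics at the reference rate?
p_value = binomial_upper_tail(counts["group A"], n_runs, reference_share)

print(f"observed share: {observed_share:.2f}")
print(f"over-representation ratio: {over_representation:.2f}")
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value here would only indicate that the generated demographics depart from the assumed reference rate; the conclusion is only as good as the prevalence estimate supplied.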

"One striking example is when GPT-4 is prompted to generate a vignette for a patient with sarcoidosis: GPT-4 describes a Black woman 81% of the time," Alsentzer explains. "While sarcoidosis is more prevalent in Black patients and in women, it's not 81% of all patients."

Next, when GPT-4 was prompted to develop a list of 10 possible diagnoses for the NEJM Healer cases, changing the gender or race/ethnicity of the patient significantly affected its ability to prioritize the correct top diagnosis in 37% of cases.
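The comparison behind this result can be pictured as checking where the correct diagnosis lands in the ranked list for each demographic variant of the same case. The following Python sketch assumes a hypothetical data layout (a dictionary of ranked lists keyed by variant) and placeholder diagnoses; it is not the study's code, and a real evaluation would repeat each prompt many times and apply a statistical test rather than compare single outputs.

```python
from typing import Optional


def rank_of(diagnosis: str, differential: list[str]) -> Optional[int]:
    """Return the 1-based position of `diagnosis` in a ranked list, or None."""
    target = diagnosis.casefold()
    for position, candidate in enumerate(differential, start=1):
        if candidate.casefold() == target:
            return position
    return None


# HYPOTHETICAL ranked outputs for one case, keyed by the demographic variant
# used in the prompt. Diagnoses are placeholders, not study data.
differentials = {
    "variant 1": ["condition A", "condition B", "condition C"],
    "variant 2": ["condition B", "condition A", "condition C"],
    "variant 3": ["condition B", "condition C"],
}
correct = "condition A"

ranks = {variant: rank_of(correct, dx) for variant, dx in differentials.items()}

# The case is flagged if the correct diagnosis does not hold the same
# rank (or is missing) in every variant.
flagged = len(set(ranks.values())) > 1

print(ranks)
print("ranking changes with demographics:", flagged)
```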

"In some cases, GPT-4's decision making reflects known gender and racial biases in the literature," Alsentzer said. "In the case of pulmonary embolism, the model ranked panic attack/anxiety as a more likely diagnosis for women than men. It also ranked sexually transmitted diseases, such as acute HIV and syphilis, as more likely for patients from racial minority backgrounds compared to white patients."

When asked to evaluate subjective patient traits such as honesty, understanding, and pain tolerance, GPT-4 produced significantly different responses by race, ethnicity, and gender for 23% of the questions. For example, GPT-4 was significantly more likely to rate Black male patients as abusing the opioid Percocet than Asian, Black, Hispanic, and white female patients, even though the answers should have been identical across all the simulated patient cases.
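The reason the answers should match is that such test cases differ only in the stated demographics. A minimal Python sketch of that counterfactual setup is shown below. The vignette text, the rating question, and the function name `build_counterfactual_prompts` are hypothetical; only the race and gender categories are taken from the example above.

```python
from itertools import product

# Groups named in the article's example.
RACES = ["Asian", "Black", "Hispanic", "white"]
GENDERS = ["male", "female"]

# HYPOTHETICAL vignette and question; the study's actual cases are not
# reproduced here.
TEMPLATE = (
    "Patient demographics: {race}, {gender}. The patient reports ongoing pain "
    "two weeks after surgery and asks for a refill of a prescribed pain "
    "medication. On a scale of 1 to 5, how likely is it that the patient is "
    "misusing the medication?"
)


def build_counterfactual_prompts(template: str) -> dict[tuple[str, str], str]:
    """Create one prompt per race/gender pair; all other text is identical."""
    return {
        (race, gender): template.format(race=race, gender=gender)
        for race, gender in product(RACES, GENDERS)
    }


prompts = build_counterfactual_prompts(TEMPLATE)

# Show the start of each variant to confirm only the demographics change.
for (race, gender), prompt in prompts.items():
    print(f"[{race} {gender}] {prompt[:60]}")
```

Each prompt would then be sent to the model repeatedly and the ratings compared across groups. Because the clinical details are held constant, a systematic difference in ratings points to the demographic terms as the cause.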

Limitations of the current study include testing GPT-4's responses using a limited number of simulated prompts and analyzing model performance using only a few traditional categories of demographic identities. Future work should investigate biases using clinical notes from the electronic health record.

"While LLM-based tools are currently being deployed with a clinician in the loop to verify the model's outputs, it is very challenging for clinicians to detect systemic biases when viewing individual patient cases," Alsentzer said. "It is critical that we perform bias evaluations for each intended use of LLMs, just as we do for other machine learning models in the medical domain. Our work can help start a conversation about GPT-4's potential to propagate bias in clinical decision support applications."

Authorship:

Additional BWH authors include Jorge A Rodriguez, David W Bates, and Raja-Elie E Abdulnour. Additional authors include Travis Zack, Eric Lehman, Mirac Suzgun, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, and Atul J Butte.

Disclosures:

Alsentzer reports personal fees from Canopy Innovations, Fourier Health, and Xyla; and grants from Microsoft Research. Abdulnour is an employee of Massachusetts Medical Society, which owns NEJM Healer (NEJM Healer cases were used in the study). Additional author disclosures can be found in the paper.

Funding:

T32 NCI Hematology/Oncology Training Fellowship; Open Philanthropy and the National Science Foundation (IIS-2128145); and a philanthropic gift from Priscilla Chan and Mark Zuckerberg.

Paper Cited:

Zack, T; Lehman, E et al. "Assessing the potential of GPT-4 to perpetuate racial and gender biases in healthcare: a model evaluation study" The Lancet Digital Health.

Insights, Analysis, and Developments

Editorial Note: The value of this study lies less in condemning the technology than in showing how quietly bias can slip into tools that look objective on the surface. Because a clinician reviewing a single patient at a time has almost no way to spot a systemic pattern, the authors make a reasonable case that every intended medical use of these models deserves its own bias audit, much like any other clinical instrument. As AI moves closer to the bedside, work like this offers a practical reminder that trustworthy care depends on testing the assumptions built into the software, not just the judgment of the people using it.*

Attribution/Source(s): This peer-reviewed publication was selected for publishing by the editors of Disabled World (DW) due to its relevance to the disability community. Originally authored by Brigham and Women's Hospital and published on 18 Dec 2023, this content may have been edited for style, clarity, or brevity.

* Editorial additions by Ian C. Langtree.
