GPT-4 Can Perpetuate Racial and Gender Bias in Health Care
Author: Brigham and Women's Hospital
Published: 18 Dec 2023 - Updated: 7 Jun 2026
Publication Details: Peer-Reviewed, Observational Study
Synopsis: This research, a peer-reviewed observational study from Brigham and Women's Hospital published in The Lancet Digital Health, examined whether the large language model GPT-4 encodes racial and gender biases when used to support clinical decision making. Investigators tested the model across four roles - generating patient vignettes, building differential diagnoses, creating treatment plans, and assessing subjective patient traits - and documented measurable disparities tied to a patient's race, ethnicity, or gender. The findings carry weight because they come from a founding member of the Mass General Brigham system and rely on established medical education tools, making them useful for clinicians, patients, seniors, and people with disabilities who increasingly encounter AI-assisted care and need to understand where these systems can fall short.*
At a Glance
- 1 - When prompted to generate a patient vignette for sarcoidosis, GPT-4 described the patient as a Black woman 81 percent of the time, exaggerating real but far smaller demographic differences in disease prevalence.
- 2 - The model ranked panic attack or anxiety as a more likely diagnosis for women than men in a pulmonary embolism case, and rated certain infections as more likely for patients from racial minority backgrounds.
- 3 - GPT-4 was significantly more likely to rate Black male patients as abusing the opioid Percocet than Asian, Black, Hispanic, and white female patients, even though the answers should have been identical for all the simulated cases.
Topic Definition: Algorithmic Bias in Health Care
Algorithmic bias in health care refers to the systematic and repeatable errors that arise when an automated system, such as a large language model or other machine learning tool, produces outcomes that unfairly differ across groups defined by race, ethnicity, gender, or other characteristics. These biases often stem from patterns embedded in the data used to train the model, which can reflect and then amplify existing inequities in medical research and clinical practice. In a clinical decision support setting, such bias can shape diagnoses, treatment recommendations, and judgments about a patient's character or pain, making careful evaluation essential before these systems are relied upon in patient care.
Introduction
"Assessing the Potential of GPT-4 to Perpetuate Racial and Gender Biases in Health Care: A Model Evaluation Study" - The Lancet Digital Health.
Large language models (LLMs) like ChatGPT and GPT-4 have the potential to assist in clinical practice by automating administrative tasks, drafting clinical notes, communicating with patients, and even supporting clinical decision making. However, preliminary studies suggest the models can encode and perpetuate social biases that could adversely affect historically marginalized groups. A new study by investigators from Brigham and Women's Hospital, a founding member of the Mass General Brigham healthcare system, evaluated the tendency of GPT-4 to encode and exhibit racial and gender biases in four clinical decision support roles. Their results are published in The Lancet Digital Health.
Main Content
"While most of the focus is on using LLMs for documentation or administrative tasks, there is also excitement about the potential to use LLMs to support clinical decision making," said corresponding author Emily Alsentzer, PhD, a postdoctoral researcher in the Division of General Internal Medicine at Brigham and Women's Hospital. "We wanted to systematically assess whether GPT-4 encodes racial and gender biases that impact its ability to support clinical decision making."
Testing
Alsentzer and colleagues tested four applications of GPT-4 using the Azure OpenAI platform. First, they prompted GPT-4 to generate patient vignettes that can be used in medical education. Next, they tested GPT-4's ability to correctly develop a differential diagnosis and a treatment plan for 19 different patient cases from NEJM Healer, a medical education tool that presents challenging clinical cases to medical trainees. Finally, they assessed how GPT-4 makes inferences about a patient's clinical presentation using eight case vignettes that were originally generated to measure implicit bias. For each application, the authors assessed whether GPT-4's outputs were biased by race or gender.
For the medical education task, the researchers constructed ten prompts that required GPT-4 to generate a patient presentation for a supplied diagnosis. They ran each prompt 100 times and found that GPT-4 exaggerated known differences in disease prevalence by demographic group.
"One striking example is when GPT-4 is prompted to generate a vignette for a patient with sarcoidosis: GPT-4 describes a Black woman 81% of the time," Alsentzer said. "While sarcoidosis is more prevalent in Black patients and in women, it's not 81% of all patients."
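To make the repeated-prompt comparison concrete, the following is a minimal Python sketch of how the share of a demographic group across 100 generated vignettes might be tallied and compared with a reference share. It is not the authors' code. The 81-of-100 figure comes from the article; the label "other" for the remaining runs and the value of `HYPOTHETICAL_REFERENCE_SHARE` are placeholders chosen for illustration, not real prevalence estimates.

```python
from collections import Counter
from math import comb


def demographic_shares(labels):
    """Return the fraction of vignettes assigned to each demographic label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def binomial_upper_tail(successes, trials, reference_share):
    """Exact probability of at least `successes` hits in `trials` draws
    if the true share were `reference_share`."""
    return sum(
        comb(trials, k) * reference_share**k * (1 - reference_share) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 81 of 100 runs described a Black woman (figure reported in the article).
# The "other" label for the remaining 19 runs is an assumption.
labels = ["Black woman"] * 81 + ["other"] * 19
shares = demographic_shares(labels)

# PLACEHOLDER value for illustration only, not a real prevalence estimate.
HYPOTHETICAL_REFERENCE_SHARE = 0.30
p_value = binomial_upper_tail(81, 100, HYPOTHETICAL_REFERENCE_SHARE)

print(shares["Black woman"], p_value)
```

A very small tail probability under the chosen reference share would suggest the model's output distribution is more skewed than that reference; the conclusion depends entirely on using a defensible reference value.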
Next, when GPT-4 was prompted to develop a list of 10 possible diagnoses for the NEJM Healer cases, changing the gender or race/ethnicity of the patient significantly affected its ability to prioritize the correct top diagnosis in 37% of cases.
"In some cases, GPT-4's decision making reflects known gender and racial biases in the literature," Alsentzer said. "In the case of pulmonary embolism, the model ranked panic attack/anxiety as a more likely diagnosis for women than men. It also ranked sexually transmitted diseases, such as acute HIV and syphilis, as more likely for patients from racial minority backgrounds compared to white patients."
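One way such a shift in diagnostic ranking could be quantified is sketched below in Python: for each demographic variant of a case, record where a given diagnosis lands in the model's ranked list and average that position per group. This is an illustrative sketch, not the study's analysis. The function names, the toy ranked lists, and the choice to score an omitted diagnosis as rank 11 (one past a 10-item list) are all assumptions.

```python
def rank_of(diagnosis, ranked_list):
    """1-based position of `diagnosis` in a differential, or None if absent."""
    try:
        return ranked_list.index(diagnosis) + 1
    except ValueError:
        return None


def mean_rank_by_group(runs, diagnosis, missing_rank):
    """Average rank of `diagnosis` per demographic group.

    `runs` is a list of (group, ranked_list) pairs. A differential that
    omits the diagnosis is scored as `missing_rank`.
    """
    ranks = {}
    for group, ranked_list in runs:
        rank = rank_of(diagnosis, ranked_list)
        ranks.setdefault(group, []).append(missing_rank if rank is None else rank)
    return {group: sum(values) / len(values) for group, values in ranks.items()}


# Invented toy outputs for illustration only, not study data.
runs = [
    ("woman", ["panic attack", "pulmonary embolism", "pneumonia"]),
    ("woman", ["pulmonary embolism", "panic attack", "pneumonia"]),
    ("man", ["pulmonary embolism", "pneumonia", "panic attack"]),
    ("man", ["pulmonary embolism", "pneumonia"]),
]

result = mean_rank_by_group(runs, "panic attack", missing_rank=11)
print(result)
```

A lower mean rank means the diagnosis was placed nearer the top of the list for that group; a real evaluation would need many runs per variant and a significance test before drawing conclusions.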
When asked to evaluate subjective patient traits such as honesty, understanding, and pain tolerance, GPT-4 produced significantly different responses by race, ethnicity, and gender for 23% of the questions. For example, GPT-4 was significantly more likely to rate Black male patients as abusing the opioid Percocet than Asian, Black, Hispanic, and white female patients when the answers should have been identical for all the simulated patient cases.
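The idea of holding a case fixed while varying only the stated demographics can be sketched in Python as follows. This is a hedged illustration rather than the study's procedure: the template wording, the function names, and the choice of a permutation test on mean ratings are assumptions, and the ratings would have to come from actual model responses.

```python
import random
from itertools import product

RACES = ["Asian", "Black", "Hispanic", "white"]
GENDERS = ["female", "male"]


def counterfactual_prompts(template):
    """Fill one vignette template with every race/gender combination so the
    clinical details stay identical across variants."""
    return {
        (race, gender): template.format(race=race, gender=gender)
        for race, gender in product(RACES, GENDERS)
    }


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean ratings."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a = pooled[: len(group_a)]
        b = pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical template text, invented for illustration.
TEMPLATE = "A {race} {gender} patient requests a refill of a pain prescription."
prompts = counterfactual_prompts(TEMPLATE)
print(len(prompts), prompts[("Black", "male")])
```

With ratings collected for two groups of otherwise identical prompts, `permutation_p_value(ratings_a, ratings_b)` estimates how often a difference at least as large would arise if group labels were irrelevant.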
Limitations of the current study include testing GPT-4's responses using a limited number of simulated prompts and analyzing model performance using only a few traditional categories of demographic identities. Future work should investigate biases using clinical notes from the electronic health record.
"While LLM-based tools are currently being deployed with a clinician in the loop to verify the model's outputs, it is very challenging for clinicians to detect systemic biases when viewing individual patient cases," Alsentzer said. "It is critical that we perform bias evaluations for each intended use of LLMs, just as we do for other machine learning models in the medical domain. Our work can help start a conversation about GPT-4's potential to propagate bias in clinical decision support applications."
Authorship:
Additional BWH authors include Jorge A Rodriguez, David W Bates, and Raja-Elie E Abdulnour. Additional authors include Travis Zack, Eric Lehman, Mirac Suzgun, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, and Atul J Butte.
Disclosures:
Alsentzer reports personal fees from Canopy Innovations, Fourier Health, and Xyla; and grants from Microsoft Research. Abdulnour is an employee of Massachusetts Medical Society, which owns NEJM Healer (NEJM Healer cases were used in the study). Additional author disclosures can be found in the paper.
Funding:
T32 NCI Hematology/Oncology Training Fellowship; Open Philanthropy and the National Science Foundation (IIS-2128145); and a philanthropic gift from Priscilla Chan and Mark Zuckerberg.
Paper Cited:
Zack, T; Lehman, E et al. "Assessing the potential of GPT-4 to perpetuate racial and gender biases in healthcare: a model evaluation study" The Lancet Digital Health.
Insights, Analysis, and Developments
Editorial Note: The value of this study lies less in condemning the technology than in showing how quietly bias can slip into tools that look objective on the surface. Because a clinician reviewing a single patient at a time has almost no way to spot a systemic pattern, the authors make a reasonable case that every intended medical use of these models deserves its own bias audit, much like any other clinical instrument. As AI moves closer to the bedside, work like this offers a practical reminder that trustworthy care depends on testing the assumptions built into the software, not just the judgment of the people using it.*
Attribution/Source(s): This peer-reviewed publication was selected for publishing by the editors of Disabled World (DW) due to its relevance to the disability community. Originally authored by Brigham and Women's Hospital and published on 18 Dec 2023, this content may have been edited for style, clarity, or brevity.
* Editorial additions by Ian C. Langtree.