Publication Title: PLOS Digital Health
Volume: 4
Page: e0000807
Year: 2025
Abstract
The rapid integration of artificial intelligence (AI) into healthcare has raised many concerns about race bias in AI models. Yet, overlooked in this dialogue is the lack of quality control for the accuracy of patient race and ethnicity (r/e) data in electronic health records (EHR). This article critically examines the factors driving inaccurate and unrepresentative r/e datasets. These include conceptual uncertainties about how to categorize race and ethnicity, shortcomings in data collection practices and EHR standards, and the misclassification of patients’ race or ethnicity. To address these challenges, we propose a two-pronged action plan. First, we present a set of best practices for healthcare systems and medical AI researchers to improve r/e data accuracy. Second, we call for developers of medical AI models to transparently warrant the quality of their r/e data. Given the ethical and scientific imperatives of ensuring high-quality r/e data in AI-driven healthcare, we argue that these steps should be taken immediately.
Author summary
Healthcare systems are increasingly using artificial intelligence (AI) to improve clinical care in various settings such as hospitals and patient care facilities. In this paper, we discuss how these AI systems may be trained using inaccurate and incomplete patient race and ethnicity data. We identify several key issues underlying this data quality problem: the conceptual challenges in defining race and ethnicity categories, inconsistent data collection practices across healthcare facilities, and frequent errors in classifying patients’ race and ethnicity. These problems create unreliable training data that undermines efforts to avoid and correct biases within these medical AI tools. To address these challenges, we propose two practical solutions. First, hospitals should adopt best practices for collecting race and ethnicity information, including patient self-reporting, staff training, and transparent processes. Second, developers of medical AI should be required to disclose the quality and sources of the demographic data used to train their models. Our work emphasizes that discussions about fairness in medical AI must include attention to the quality of race and ethnicity data. As these technologies become more widespread in healthcare, ensuring they work effectively for all patients requires addressing these fundamental data issues.
Recommended Citation
Alexandra Tsalidis, Lakshmi Bharadwaj, and Francis X. Shen, Standardization and Accuracy of Race and Ethnicity Data: Equity Implications for Medical AI, 4 PLOS Digital Health e0000807 (2025), available at https://scholarship.law.umn.edu/faculty_articles/1161.
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
