Leveraging Synthetic Data as a Tool to Combat Bias in Artificial Intelligence (AI) Model Training

Jumai Adedoja Fabuyi *

University of Illinois Urbana-Champaign, 901 West Illinois Street, Urbana, IL 61801, United States.

*Author to whom correspondence should be addressed.


Abstract

This study investigates the efficacy of synthetic data in mitigating bias in artificial intelligence (AI) model training, focusing on demographic inclusivity and fairness. Using Generative Adversarial Networks (GANs), synthetic datasets were generated from the UCI Adult Dataset, the COMPAS Recidivism Dataset, and the MIMIC-III Clinical Database. Logistic regression models were trained on both the synthetic and the original datasets to compare fairness metrics and predictive accuracy. Fairness was assessed through demographic parity and equality of opportunity, which measure balanced prediction rates and equitable outcomes across demographic groups. Fidelity and data diversity were evaluated using the Kolmogorov-Smirnov (KS) test and Kullback-Leibler (KL) divergence, along with the Inception Score, which quantifies diversity in synthetic data. The results revealed significant fairness improvements for models trained on synthetic datasets. For the COMPAS dataset, demographic parity increased from 0.72 to 0.89, and equality of opportunity rose from 0.65 to 0.83, with almost no loss in predictive accuracy (AUC-ROC of 0.82, compared to 0.83 for the original data). Based on these findings, this research recommends employing GANs to generate synthetic data in bias-sensitive domains to enhance demographic inclusivity and support equitable outcomes in AI models. Furthermore, integrating human-in-the-loop (HITL) systems is critical to monitor and address residual biases during data generation. Standardized validation frameworks, including fairness metrics and fidelity tests, should be adopted to ensure transparency and consistency across applications. These practices can enable organizations to leverage synthetic data effectively while maintaining ethical standards in AI development and deployment.
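The two fairness metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formulas used, so this example assumes the common ratio form, in which each metric is the lowest group rate divided by the highest (1.0 means perfect parity). The function names and the toy arrays are hypothetical and are not taken from the study.

```python
import numpy as np

def _min_max_ratio(rates):
    """Ratio of the smallest to the largest group rate (1.0 = perfect parity)."""
    highest = max(rates)
    # Assumption: if no group receives any positive prediction, treat as parity.
    return 1.0 if highest == 0 else float(min(rates) / highest)

def demographic_parity_ratio(y_pred, group):
    """Compare the positive-prediction rate across demographic groups."""
    rates = [np.mean(y_pred[group == g]) for g in np.unique(group)]
    return _min_max_ratio(rates)

def equal_opportunity_ratio(y_true, y_pred, group):
    """Compare the true-positive rate across demographic groups."""
    tprs = [np.mean(y_pred[(group == g) & (y_true == 1)])
            for g in np.unique(group)]
    return _min_max_ratio(tprs)

# Toy data (illustrative only): two groups, "a" and "b", four records each.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

dp = demographic_parity_ratio(y_pred, group)
eo = equal_opportunity_ratio(y_true, y_pred, group)
print(f"demographic parity ratio: {dp:.2f}")
print(f"equal opportunity ratio:  {eo:.2f}")
```

Under this assumed definition, a value closer to 1.0 indicates more balanced treatment across groups, which matches the direction of improvement reported for the COMPAS dataset. A difference-based definition (the gap between group rates) is an equally common alternative and would read in the opposite direction.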

Keywords: Synthetic data, bias mitigation, GANs, demographic parity, AI ethics


How to Cite

Fabuyi, Jumai Adedoja. 2024. “Leveraging Synthetic Data As a Tool to Combat Bias in Artificial Intelligence (AI) Model Training”. Journal of Engineering Research and Reports 26 (12):24-46. https://doi.org/10.9734/jerr/2024/v26i121337.