Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective

Authors: Thai-Hoang Pham, Jiayuan Chen, Seungyeon Lee, Yuanlong Wang, Sayoko Moroi, Xueru Zhang, Ping Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
Researcher Affiliation	Academia	(1) Department of Computer Science and Engineering, The Ohio State University; (2) Department of Biomedical Informatics, The Ohio State University; (3) Department of Ophthalmology and Visual Sciences, The Ohio State University
Pseudocode	No	The paper describes various algorithms such as DeepHit, Nnet-survival, PMF, DRO, SR, FRL, DI, and CSA, and provides mathematical formulations for some, but does not present them in structured pseudocode or algorithm blocks.
Open Source Code	Yes	More implementation details can be found in Appendix E.2 and in our code repository. https://github.com/pth1993/FairTTE
Open Datasets	Yes	FairTTE includes MIMIC-CXR [27] for predicting in-hospital mortality from chest X-ray images, ADNI [49] for predicting Alzheimer's disease from brain MRI images, and AREDS [14] for predicting late AMD from color fundus images. ... MIMIC-CXR: https://physionet.org/content/mimiciv/3.1/ MIMIC-CXR-JPG: https://physionet.org/content/mimic-cxr-jpg/2.1.0/ ADNI: https://adni.loni.usc.edu
Dataset Splits	Yes	Each dataset in our study was divided into training, validation, and testing sets using a 60%:20%:20% split ratio.
Hardware Specification	Yes	The experiments were conducted at a supercomputing center utilizing multiple compute nodes. Each node was equipped with an NVIDIA Volta V100 GPU with 16 GB of memory, an Intel Xeon CPU, and 32 GB of RAM, ensuring the computational resources necessary for large-scale experiments.
Software Dependencies	No	The FairTTE benchmark is implemented using Python 3, with PyTorch [47] serving as the framework for deep learning computations. The implementation of TTE models is built on the pycox [41] package, while the evaluation metrics for TTE prediction leverage pycox, scikit-survival [50], and SurvivalEVAL [52].
Experiment Setup	Yes	To ensure a fair comparison, we perform a grid-based hyperparameter search using 10 random seeds. The details of the hyperparameter search for the methods used in our experiments are provided below. TTE prediction models: Learning rate: 10^x where x ~ Uniform(-4, -3); Decay rate: 10^x where x ~ Uniform(-6, -4). Fair TTE prediction models: η: 10^x where x ~ Uniform(-3, -1) (DRO); λ: 10^x where x ~ Uniform(-5, -2) (FRL).
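
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It is not taken from the FairTTE repository: the names `SEARCH_SPACE`, `sample_config`, and `split_indices` are hypothetical, the exponent ranges are read as negative (for example Uniform(-4, -3) for the learning rate), which is an assumption, and the split is a plain random shuffle, since the table does not say whether the authors stratify or split by patient.

```python
import random

# Hypothetical search space: each value is drawn as 10**x with x ~ Uniform(low, high).
# The negative exponent ranges are an assumption about the reported setup.
SEARCH_SPACE = {
    "learning_rate": (-4, -3),  # TTE prediction models
    "decay_rate": (-6, -4),     # TTE prediction models
    "eta_dro": (-3, -1),        # DRO
    "lambda_frl": (-5, -2),     # FRL
}


def sample_config(rng):
    """Draw one log-uniform value per hyperparameter."""
    return {
        name: 10 ** rng.uniform(low, high)
        for name, (low, high) in SEARCH_SPACE.items()
    }


def split_indices(n, rng, fractions=(0.6, 0.2, 0.2)):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # One draw per seed; how the 10 seeds interact with the search is not
    # specified in the table, so this loop is only illustrative.
    for seed in range(10):
        rng = random.Random(seed)
        config = sample_config(rng)
        train, val, test = split_indices(1000, rng)
        print(seed, len(train), len(val), len(test), config)
```

Sampling the exponent rather than the value itself spreads draws evenly across orders of magnitude, which is the usual reason for writing a range as 10^x.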