Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fairness in Survival Analysis with Distributionally Robust Optimization

Authors: Shu Hu, George H. Chen

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate our sample-splitting DRO approach by using it to create fair versions of a diverse set of existing survival analysis models, including the classical Cox model (and its deep neural network variant DeepSurv), the discrete-time model DeepHit, and the neural ODE model SODEN. We also establish a finite-sample theoretical guarantee to show what our sample-splitting DRO loss converges to. Specifically for the Cox model, we further derive an exact DRO approach that does not use sample splitting. For all the survival models that we convert into DRO variants, we show that the DRO variants often score better on recently established fairness metrics (without incurring a significant drop in accuracy) compared to existing survival analysis fairness regularization techniques, including ones which directly use sensitive demographic information in their training loss functions. We conduct experiments to compare DRO variants of Cox, DeepHit, and SODEN models to their original non-DRO variants as well as to variants of these models that encourage fairness using non-DRO baseline regularization strategies."
Researcher Affiliation | Academia | "Shu Hu (EMAIL), Department of Computer and Information Technology, Purdue University, Indianapolis, IN 46202, USA; George H. Chen (EMAIL), Heinz College of Information Systems and Public Policy, Carnegie Mellon University, Pittsburgh, PA 15213, USA"
Pseudocode | Yes | "The pseudocode can be found in Algorithm 1. ... We provide the pseudocode in Algorithm 2."
Open Source Code | Yes | "Our code is available at: https://github.com/discovershu/DRO_survival."
Open Datasets | Yes | "The FLC dataset (Dispenzieri et al., 2012) is from a study on the relationship between serum free light chain (FLC) and mortality of Olmsted County residents aged 50 or higher. ... The SUPPORT dataset (Knaus et al., 1995) is from a study at Vanderbilt University on understanding prognoses, preferences, outcomes, and risks of treatment by analyzing survival times of severely ill hospitalized patients. ... The SEER dataset is on breast cancer patients from the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute. ... We used 11 covariates that also appear in an existing snapshot of the SEER dataset (Teng, 2019)."
Dataset Splits | Yes | "For all models, we first use a random 80%/20% train/test split to hold out a test set that will be the same across experimental repeats for all datasets. Then we repeat the following basic experiment 10 times: (1) We hold out 20% of the training data to treat as a validation set, which is used to tune hyperparameters."
Hardware Specification | Yes | "All models are implemented with Python 3.8.3, and they are trained and tested on identical compute instances, each with an Intel Core i9-10900K CPU (3.70 GHz with 64 GB RAM) and a Quadro RTX 4000 GPU."
Software Dependencies | Yes | "All models are implemented with Python 3.8.3, and they are trained and tested on identical compute instances, each with an Intel Core i9-10900K CPU (3.70 GHz with 64 GB RAM) and a Quadro RTX 4000 GPU. ... All models (linear and nonlinear) are trained using Adam (Kingma and Ba, 2014) in PyTorch 1.7.1 in a batch setting for 500 iterations (except in the case of the exact DRO Cox model on the FLC dataset, where we use 5000 iterations as it took more iterations for the model to converge), only using a CPU and no GPU."
Experiment Setup | Yes | "For all models, we first use a random 80%/20% train/test split to hold out a test set that will be the same across experimental repeats for all datasets. Then we repeat the following basic experiment 10 times: (1) We hold out 20% of the training data to treat as a validation set, which is used to tune hyperparameters. ... More hyperparameter settings can be found in Appendix G. ... for nonlinear Cox models, we always use a two-layer MLP with ReLU as the activation function and 24 as the number of hidden units. All models (linear and nonlinear) are trained using Adam (Kingma and Ba, 2014) in PyTorch 1.7.1 in a batch setting for 500 iterations ... To find the optimal learning rate for each Cox model, we conducted a sweep over values of 0.01, 0.001, and 0.0001. ... DeepHit models: we use a three-layer MLP with ReLU activation, batch normalization, and dropout (rate 0.1). The number of hidden units is 32. ... SODEN models: for the FLC dataset, we use an MLP with 4 layers and 16 hidden units. For the SUPPORT and SEER datasets, we use an MLP with 2 layers and 26 hidden units. In addition, RMSprop (Tieleman et al., 2012) with a batch size of 128 and a maximum of 100 epochs is used to train all models."
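The split protocol and learning-rate sweep quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code (their implementation is in the linked GitHub repository): the function names, the seeding scheme, and the squared-error stand-in loss are our own assumptions, since the paper trains with its DRO survival losses rather than MSE.

```python
import numpy as np
import torch
import torch.nn as nn

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0, n_repeats=10):
    """Sketch of the reported protocol: one fixed 80%/20% train/test split,
    then, per experimental repeat, 20% of the training pool held out as a
    validation set. Seeding scheme is our own choice for reproducibility."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(test_frac * n))
    test_idx, train_pool = perm[:n_test], perm[n_test:]
    repeats = []
    for r in range(n_repeats):
        p = np.random.default_rng(seed + 1 + r).permutation(train_pool)
        n_val = int(round(val_frac * len(train_pool)))
        repeats.append({"val": p[:n_val], "train": p[n_val:]})
    return test_idx, repeats

def make_cox_mlp(d_in, hidden=24):
    """Two-layer ReLU MLP with 24 hidden units, as described for the
    nonlinear Cox models."""
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def sweep_learning_rates(X, y, lrs=(0.01, 0.001, 0.0001), iters=500):
    """Sweep over the learning rates reported for the Cox models using Adam,
    keeping the rate with the lowest final training loss. MSE is a placeholder
    for the paper's actual (DRO survival) objective."""
    best_lr, best_loss = None, float("inf")
    for lr in lrs:
        torch.manual_seed(0)  # same initialization for every candidate rate
        model = make_cox_mlp(X.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            loss = ((model(X).squeeze(-1) - y) ** 2).mean()
            loss.backward()
            opt.step()
        with torch.no_grad():
            final = ((model(X).squeeze(-1) - y) ** 2).mean().item()
        if final < best_loss:
            best_lr, best_loss = lr, final
    return best_lr, best_loss
```

In practice the selection criterion would be a fairness/accuracy metric on the held-out validation set from `split_indices`, not the training loss used here for brevity.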