Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Effective Ways to Build and Evaluate Individual Survival Distributions

Authors: Humza Haider, Bret Hoehn, Sarah Davis, Russell Greiner

JMLR 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This paper first motivates such individual survival distribution (ISD) models, and explains how they differ from standard models. It then discusses ways to evaluate such models (namely Concordance, 1-Calibration, the Integrated Brier score, and versions of L1-loss), then motivates and defines a novel approach, D-Calibration, which determines whether a model's probability estimates are meaningful. We also discuss how these measures differ, and use them to evaluate several ISD prediction tools over a range of survival data sets. We also provide a code base for all of these survival models and evaluation measures, at https://github.com/haiderstats/ISDEvaluation."
Researcher Affiliation | Academia | "Humza Haider, Bret Hoehn, Sarah Davis, Russell Greiner; Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8"
Pseudocode | Yes | "Algorithm 1 summarizes the 1-Calibration algorithm. Algorithm 2 summarizes the D-Calibration process."
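The D-Calibration idea behind Algorithm 2 is that the model's predicted survival probability at each subject's own event time, S_i(t_i), should be uniformly distributed on [0, 1]. A minimal, uncensored-only sketch in Python (the function name and interface here are ours, not the paper's; the paper's Algorithm 2 additionally redistributes the probability mass of censored subjects across bins):

```python
import numpy as np

def d_calibration_counts(event_times, survival_fn, n_bins=10):
    # For each (uncensored) subject i, evaluate the model's predicted
    # survival probability at that subject's own event time, S_i(t_i),
    # then bin these probabilities into n_bins equal-width bins on [0, 1].
    # For a D-calibrated model the bin counts are roughly uniform; a
    # Pearson chi-squared test against uniformity then yields a p-value.
    probs = np.array([survival_fn(i, t) for i, t in enumerate(event_times)])
    # clip so a probability of exactly 1.0 lands in the top bin
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins, minlength=n_bins)
```

For example, a model whose predicted probabilities S_i(t_i) spread evenly over [0, 1] produces one count per bin, i.e. a perfectly D-calibrated histogram.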
Open Source Code | Yes | "We also provide a code base for all of these survival models and evaluation measures, at https://github.com/haiderstats/ISDEvaluation. All code used in this analysis is publicly available on the GitHub account of the lead author."
Open Datasets | Yes | "There are many different survival data sets; here, we selected 8 publicly available medical data sets in order to cover a wide range of sample sizes, numbers of features, and proportions of censored patients. We excluded small data sets (with fewer than 150 instances) to reduce the variance in the evaluation metrics. Our data sets ranged from 170 to 2402 patients, from 12 to 7401 features, and from 17.23% to 86.21% censoring; see Table 5. Note that we have not included extremely high-dimensional data (with tens of thousands of features, often found in genomic data sets), as such data raises additional challenges beyond the scope of standard survival analysis; see Witten and Tibshirani (2010) and Kumar and Greiner (2019) for methods to handle such extremely high-dimensional data. The Northern Alberta Cancer Dataset (NACD), with 2402 patients and 53 features, is a conglomerate of many different cancer patients, including lung, colorectal, head and neck, esophageal, stomach, and other cancers. In addition to using the complete NACD data set, we considered the subset of 950 patients with colorectal cancer (NACD-Col) with the same 53 features. Another four data sets were retrieved from data generated by The Cancer Genome Atlas (TCGA) Research Network (Genome Data Analysis Center, 2016): Glioblastoma multiforme (GBM; 592 patients, 12 features), Glioma (GLI; 1105 patients, 13 features), Rectum adenocarcinoma (READ; 170 patients, 18 features), and Breast invasive carcinoma (BRCA; 1095 patients, 61 features). To ensure a variety of feature/sample-size ratios, we consider only the clinical features in our experiments.
Lastly, we included two high-dimensional data sets: the Dutch Breast Cancer Dataset (DBCD; van Houwelingen et al., 2006) contains 4919 microarray gene expression levels for 295 women with breast cancer, and the Diffuse Large B-Cell Lymphoma (DLBCL; Li et al., 2016) data set contains 7401 features focusing on Lymphochip DNA microarrays for 240 biopsy samples.
Dataset Splits | Yes | "Following feature selection, the data was partitioned into 5 disjoint folds by first sorting the instances by time and censorship, then placing each censored (resp., uncensored) instance sequentially into the folds, so that all folds had roughly the same distribution of times and censoring. The values of each feature were then normalized (transformed to zero mean and unit variance) within each fold. For coxen-kp, rsf-km, and mtlr, we used an internal 5CV for hyper-parameter selection."
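The fold-assignment scheme described in that excerpt can be sketched as follows (a Python sketch under our reading of the procedure, not the authors' R code; `events` is 1 for an observed event and 0 for censoring):

```python
def stratified_survival_folds(times, events, n_folds=5):
    # Sort instances by event status and then by time, and deal the
    # censored and uncensored instances round-robin into the folds,
    # so every fold gets a similar spread of times and censoring.
    order = sorted(range(len(times)), key=lambda i: (events[i], times[i]))
    folds = [0] * len(times)
    counters = {0: 0, 1: 0}  # separate round-robin counter per event status
    for i in order:
        folds[i] = counters[events[i]] % n_folds
        counters[events[i]] += 1
    return folds
```

Dealing each status group round-robin over the time-sorted order is what keeps both the censoring proportion and the time distribution roughly equal across folds.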
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions "Empirical evaluations were completed in R version 3.4.4.", which refers to software.
Software Dependencies | Yes | "Empirical evaluations were completed in R version 3.4.4. The implementations of km, aft, and cox-kp can all be found in the survival package (Therneau, 2015), whereas coxen-kp uses the cocktail function found in the fastcox package (Yang and Zou, 2017). Both rsf and rsf-km come from the randomForestSRC package (Ishwaran and Kogalur, 2018). An implementation of mtlr can be found in the MTLR package"
Experiment Setup | Yes | "For coxen-kp, rsf-km, and mtlr, we used an internal 5CV for hyper-parameter selection. There were no hyper-parameters to tune for the remaining models: cox, km, and aft. As 1-Calibration required specific time points, and as models might perform well on some survival times but poorly on others, we chose five times to assess the calibration results of each model: the 10th, 25th, 50th, 75th, and 90th percentiles of survival times for each data set."
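Under that setup, the five 1-Calibration evaluation times are simply percentiles of each data set's observed survival times; a NumPy sketch (the function name is ours):

```python
import numpy as np

def calibration_time_points(survival_times):
    # The five time points at which 1-Calibration is assessed:
    # the 10th, 25th, 50th, 75th, and 90th percentiles of the
    # survival times observed in the data set.
    return np.percentile(survival_times, [10, 25, 50, 75, 90])
```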