Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

KSP: Kolmogorov-Smirnov metric-based Post-Hoc Calibration for Survival Analysis

Authors: Jeongho Park, Daheen Kim, Cheoljun Kim, Hyungbin Park, Sangwook Kang, Gwangsu Kim

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on diverse real-world datasets using a variety of survival models. Empirical results demonstrate that our method consistently improves calibration performance over existing methods while maintaining high predictive accuracy. We also provide a theoretical analysis of the KS metric and discuss extensions to in-processing settings.
Researcher Affiliation	Academia	(1) Department of Statistics and Data Science, Yonsei University, Seoul, Republic of Korea; (2) Department of Statistics, Jeonbuk National University, Jeonju, Republic of Korea; (3) Research Institute for Materials and Energy Sciences, Jeonbuk National University
Pseudocode	Yes	We propose KS-cal based post-processing (KSP), similar to Platt scaling (Platt, 1999) in classification tasks. We simply transform the original $\hat{F}_\theta$ into a modified $\hat{F}'_\theta \in [0, 1]$ that minimizes the KS-cal. The procedure is summarized in the following algorithm. Algorithm: KSP. 1: Input: Estimated CDFs $\hat{F}_\theta$, strictly monotone increasing link function $G : [0, 1] \to (-\infty, \infty)$. 2: Initialize parameters $a$ ($> 0$), $b$, $\alpha$ ($> 0$). 3: Sort $\hat{F}_\theta$ for computational efficiency. 4: while KS-cal not improved do 5: Compute transformed CDF: $\hat{F}'_\theta = \{ G^{-1}( a\,G(\hat{F}_\theta) + b ) \}^{\alpha}$. 6: Compute KS-cal on validation set: $\max_{1 \le j \le N} D'_j$, where $D'_j$ denotes $D_j$ evaluated using $\hat{F}'_\theta$. 7: Update $(a, b, \alpha)$ via gradient descent (ADAM) to minimize the KS-cal. 8: end while 9: Apply final calibrated transformation to the test set using optimized $(a, b, \alpha)$. 10: Output: Calibrated CDF $\hat{F}'_\theta$.
Open Source Code	Yes	The source code for reproducing our experiments is available at: https://github.com/wjdgh4325/KS-cal
Open Datasets	Yes	We evaluate all methods on ten benchmark datasets: WHAS, METABRIC, GBSG, NACD, NB-SEQ, SUPPORT, MIMIC-III, SEER-liver, SEER-stomach, and SEER-lung. These are grouped by sample size into three categories: Small (WHAS, METABRIC, GBSG, NACD), Medium (NB-SEQ, SUPPORT, MIMIC-III), and Large (SEER-liver, SEER-stomach, SEER-lung). Details on the datasets and preprocessing steps are provided in Appendix D.
Dataset Splits	Yes	Each dataset is randomly split into training, validation, and test sets in a 3:1:1 ratio, with balanced censoring rates. All experiments are repeated 30 times with different random seeds. For CSD and CSD-iPOT, we use the validation set as the conformal set to enable a fair comparison with KSP.
Hardware Specification	Yes	Experiments are conducted with an Intel Xeon Silver 6226R CPU and an NVIDIA GeForce RTX 3090 GPU.
Software Dependencies	No	The paper mentions using publicly available code repositories under the MIT License for X-cal, CSD, and CSD-iPOT. It also mentions models like DeepSurv, MTLR, Parametric, CRPS, DeepHit, and AFT, and the ADAM optimizer. However, it does not specify version numbers for any key software components like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or specific model implementations.
Experiment Setup	Yes	Table 30: Set-ups used in experiments. Columns: Dataset, Model, Batch size, Learning rate, Maximum epoch, Dropout rate, Hidden layer, Number of bins... All models were optimized using the ADAM optimizer with early stopping (patience = 200).
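
The KSP procedure quoted in the Pseudocode row can be illustrated with a minimal sketch. This is not the authors' implementation (that lives in the linked repository), and it rests on several assumptions. The link function G is taken to be the logit. Censoring is ignored, so the KS-cal reduces to the ordinary Kolmogorov-Smirnov distance between the predicted CDF values at the observed event times and the Uniform(0, 1) distribution, rather than the paper's D_j. The ADAM gradient step is swapped for a simple derivative-free coordinate search so that the example needs only the standard library. The function names (transform, ks_cal, fit_ksp) and the synthetic data are hypothetical.

```python
import math
import random

def logit(p, eps=1e-6):
    # Assumed link function G; inputs are clipped away from 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    # Inverse of the logit, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def transform(p, a, b, alpha):
    # Step 5 of the quoted algorithm: {G^{-1}(a * G(p) + b)}^alpha.
    return sigmoid(a * logit(p) + b) ** alpha

def ks_cal(values):
    # KS distance between the empirical CDF of `values` and Uniform(0, 1).
    # Assumption: uncensored data only.
    u = sorted(values)
    n = len(u)
    return max(max((j + 1) / n - x, x - j / n) for j, x in enumerate(u))

def fit_ksp(val_values, n_rounds=60, step=0.5, shrink=0.7):
    # Search over (log a, b, log alpha) so that a > 0 and alpha > 0 always hold.
    # Coordinate search stands in for the ADAM update of the original procedure.
    params = [0.0, 0.0, 0.0]  # identity transform: a = 1, b = 0, alpha = 1

    def loss(q):
        a, b, alpha = math.exp(q[0]), q[1], math.exp(q[2])
        return ks_cal([transform(p, a, b, alpha) for p in val_values])

    best = loss(params)
    for _ in range(n_rounds):
        improved = False
        for k in range(3):
            for delta in (step, -step):
                trial = list(params)
                trial[k] += delta
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step *= shrink  # refine the search once no move helps
    return math.exp(params[0]), params[1], math.exp(params[2]), best

# Synthetic, deliberately miscalibrated CDF values (squares of uniforms).
random.seed(0)
val = [random.random() ** 2 for _ in range(500)]
test = [random.random() ** 2 for _ in range(500)]

a, b, alpha, val_ks = fit_ksp(val)          # fit on the validation set
test_before = ks_cal(test)
test_after = ks_cal([transform(p, a, b, alpha) for p in test])  # apply to test
print(f"a={a:.3f} b={b:.3f} alpha={alpha:.3f}")
print(f"test KS before={test_before:.3f} after={test_after:.3f}")
```

As in the quoted algorithm, the parameters are fitted on the validation set only and then applied unchanged to the test set. Searching in log space for a and alpha is one simple way to respect the positivity constraints stated in step 2.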