Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Conformal Risk Training: End-to-End Optimization of Conformal Risk Control

Authors: Christopher Yeh, Nicolas Christianson, Adam Wierman, Yisong Yue

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we demonstrate empirically that fine-tuning models using conformal risk training leads to significant performance improvements while guaranteeing satisfaction of risk constraints. We present results for maximizing model specificity while controlling false negative rate on a tumor image segmentation task, as well as maximizing average profit while controlling tail-risk of losses in a battery storage operation task.
Researcher Affiliation	Academia	Christopher Yeh, Nicolas Christianson, Adam Wierman, Yisong Yue Department of Computing and Mathematical Sciences California Institute of Technology Pasadena, CA 91125 EMAIL
Pseudocode	Yes	Algorithm 1 (Post-hoc) Conformal OCE Risk Control (CORC) ... Algorithm 2 Conformal Risk Training ... Algorithm 3 Conformal Risk Control (CRC) for left-continuous, nondecreasing losses
Open Source Code	Yes	Code to reproduce our results are available on GitHub,² and additional experimental results and details are reported in Appendices A and D. ... ² https://github.com/chrisyeh96/conformal-risk-training
Open Datasets	Yes	We adopt the colonoscopy gut polyp image segmentation problem setup explored in [6, Section 3.1] and described in Example 1. We use a pre-trained PraNet [23] as our model fθ, and we split images from 4 public datasets (CVC-ClinicDB [13], CVC-ColonDB [12], ETIS-LaribPolypDB [46], Kvasir-SEG [28]) into training, calibration, and test splits. ... We use the same dataset as Donti et al. [19] in our battery storage problem.
Dataset Splits	Yes	We used 1,450 images from the CVC-ClinicDB and Kvasir-SEG datasets to form a training set; we used the same training split as [23]. From the remaining 738 images, we created 10 different val (400 images) / test (338 images) splits from different random seeds. ... We take a random 20% subset of the dataset as the test set; because the test set is selected randomly, it is considered exchangeable with the rest of the dataset. For each of 10 seeds, we further use a 65%/35% random split of the remaining data for training and calibration.
Hardware Specification	Yes	Our experiments were performed on two computers with the following hardware: (1) 2 AMD EPYC 7513 32-Core CPUs, 4 NVIDIA A100 GPUs (80 GiB GPU memory each), 1 TiB RAM; (2) Intel Core i9-12900KS CPU, 2 NVIDIA RTX A6000 GPUs (48 GiB GPU memory each), 125 GiB RAM
Software Dependencies	No	Our code (see supplementary ZIP file) is written in Python and primarily relies on the PyTorch [40] deep learning library. We use the Adam optimizer [30].
Experiment Setup	Yes	We tune the learning rate with a grid search over (10^-2, 10^-3, 10^-4, 10^-5, 10^-6). Models are trained with a batch size of 400 for up to 100 epochs with early-stopping after 10 epochs of no improvement on the val set. ... During pretraining, we tune the learning rate across (10^-4, 10^-3.5, 10^-3, 10^-2.5, 10^-2, 10^-1.5), and we use L2 weight decay strength of 10^-4. We pretrain for up to 500 epochs with early-stopping after 20 epochs of no improvement on the val set. ... During fine-tuning, we tune the learning rate with a grid search over (10^-2, 10^-3, 10^-4, 10^-5) and we keep the 10^-4 L2 weight decay. We fine-tune for up to 100 epochs with early-stopping after 10 epochs of no improvement on the val set. During fine-tuning, we use a 90%/10% weighted combination of the fine-tuning loss and the MSE loss.
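
The Dataset Splits row describes the battery storage split only in words: a random 20% test subset, then a 65%/35% train/calibration split of the remainder for each of 10 seeds. A minimal sketch of that procedure might look like the following. It is not the authors' code. The function name `split_indices`, the `test_seed` parameter, and the dataset size of 1000 are hypothetical, and holding the test subset fixed across seeds is an assumption, since the quoted text does not say whether it is redrawn.

```python
import random


def split_indices(n, seed, test_frac=0.20, train_frac=0.65, test_seed=0):
    """Return (train, calib, test) index lists for a dataset of n examples.

    Hypothetical helper, not taken from the paper's repository.
    """
    idx = list(range(n))

    # Random 20% test subset. Assumption: drawn once with a fixed seed
    # and shared by every train/calibration seed.
    random.Random(test_seed).shuffle(idx)
    n_test = round(test_frac * n)
    test, rest = idx[:n_test], idx[n_test:]

    # Per-seed 65%/35% random split of the remaining data.
    random.Random(seed).shuffle(rest)
    n_train = round(train_frac * len(rest))
    train, calib = rest[:n_train], rest[n_train:]
    return train, calib, test


# One split per seed; n=1000 is an illustrative size only.
splits = {seed: split_indices(1000, seed) for seed in range(10)}
```

With these illustrative numbers, every seed yields 520 training, 280 calibration, and 200 test indices, and the three sets never overlap.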