Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Selective Omniprediction and Fair Abstention

Authors: Sílvia Casacuberta, Varun Kanade

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The main results and the focus of our work is theoretical. We do however also provide an empirical evaluation on synthetic data that shows that our algorithms are easy to implement and achieve the desired outcomes.
Researcher Affiliation	Academia	Sílvia Casacuberta Department of Computer Science Stanford University EMAIL Varun Kanade Department of Computer Science University of Oxford EMAIL
Pseudocode	No	The paper describes various algorithms and methods, such as the multicalibration algorithm and post-processing functions, but it does not present them in structured pseudocode or algorithm blocks. The steps are explained in narrative text within the main body and appendices.
Open Source Code	Yes	The code for this paper can be found at https://github.com/silviacasac/learning-to-abstain.
Open Datasets	No	We generate synthetic data to create a binary classification problem with n = 10,000 samples and implement the multicalibration algorithm to baseline predictions to obtain a C-multicalibrated predictor h... We generate 5,000 data samples synthetically using sk-learn's make_blobs function
Dataset Splits	Yes	The multicalibration algorithm is run on the validation set (20% of the data) and we then report all of our statistics on the test set (20% of the data).
Hardware Specification	Yes	These experiments were conducted locally using a system equipped with an M1 chip and 16 GB of local memory.
Software Dependencies	No	We used ChatGPT to help debug the code and implement the abstaining decision trees, and we studied the multicalibration code provided in the Python package from the paper [29] to aid us with our implementation (which we did from scratch, given that the implementation in [29] finds correlation with the residuals using the Boolean groups g in G, whereas we want to use the real-valued concepts c in the concept class C).
Experiment Setup	Yes	For the multicalibration algorithm, we use a discretization parameter of 0.1, a learning rate of 0.01, and 200 maximum iterations. The multicalibration algorithm is run on the validation set (20% of the data) and we then report all of our statistics on the test set (20% of the data).
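
The setup reported in the table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is at the repository linked above). The source states only the 5,000 make_blobs samples, the 20% validation and 20% test splits, the discretization parameter of 0.1, the learning rate of 0.01, and the 200 maximum iterations. The random seed, the two blob centers, the 60% training remainder, the stopping tolerance `tol`, and the function names `make_splits`, `fit_multicalibration`, and `apply_multicalibration` are all hypothetical, and the update rule is a generic multicalibration-style patching loop over real-valued concepts.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

SEED = 0  # assumption: the source reports no random seed


def make_splits(n_samples=5000, seed=SEED):
    """Generate two-blob binary data and split it 60/20/20 (train/val/test)."""
    # Assumption: two centers, giving a binary label.
    X, y = make_blobs(n_samples=n_samples, centers=2, random_state=seed)
    # Hold out 40%, then halve the holdout into validation and test.
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=0.4, random_state=seed, stratify=y)
    X_va, X_te, y_va, y_te = train_test_split(
        X_ho, y_ho, test_size=0.5, random_state=seed, stratify=y_ho)
    return (X_tr, y_tr), (X_va, y_va), (X_te, y_te)


def _bins(p, bin_width):
    """Map predictions in [0, 1] to level-set indices of width bin_width."""
    n_bins = int(round(1.0 / bin_width))
    return np.minimum((p / bin_width).astype(int), n_bins - 1)


def fit_multicalibration(p, y, C, bin_width=0.1, lr=0.01,
                         max_iter=200, tol=1e-3):
    """Return a list of (bin, concept index, sign) patching steps.

    p: baseline predictions in [0, 1], shape (n,)
    y: binary labels, shape (n,)
    C: real-valued concept outputs c(x), shape (n, k)
    tol is an assumed stopping threshold, not taken from the source.
    """
    p = np.asarray(p, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    steps = []
    for _ in range(max_iter):
        bins = _bins(p, bin_width)
        best_val, best_step = tol, None
        for b in np.unique(bins):
            mask = bins == b
            # Correlation of each concept with the residual on this level set.
            corr = C[mask].T @ (y[mask] - p[mask]) / len(y)
            j = int(np.argmax(np.abs(corr)))
            if abs(corr[j]) > best_val:
                best_val = abs(corr[j])
                best_step = (int(b), j, float(np.sign(corr[j])))
        if best_step is None:  # no violation above tol remains
            break
        b, j, s = best_step
        mask = bins == b
        p[mask] = np.clip(p[mask] + lr * s * C[mask, j], 0.0, 1.0)
        steps.append(best_step)
    return steps


def apply_multicalibration(p, C, steps, bin_width=0.1, lr=0.01):
    """Replay the recorded patching steps on new predictions."""
    p = np.asarray(p, dtype=float).copy()
    for b, j, s in steps:
        mask = _bins(p, bin_width) == b
        p[mask] = np.clip(p[mask] + lr * s * C[mask, j], 0.0, 1.0)
    return p
```

In this sketch the patching steps would be fitted on the validation split and then replayed on the test split, mirroring the table's statement that the algorithm is run on the validation set and statistics are reported on the test set.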