Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Gatekeeper: Improving Model Cascades Through Confidence Tuning

Authors: Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
Researcher Affiliation	Collaboration	Stephan Rabanser (Princeton University), Nathalie Rauschmayr (Google)
Pseudocode	No	The paper describes the GATEKEEPER loss function using mathematical formulas and outlines practical computation steps but does not present a formal pseudocode block or algorithm.
Open Source Code	No	Answer: [No] Justification: We are not providing code for this work.
Open Datasets	Yes	We train both a large model and a small model on the following datasets: CIFAR-10/100 (Krizhevsky et al., 2009), Food-101 (Bossard et al., 2014), and TinyImageNet200 (Le & Yang, 2015). The datasets used are ARC-e/c (Clark et al., 2018), MMLU (Hendrycks et al., 2020), and GSM8K (Cobbe et al., 2021). The datasets we consider are two classification datasets (VQAv2 (Goyal et al., 2017), AI2D (Hiippala et al., 2021)) and two captioning datasets (Cococap (Lin et al., 2014), Screen2Words (Wang et al., 2021)).
Dataset Splits	Yes	Our experiments begin by taking the instruction-tuned checkpoints of Gemma2B and Gemma7B and fine-tuning both models on the training split of each dataset to ensure that the model (i) performs well on the task and (ii) is familiar with the desired response format. This step is performed using standard supervised fine-tuning. Next, we fine-tune M_S with GATEKEEPER on the same training split to reduce confidence on incorrect next-token predictions. Finally, we evaluate the model trained with GATEKEEPER on a validation split.
Hardware Specification	No	The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. It refers to 'large models' and 'small models' and 'LMs' or 'VLMs' without hardware specifics.
Software Dependencies	No	The paper includes a code snippet in Appendix C.1 which uses PyTorch-like components (Conv2d, BatchNorm2d, ReLU, MaxPool2d, Linear), but no specific version numbers for PyTorch or any other software dependencies are provided.
Experiment Setup	Yes	In its canonical form, GATEKEEPER is defined as a hybrid loss L = α L_corr + (1 − α) L_incorr... Here, y_i and ŷ_i are the true and predicted labels for x_i, respectively, p_i is the predicted probability distribution of M_S over classes, U represents the uniform distribution over all classes, N denotes the number of samples in the current batch, α ∈ (0, 1) is a tunable hyperparameter controlling the emphasis between correct and incorrect predictions... We report performance for both a baseline model (an instance of M_S not trained with GATEKEEPER) and small models trained with GATEKEEPER at various α values.
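
The Experiment Setup row quotes the hybrid loss but elides the definitions of its two terms. The sketch below shows one way such a loss could be computed for a single batch, using only the Python standard library. It rests on assumptions that the excerpt does not confirm: the correct-prediction term is taken to be the cross-entropy against the true label, and the incorrect-prediction term is taken to be the KL divergence from the predicted distribution to the uniform distribution U. The names `softmax` and `gatekeeper_loss` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gatekeeper_loss(batch_logits, labels, alpha):
    """Sketch of L = alpha * L_corr + (1 - alpha) * L_incorr for one batch.

    batch_logits: list of per-sample logit lists (hypothetical input format)
    labels:       list of true class indices y_i
    alpha:        trade-off hyperparameter in the open interval (0, 1)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")

    n = len(labels)  # N: number of samples in the current batch
    l_corr = 0.0
    l_incorr = 0.0

    for logits, y in zip(batch_logits, labels):
        p = softmax(logits)  # p_i: predicted distribution over classes
        y_hat = max(range(len(p)), key=p.__getitem__)  # predicted label

        if y_hat == y:
            # ASSUMPTION: correct predictions are trained with cross-entropy.
            l_corr += -math.log(p[y])
        else:
            # ASSUMPTION: incorrect predictions are pushed toward uniform via
            # KL(p || U) = sum_k p_k * log(p_k * K), with K classes.
            k = len(p)
            l_incorr += sum(pk * math.log(pk * k) for pk in p if pk > 0.0)

    # Both terms are averaged over the full batch size N.
    l_corr /= n
    l_incorr /= n
    return alpha * l_corr + (1.0 - alpha) * l_incorr
```

Under these assumptions, a larger α puts more weight on fitting the samples the small model already classifies correctly, while a smaller α puts more weight on flattening its confidence on the samples it gets wrong, which matches the row's description of α as controlling the emphasis between correct and incorrect predictions.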