Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Cascaded Language Models for Cost-Effective Human–AI Decision-Making

Authors: Claudio Fanconi, Mihaela van der Schaar

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate this approach to general question-answering (ARC-Easy, ARC-Challenge, and MMLU) and medical question-answering (MedQA and MedMCQA). Our results demonstrate that our cascaded strategy outperforms single-model baselines in most cases, achieving higher accuracy while reducing costs and providing a principled approach to handling abstentions.
Researcher Affiliation	Academia	Claudio Fanconi (University of Cambridge), Mihaela van der Schaar (University of Cambridge)
Pseudocode	No	The paper describes the method and decision flow using natural language and mathematical equations (e.g., Equation 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) and figures (Figure 1, Figure 2) but does not present any formal pseudocode or algorithm blocks.
Open Source Code	Yes	We provide the code for our experiments at https://github.com/fanconic/cascaded-llms
Open Datasets	Yes	To evaluate the generalisability of our framework across domains, we use five question-answering datasets: (1) ARC-Easy and (2) ARC-Challenge [Clark et al., 2018], which are part of the AI2 Reasoning Challenge and require reasoning over grade-school science; (3) Massive Multitask Language Understanding (MMLU) benchmark [Hendrycks et al., 2021], which covers 57 subjects ranging from complex STEM to international law, nutrition, and religion; and two medical QA benchmarks: (4) MedQA [Jin et al., 2020], consisting of US medical board exam questions, and (5) MedMCQA [Pal et al., 2022], comprising entrance exam questions from the Indian medical school curriculum.
Dataset Splits	Yes	We fit a Bayesian logistic regression model on a small calibration set of 100 samples.
Hardware Specification	Yes	Compute. Experiments are conducted on a single A100-class GPU.
Software Dependencies	No	All experiments are implemented in Python [Van Rossum and Drake Jr, 1995] with PyTorch [Paszke et al., 2017] and Hugging Face Transformers [Wolf et al., 2020].
Experiment Setup	Yes	The cost proportion between input and output tokens is set to ρ = 5, consistent with Anthropic's current pricing to date [Anthropic, 2025]. ... We initialise the thresholds at θ(0) = {0.5, 0.05, 0.05}, where ξi = 0.05 corresponds to the standard deviation of 5% confidence. ... To avoid trivial solutions (e.g., always selecting one model), we balance system risk using λc = 10^-5 and λa = 0.1, in line with Zellinger et al. [2025].
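
For readers who want the reported settings in one place, the values quoted in the table can be collected as in the sketch below. This is not taken from the authors' repository: the names `ExperimentSetup` and `relative_token_cost` are hypothetical, and the cost function assumes that ρ is the price of one output token relative to one input token, which the quoted text suggests but does not spell out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the table above; field names are illustrative only."""
    rho: float = 5.0                      # cost proportion between input and output tokens
    theta_init: Tuple[float, float, float] = (0.5, 0.05, 0.05)  # initial thresholds, theta(0)
    lambda_c: float = 1e-5                # lambda_c as reported
    lambda_a: float = 0.1                 # lambda_a as reported
    calibration_samples: int = 100        # size of the calibration set


def relative_token_cost(n_input: int, n_output: int, rho: float = 5.0) -> float:
    """Cost in units of one input token.

    Assumption: an output token costs rho times as much as an input token.
    """
    return n_input + rho * n_output


setup = ExperimentSetup()
# Example: a prompt of 100 input tokens with a 20-token answer.
cost = relative_token_cost(100, 20, setup.rho)
```

Under the stated assumption, the example call counts the 20 output tokens as 100 input-token equivalents, so the answer costs as much as the prompt.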