Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Cascaded Language Models for Cost-Effective Human–AI Decision-Making
Authors: Claudio Fanconi, Mihaela van der Schaar
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this approach to general question-answering (ARC-Easy, ARC-Challenge, and MMLU) and medical question-answering (MedQA and MedMCQA). Our results demonstrate that our cascaded strategy outperforms single-model baselines in most cases, achieving higher accuracy while reducing costs and providing a principled approach to handling abstentions. |
| Researcher Affiliation | Academia | Claudio Fanconi University of Cambridge EMAIL Mihaela van der Schaar University of Cambridge EMAIL |
| Pseudocode | No | The paper describes the method and decision flow using natural language and mathematical equations (e.g., Equation 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) and figures (Figure 1, Figure 2) but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide the code for our experiments at https://github.com/fanconic/cascaded-llms |
| Open Datasets | Yes | To evaluate the generalisability of our framework across domains, we use five question-answering datasets: (1) ARC-Easy and (2) ARC-Challenge [Clark et al., 2018], which are part of the AI2 Reasoning Challenge and require reasoning over grade-school science; (3) Massive Multitask Language Understanding (MMLU) benchmark [Hendrycks et al., 2021], which covers 57 subjects ranging from complex STEM to international law, nutrition, and religion; and two medical QA benchmarks: (4) MedQA [Jin et al., 2020], consisting of US medical board exam questions, and (5) MedMCQA [Pal et al., 2022], comprising entrance exam questions from the Indian medical school curriculum. |
| Dataset Splits | Yes | We fit a Bayesian logistic regression model on a small calibration set of 100 samples. |
| Hardware Specification | Yes | Compute. Experiments are conducted on a single A100-class GPU. |
| Software Dependencies | No | All experiments are implemented in Python [Van Rossum and Drake Jr, 1995] with PyTorch [Paszke et al., 2017] and Hugging Face Transformers [Wolf et al., 2020]. |
| Experiment Setup | Yes | The cost proportion between input and output tokens is set to ρ = 5, consistent with Anthropic's current pricing to date [Anthropic, 2025]. ... We initialise the thresholds at θ(0) = {0.5, 0.05, 0.05}, where ξi = 0.05 corresponds to the standard deviation of 5% confidence. ... To avoid trivial solutions (e.g., always selecting one model), we balance system risk using λc = 10^-5 and λa = 0.1, in line with Zellinger et al. [2025]. |
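
The setup values quoted in the Experiment Setup row can be gathered into a small configuration object, as in the minimal sketch below. It is illustrative only and is not taken from the authors' repository: the names `SetupSketch` and `token_cost` are hypothetical, the reading of ρ as "one output token costs ρ times one input token" is an assumption, and so is the reading of λc and λa as weights on cost and abstention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupSketch:
    """Hypothetical container for the values quoted in the table."""

    rho: float = 5.0                       # cost proportion between input and output tokens
    theta_init: tuple = (0.5, 0.05, 0.05)  # initial thresholds theta(0), as quoted
    lambda_c: float = 1e-5                 # assumed: weight on the cost term
    lambda_a: float = 0.1                  # assumed: weight on the abstention term


def token_cost(n_input: int, n_output: int, rho: float = 5.0) -> float:
    """Cost in input-token-equivalent units.

    Assumes an output token is rho times as expensive as an input token.
    """
    return n_input + rho * n_output


cfg = SetupSketch()
# A call with 100 input tokens and 20 output tokens under rho = 5.
example_cost = token_cost(100, 20, rho=cfg.rho)
```

Expressing cost in input-token-equivalent units keeps the sketch independent of any specific price list; only the ratio ρ matters.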