Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models

Authors: William Overman, Mohsen Bayati

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, Conformal Arbitrage traces an efficient frontier, allowing users to define an acceptable performance level for one objective while maximizing utility in another. We observe that our method outperforms (in terms of accuracy on multiple-choice style questions) cost-matched random routing between models. Our experiments study (i) the cost-accuracy trade-off on TruthfulQA and MMLU, and (ii) the helpfulness-harmlessness trade-off on PKU-SafeRLHF. All three benchmarks are multiple-choice settings in which the model is prompted to select from a fixed set of options.
Researcher Affiliation	Academia	William Overman, Stanford Graduate School of Business, EMAIL; Mohsen Bayati, Stanford Graduate School of Business, EMAIL
Pseudocode	Yes	Algorithm 1 Conformal Arbitrage. Require: context x, relaxation parameter λ̂, primary model p, guardian model g. 1: Compute p(x, a) for all a ∈ A(x) 2: Let C_λ(x) = {a ∈ A(x) : p(x, a) ≥ max_{a′} p(x, a′) − λ̂} 3: if |C_λ(x)| = 1 then 4: return the unique element of C_λ(x) 5: else 6: Compute g(x, a) for all a ∈ C_λ(x) 7: return a* = arg max_{a ∈ C_λ(x)} G(a) 8: end if
Open Source Code	Yes	5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The Supplemental Material contains code for reproducing the main experimental results of the paper.
Open Datasets	Yes	We first study Conformal Arbitrage on the multiple-choice split of TRUTHFULQA (Lin et al., 2022), a benchmark designed to expose factual misconceptions in language models. The benchmark contains 684 questions, each paired with four answer choices and exactly one correct label. (Footnote 1: https://huggingface.co/datasets/EleutherAI/truthful_qa_mc) The PKU-SAFERLHF corpus contains 90k prompts, each paired with two distinct LLM responses. (Footnote 3: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) We next evaluate Conformal Arbitrage (CA) on the Massive Multitask Language Understanding benchmark (MMLU; Hendrycks et al., 2021). We load the public cais/mmlu distribution via datasets
Dataset Splits	Yes	Each trial draws n = 400 calibration and N = 284 test questions. Over 30 trials we draw 500/500 calibration/evaluation splits from the 3,552 prompts [...] For each trial we draw a fresh, balanced sample of Ntot = 1,000 questions, allocating n = 500 for calibration and the remaining 500 for evaluation.
Hardware Specification	No	The paper states that all model calls are made via APIs, so the experiments can be run on a standard CPU.
Software Dependencies	No	The paper thoroughly documents experimental settings in Section 5 including calibration size, evaluation protocols, specific prompts (in corresponding Appendices), model details (e.g., gpt-4.1-nano-2025-04-14), and hyperparameter search spaces (e.g., Λ = {0, 0.01, 0.02, . . . , 1.0}).
Experiment Setup	Yes	We use temperature=0.1, max_tokens=50; replies that fail JSON parsing default to uniform scores, maintaining exchangeability. We fit λ̂ via Eq. (3) on Λ = {0, 0.01, ..., 1} and repeat the calibration-evaluation loop 30 times with fresh random splits. Over 30 trials we draw 500/500 calibration/evaluation splits from the 3,552 prompts, tune λ̂ on Λ = {0, 0.0025, ..., 1}, and evaluate at risk budgets α ∈ {0.10, 0.20, ..., 0.60}.
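
The selection rule quoted in the Pseudocode row can be sketched in Python. This is a minimal illustration, not the authors' released code: the function name `conformal_arbitrage`, the dictionary score format, and the `guardian` callable are assumptions made for the example, and the threshold comparison (keep every action whose primary score is within λ̂ of the maximum) is read off the extracted pseudocode.

```python
from typing import Callable, Dict, Hashable, Iterable


def conformal_arbitrage(
    primary_scores: Dict[Hashable, float],
    guardian: Callable[[Iterable[Hashable]], Dict[Hashable, float]],
    lam_hat: float,
) -> Hashable:
    """Illustrative version of Algorithm 1 (names and types are hypothetical)."""
    # Steps 1-2: keep every action whose primary score is within lam_hat of the best.
    best = max(primary_scores.values())
    candidates = [a for a, s in primary_scores.items() if s >= best - lam_hat]
    # Steps 3-4: a single candidate is returned without calling the guardian.
    if len(candidates) == 1:
        return candidates[0]
    # Steps 6-7: score only the candidates with the guardian and take the argmax.
    guardian_scores = guardian(candidates)
    return max(candidates, key=lambda a: guardian_scores[a])


# Toy usage with made-up scores; `calls` records when the guardian is consulted.
primary = {"A": 0.50, "B": 0.45, "C": 0.05}
calls = []


def toy_guardian(actions):
    actions = list(actions)
    calls.append(actions)
    table = {"A": 0.2, "B": 0.9, "C": 0.7}
    return {a: table[a] for a in actions}


choice_strict = conformal_arbitrage(primary, toy_guardian, 0.0)   # "A", guardian not called
choice_relaxed = conformal_arbitrage(primary, toy_guardian, 0.1)  # "B", guardian breaks the tie
```

With λ̂ = 0 the rule falls back to the primary model's top choice; a larger λ̂ widens the candidate set and hands more decisions to the guardian, which is the trade-off the relaxation parameter controls.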
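
The split sizes in the Dataset Splits row (n = 400 calibration and N = 284 test questions out of 684) can be reproduced in spirit with a seeded shuffle. This is a sketch only: the helper name `draw_split`, the use of Python's `random` module, and seeding by trial index are assumptions, since the table does not say how the authors drew their splits.

```python
import random
from typing import List, Tuple


def draw_split(num_items: int, n_calibration: int, seed: int) -> Tuple[List[int], List[int]]:
    """Return disjoint (calibration, test) index lists from one random permutation."""
    rng = random.Random(seed)  # a per-trial seed keeps each split repeatable
    indices = list(range(num_items))
    rng.shuffle(indices)
    return indices[:n_calibration], indices[n_calibration:]


# 30 trials with fresh random splits, using the sizes quoted for the 684-question benchmark.
splits = [draw_split(684, 400, seed=trial) for trial in range(30)]
calibration_idx, test_idx = splits[0]
```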
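
Two details from the Experiment Setup row lend themselves to a short sketch: the fallback to uniform scores when a reply fails JSON parsing, and the search grids for λ̂ and the risk budgets α. The reply format assumed here (a JSON object mapping option labels to numeric scores) and the helper name `parse_scores` are hypothetical; the actual prompts and the calibration rule in Eq. (3) are in the paper and are not reproduced here.

```python
import json
from typing import Dict, List


def parse_scores(reply: str, options: List[str]) -> Dict[str, float]:
    """Parse per-option scores from a model reply; fall back to uniform scores on failure."""
    uniform = {o: 1.0 / len(options) for o in options}
    try:
        data = json.loads(reply)
        return {o: float(data[o]) for o in options}
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing options, or non-numeric values all get the same fallback.
        return uniform


# Search grids quoted in the table: step 0.01 and step 0.0025 on [0, 1].
grid_coarse = [i / 100 for i in range(101)]
grid_fine = [i / 400 for i in range(401)]
# Risk budgets alpha in {0.10, 0.20, ..., 0.60}.
alphas = [k / 10 for k in range(1, 7)]

good = parse_scores('{"A": 0.7, "B": 0.3}', ["A", "B"])
bad = parse_scores("not json", ["A", "B", "C", "D"])
```

Building the grids from integer counters avoids the rounding drift that repeated floating-point addition of the step size would introduce.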