Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models

Authors: William Overman, Mohsen Bayati

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, Conformal Arbitrage traces an efficient frontier, allowing users to define an acceptable performance level for one objective while maximizing utility in another. We observe that our method outperforms (in terms of accuracy on multiple-choice style questions) cost-matched random routing between models. Our experiments study (i) the cost-accuracy trade-off on TruthfulQA and MMLU, and (ii) the helpfulness-harmlessness trade-off on PKU-SafeRLHF. All three benchmarks are multiple-choice settings in which the model is prompted to select from a fixed set of options.
Researcher Affiliation | Academia | William Overman, Stanford Graduate School of Business; Mohsen Bayati, Stanford Graduate School of Business
Pseudocode | Yes | Algorithm 1 Conformal Arbitrage. Require: context x, relaxation parameter λ̂, primary model p, guardian model g. 1: Compute p(x, a) for all a ∈ A(x). 2: Let C_λ(x) = { a ∈ A(x) : p(x, a) ≥ max_{a'} p(x, a') − λ̂ }. 3: if |C_λ(x)| = 1 then 4: return the unique element of C_λ(x). 5: else 6: Compute g(x, a) for all a ∈ C_λ(x). 7: return a* = arg max_{a ∈ C_λ(x)} G(a). 8: end if
Open Source Code | Yes | 5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: The Supplemental Material contains code for reproducing the main experimental results of the paper.
Open Datasets | Yes | We first study Conformal Arbitrage on the multiple-choice split of TRUTHFULQA (Lin et al., 2022), a benchmark designed to expose factual misconceptions in language models. The benchmark contains 684 questions, each paired with four answer choices and exactly one correct label. (Footnote 1: https://huggingface.co/datasets/EleutherAI/truthful_qa_mc) The PKU-SAFERLHF corpus contains 90k prompts, each paired with two distinct LLM responses. (Footnote 3: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) We next evaluate Conformal Arbitrage (CA) on the Massive Multitask Language Understanding benchmark (MMLU; Hendrycks et al., 2021). We load the public cais/mmlu distribution via datasets.
Dataset Splits | Yes | Each trial draws n = 400 calibration and N = 284 test questions. Over 30 trials we draw 500/500 calibration/evaluation splits from the 3,552 prompts. For each trial we draw a fresh, balanced sample of Ntot = 1,000 questions, allocating n = 500 for calibration and the remaining 500 for evaluation.
Hardware Specification | No | The paper states that all model calls are made via APIs, so the experiments can be run on a standard CPU.
Software Dependencies | No | The paper thoroughly documents experimental settings in Section 5, including calibration size, evaluation protocols, specific prompts (in the corresponding Appendices), model details (e.g., gpt-4.1-nano-2025-04-14), and hyperparameter search spaces (e.g., Λ = {0, 0.01, 0.02, ..., 1.0}).
Experiment Setup | Yes | We use temperature=0.1, max_tokens=50; replies that fail JSON parsing default to uniform scores, maintaining exchangeability. We fit λ̂ via Eq. (3) on Λ = {0, 0.01, ..., 1} and repeat the calibration/evaluation loop 30 times with fresh random splits. Over 30 trials we draw 500/500 calibration/evaluation splits from the 3,552 prompts, tune λ̂ on Λ = {0, 0.0025, ..., 1}, and evaluate at risk budgets α ∈ {0.10, 0.20, ..., 0.60}.
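
The routing rule quoted in the Pseudocode row (Algorithm 1) can be illustrated with a minimal Python sketch. This is not the authors' released code: the function name `conformal_arbitrage`, the callable signatures for the primary and guardian scorers, and the reading of step 7 as maximizing the guardian score g(x, a) are all assumptions made for illustration.

```python
from typing import Callable, Hashable, List, Sequence

Action = Hashable
# Hypothetical scorer type: maps (context, action) to a real-valued score.
ScoreFn = Callable[[str, Action], float]


def conformal_arbitrage(
    x: str,
    actions: Sequence[Action],
    primary: ScoreFn,
    guardian: ScoreFn,
    lam_hat: float,
) -> Action:
    """Pick an action following the structure of Algorithm 1 (illustrative)."""
    if not actions:
        raise ValueError("actions must be non-empty")

    # Step 1: primary score for every available action.
    p_scores = {a: primary(x, a) for a in actions}
    best = max(p_scores.values())

    # Step 2: keep actions whose primary score is within lam_hat of the best.
    candidates: List[Action] = [a for a in actions if p_scores[a] >= best - lam_hat]

    # Steps 3-4: a single survivor is returned without calling the guardian.
    if len(candidates) == 1:
        return candidates[0]

    # Steps 6-7: otherwise the guardian chooses among the survivors
    # (assumed here to mean the highest guardian score).
    return max(candidates, key=lambda a: guardian(x, a))
```

With `lam_hat = 0` and a unique top primary score, the guardian is never consulted; larger values widen the candidate set and hand more decisions to the guardian.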
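
The Experiment Setup row notes that replies failing JSON parsing default to uniform scores. A minimal sketch of such a fallback is shown below. The reply format is not given in this summary, so the sketch assumes (hypothetically) that a valid reply is a JSON object mapping each option label to a numeric score; `parse_scores` is an illustrative name, not a function from the paper's code.

```python
import json
from typing import Dict, Sequence


def parse_scores(reply: str, options: Sequence[str]) -> Dict[str, float]:
    """Parse per-option scores from a model reply, falling back to uniform."""
    # Uniform fallback: every option gets the same score.
    uniform = {opt: 1.0 / len(options) for opt in options}
    try:
        data = json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        return uniform
    if not isinstance(data, dict):
        return uniform
    try:
        # Assumed format: {"A": 0.7, "B": 0.3, ...} with one entry per option.
        return {opt: float(data[opt]) for opt in options}
    except (KeyError, TypeError, ValueError):
        # Missing option or non-numeric value: treat the reply as unparseable.
        return uniform
```

Because the fallback assigns the same score to every option regardless of the question, it treats every failed reply identically, which is consistent with the row's remark about maintaining exchangeability.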
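
The calibration/evaluation protocol described in the Dataset Splits and Experiment Setup rows (a fresh random split per trial, then a search for λ̂ over a grid Λ) can be sketched as follows. Eq. (3) is not reproduced in this summary, so `select_lambda` uses the standard conformal risk control adjustment as a stand-in; that rule, the monotonicity assumption, and all function names are assumptions rather than the paper's exact procedure.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def make_grid(step: float = 0.01) -> List[float]:
    """Grid {0, step, ..., 1}, e.g. step=0.01 or step=0.0025."""
    n_steps = round(1.0 / step)
    return [i / n_steps for i in range(n_steps + 1)]


def split_indices(n_total: int, n_cal: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Fresh random calibration/test split of range(n_total)."""
    idx = list(range(n_total))
    rng.shuffle(idx)
    return idx[:n_cal], idx[n_cal:]


def select_lambda(
    losses_at: Callable[[float], Sequence[float]],
    grid: Sequence[float],
    alpha: float,
    loss_bound: float = 1.0,
) -> Optional[float]:
    """Largest grid value whose adjusted calibration risk stays within alpha.

    Assumed stand-in for Eq. (3): the adjusted risk is
    (sum of calibration losses + loss_bound) / (n + 1), and the risk is
    assumed non-decreasing in lambda, so the scan stops at the first violation.
    Returns None if even the smallest grid value exceeds the budget.
    """
    chosen: Optional[float] = None
    for lam in grid:
        losses = losses_at(lam)
        adjusted = (sum(losses) + loss_bound) / (len(losses) + 1)
        if adjusted > alpha:
            break
        chosen = lam
    return chosen
```

For the TruthfulQA setting quoted above, `split_indices(684, 400, random.Random(seed))` would yield 400 calibration and 284 test indices per trial; repeating this with 30 different seeds mirrors the 30-trial loop.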