Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Rational Tuning of LLM Cascades via Probabilistic Modeling
Authors: Michael J. Zellinger, Matt Thomson
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to selecting confidence thresholds using Bayesian optimization, our parametric Markov-copula model yields more favorable error-cost trade-offs, improving the area under the error-cost curve by 4.3% on average for cascades with k ≥ 3 models. In the low-sample regime with n ≤ 30 training examples, the performance improvement widens to 10.2%, suggesting that our framework's inductive assumptions about the interactions between the error rates of different LLMs enhance sample efficiency. |
| Researcher Affiliation | Academia | The paper lists only the authors' names (Michael J. Zellinger and Matt Thomson) without any explicit institutional affiliations or email addresses in the provided text. Thus, it is not possible to classify their affiliations as industry, academia, or a collaboration. |
| Pseudocode | Yes | Algorithm 1 Computing P(Correct) and E[Cost]. Require: confidence thresholds ϕ1, ..., ϕk−1 ∈ ℝ^(k−1). 1: cum_cost ← E[C1] # cumulative expected cost; 2: cum_transition_prob ← 1 # cumulative transition probability; 3: correctness_terms ← [ ] # expected correctness due to different models; 4: cost_terms ← [ ] # expected costs due to different models; 5: ϕk ... |
| Open Source Code | Yes | Code for reproducing the results of the paper is available on GitHub at github.com/mzelling/rational-llm-cascades. |
| Open Datasets | Yes | Benchmarks: we evaluate our probabilistic model and the error-cost curves of LLM cascades on six language modeling benchmarks including MMLU (Hendrycks et al., 2021); MedMCQA (Pal et al., 2022); TriviaQA (Joshi et al., 2017); XSum (Narayan et al., 2018); GSM8K (Cobbe et al., 2021); and TruthfulQA (Lin et al., 2022b). |
| Dataset Splits | Yes | For each benchmark, we use 300 examples for training and 1000 examples for testing, except on MMLU and TruthfulQA. On MMLU, the dev set contains only 285 examples, of which we use all. The validation set consists of 1531 examples and is divided into different subjects; to avoid bias from subject selection, we take all 1531 validation examples for testing. On TruthfulQA, the entire dataset consists of only 817 observations, of which we randomly select 300 for training and the remaining 517 for testing. |
| Hardware Specification | No | The paper mentions using the "OpenAI API" and "Fireworks API" for running inference, which are cloud services. However, it does not provide specific hardware details such as GPU models, CPU types, or detailed specifications of the cloud resources used for the experiments. It discusses model costs and prices but not the underlying hardware. |
| Software Dependencies | No | The paper mentions "a preliminary version of the niagara Python package for LLM cascading", "HEBO package", and "the Python package paretoset" but does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We formulate the optimization problem θ* = argmin_θ (1 − P_θ(Correct)) + λ·E_θ[Cost], where θ ∈ ℝ^(k−1) denotes the confidence thresholds (ϕ1, ..., ϕk−1). The Lagrange multiplier λ ≥ 0 indicates the user's cost sensitivity. To solve the minimization problem (11), we use the L-BFGS-B optimizer, a low-memory version of the Broyden-Fletcher-Goldfarb-Shanno algorithm (Liu and Nocedal, 1989) modified to handle simple box constraints. The Bayesian optimization minimizes (11) in a manner analogous to our Markov-copula ("Rational Tuning") approach. We run HEBO for as many iterations as needed until the change in loss between successive iterations is below a numerical tolerance (ϵ = 10⁻⁵). In practice, we found that the final change in loss is typically 0.0. Following the practical guidance of HEBO's authors, we use four parallel suggestions during each iteration. |
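To make the Pseudocode and Experiment Setup entries concrete, the following is a minimal sketch of the loss in the paper's Eq. (11), minimized over cascade confidence thresholds with L-BFGS-B as the authors describe. The per-model costs, accuracies, and the logistic deferral probability below are made-up placeholders, and model confidences are treated as independent across cascade stages; the paper instead couples them with its Markov-copula model, so this is illustrative only.

```python
"""Sketch: tune confidence thresholds of a k-model LLM cascade by minimizing
(1 - P(Correct)) + lambda * E[Cost] over thresholds, cf. Eq. (11) and Algorithm 1.
All numbers and the deferral model are hypothetical placeholders."""
import numpy as np
from scipy.optimize import minimize

# Hypothetical 3-model cascade: per-query cost and standalone accuracy.
costs = np.array([1.0, 5.0, 20.0])
accs = np.array([0.70, 0.85, 0.95])

def p_defer(phi):
    """P(confidence < phi): placeholder logistic CDF, not the paper's model."""
    return 1.0 / (1.0 + np.exp(-10.0 * (phi - 0.5)))

def correct_and_cost(thresholds):
    """Accumulate P(Correct) and E[Cost] over the cascade (cf. Algorithm 1)."""
    cum_transition_prob = 1.0  # probability of reaching the current model
    exp_cost = 0.0
    p_correct = 0.0
    for i in range(len(costs)):
        exp_cost += cum_transition_prob * costs[i]  # model i is invoked
        defer = p_defer(thresholds[i]) if i < len(thresholds) else 0.0
        p_correct += cum_transition_prob * (1.0 - defer) * accs[i]
        cum_transition_prob *= defer  # query flows on to model i+1
    return p_correct, exp_cost

def loss(thresholds, lam=0.05):
    p_correct, exp_cost = correct_and_cost(thresholds)
    return (1.0 - p_correct) + lam * exp_cost

# L-BFGS-B with box constraints on the k-1 = 2 thresholds, as in the paper.
res = minimize(loss, x0=[0.5, 0.5], method="L-BFGS-B", bounds=[(0.0, 1.0)] * 2)
print("thresholds:", res.x, "loss:", loss(res.x))
```

Sweeping the Lagrange multiplier λ traces out an error-cost curve, which is the object the paper's AUC comparisons (Markov-copula vs. HEBO Bayesian optimization) are computed on.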