Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning
Authors: Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy. |
| Researcher Affiliation | Academia | Xuechen Zhang, University of Michigan, Ann Arbor, MI; Zijian Huang, University of Michigan, Ann Arbor, MI; Ege Onur Taga, University of Michigan, Ann Arbor, MI; Carlee Joe-Wong, Carnegie Mellon University, Pittsburgh, PA; Samet Oymak, University of Michigan, Ann Arbor, MI; Jiasi Chen, University of Michigan, Ann Arbor, MI |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | We release our training and evaluation along with the data generation code in a zip file. |
| Open Datasets | Yes | We use three representative datasets: GSM8K [4], which contains 8.5K high quality grade school math problems created by human writers; CSQA [11], which consists of 12102 multiple choice commonsense reasoning questions encountered in daily life; and LLC [17], where the task is to concatenate the last letters of words in a name (e.g., Amy Brown → yn). |
| Dataset Splits | Yes | We further split the 7.5K training data into 6K training data and 1.5K validation data. |
| Hardware Specification | Yes | The Llama models are open-source and run locally on our servers (one A40 GPU for Llama-2-7b and two A40 for Llama-2-13b), while the GPT models rely on commercial APIs. |
| Software Dependencies | No | We used Deep Q-Network (DQN) [8] to train the reinforcement learning (RL) policy in TREACLE, consisting of a two-layer neural network. |
| Experiment Setup | Yes | During training, we used the Adam optimizer with a learning rate 1 × 10⁻⁴, Huber loss as the loss function, and a batch size of 64. Our DQN has three layers with ReLU and softmax activations, and the size of the hidden layer is 128. We set λ = 5 in the reward function. For re-queries, we set different temperature settings for Llama and GPT (0.8 and 1, respectively) because their ranges are different ([0, 1] and [0, 2] respectively). The actions are selected according to an ε-greedy policy. Simply put, the actions are sometimes chosen by the DQN and sometimes sampled uniformly. The probability of choosing a random action starts at ε_START = 0.9 and decays exponentially towards ε_END = 0.05. For the reward decay, we use γ = 0.99. |
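
The ε-greedy exploration schedule quoted in the Experiment Setup row can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' code: only the start value 0.9 and the end value 0.05 come from the quoted text, while the exact exponential form, the decay constant `EPS_DECAY_STEPS`, and the function names `epsilon_at` and `select_action` are hypothetical choices made for the example.

```python
import math
import random

EPS_START = 0.9         # initial exploration probability (from the quoted setup)
EPS_END = 0.05          # floor the exploration probability decays towards (from the quoted setup)
EPS_DECAY_STEPS = 1000  # ASSUMPTION: the decay rate is not reported in the quoted text


def epsilon_at(step: int) -> float:
    """Exploration probability after `step` action selections (assumed exponential form)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY_STEPS)


def select_action(q_values, step, rng=random):
    """Epsilon-greedy choice over a list of Q-values, one per action."""
    if rng.random() < epsilon_at(step):
        # Explore: sample an action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Early in training `epsilon_at` is close to 0.9, so most actions are sampled uniformly; as `step` grows, the value approaches 0.05 and the policy mostly follows the Q-values produced by the DQN.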