Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning
Authors: Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy. |
| Researcher Affiliation | Academia | Xuechen Zhang, University of Michigan, Ann Arbor, MI (zxuechen@umich.edu); Zijian Huang, University of Michigan, Ann Arbor, MI (zijianh@umich.edu); Ege Onur Taga, University of Michigan, Ann Arbor, MI (egetaga@umich.edu); Carlee Joe-Wong, Carnegie Mellon University, Pittsburgh, PA (cjoewong@andrew.cmu.edu); Samet Oymak, University of Michigan, Ann Arbor, MI (oymak@umich.edu); Jiasi Chen, University of Michigan, Ann Arbor, MI (jiasi@umich.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | We release our training and evaluation code, along with the data generation code, in a zip file. |
| Open Datasets | Yes | We use three representative datasets: GSM8K [4], which contains 8.5K high-quality grade school math problems created by human writers; CSQA [11], which consists of 12,102 multiple-choice commonsense reasoning questions encountered in daily life; and LLC [17], where the task is to concatenate the last letters of words in a name (e.g., Amy Brown → "yn"). |
| Dataset Splits | Yes (see the split sketch below the table) | We further split the 7.5K training data into 6K training data and 1.5K validation data. |
| Hardware Specification | Yes | The Llama models are open-source and run locally on our servers (one A40 GPU for Llama-2-7b and two A40 for Llama-2-13b), while the GPT models rely on commercial APIs. |
| Software Dependencies | No | We used Deep Q-Network (DQN) [8] to train the reinforcement learning (RL) policy in TREACLE, consisting of a two-layer neural network. |
| Experiment Setup | Yes (see the DQN sketch below the table) | During training, we used the Adam optimizer with a learning rate of 1 × 10⁻⁴, Huber loss as the loss function, and a batch size of 64. Our DQN has three layers with ReLU and softmax activations, and the size of the hidden layer is 128. We set λ = 5 in the reward function. For re-queries, we set different temperature settings for Llama and GPT (0.8 and 1, respectively) because their ranges are different ([0, 1] and [0, 2], respectively). The actions are selected according to an ε-greedy policy. Simply put, the actions are sometimes chosen by the DQN and sometimes sampled uniformly. The probability of choosing a random action starts at ε_start = 0.9 and decays exponentially towards ε_end = 0.05. For the reward decay, we use γ = 0.99. |
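
The "Dataset Splits" row reports only the raw split sizes (the ~7.5K GSM8K training questions are further divided into 6K training and 1.5K validation examples). Below is a minimal sketch of such a split; the use of the Hugging Face `datasets` loader and the seed value are assumptions, not details given in the quoted text.

```python
# Hedged sketch of the reported GSM8K split: ~7.5K training questions
# further divided into 6K training and 1.5K validation examples.
# The use of Hugging Face `datasets` and the fixed seed are assumptions.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")                  # ~7.5K train / ~1.3K test
split = gsm8k["train"].train_test_split(test_size=1500, seed=0)
train_set, val_set = split["train"], split["test"]     # ≈6K train, 1.5K validation
```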
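
The "Experiment Setup" row pins down most of the DQN hyperparameters: Adam with learning rate 1 × 10⁻⁴, Huber loss, batch size 64, hidden size 128, ReLU and softmax activations, ε-greedy exploration decaying exponentially from 0.9 to 0.05, and γ = 0.99. The sketch below shows how those reported values fit together. The use of PyTorch, the state dimension, the number of actions, and the ε decay rate are assumptions not stated in the quoted text; note also that the quoted evidence describes the network both as "a two-layer neural network" and as having "three layers", and the sketch follows the more detailed three-layer description.

```python
# Hedged sketch of the DQN training configuration quoted above.
# Only the reported hyperparameters (Adam, lr=1e-4, Huber loss, batch size 64,
# hidden size 128, ReLU/softmax, eps 0.9 -> 0.05, gamma=0.99) come from the paper;
# PyTorch, STATE_DIM, NUM_ACTIONS, and EPS_DECAY are assumptions.
import math
import torch
import torch.nn as nn

STATE_DIM = 8      # hypothetical: features of the question/context seen by the policy
NUM_ACTIONS = 6    # hypothetical: one action per (LLM, prompt) choice

class DQN(nn.Module):
    """Three linear layers with ReLU, hidden size 128, softmax output
    (mirroring the quoted description of the network)."""
    def __init__(self, state_dim=STATE_DIM, hidden=128, num_actions=NUM_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

policy_net = DQN()
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)
loss_fn = nn.HuberLoss()
BATCH_SIZE, GAMMA = 64, 0.99
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 1000  # decay rate is an assumption

def epsilon(step):
    """Exponential decay from EPS_START toward EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)

def select_action(state, step):
    """Epsilon-greedy: random action with probability epsilon, else the DQN's argmax."""
    if torch.rand(1).item() < epsilon(step):
        return torch.randint(NUM_ACTIONS, (1,)).item()
    with torch.no_grad():
        return policy_net(state.unsqueeze(0)).argmax(dim=-1).item()
```

The replay buffer, target-network update, and reward computation (including the λ = 5 term) that a full DQN training loop would need are omitted here; the sketch only illustrates how the quoted hyperparameters slot into a standard DQN setup.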