Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

Authors: Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy.
Researcher Affiliation | Academia | Xuechen Zhang, University of Michigan, Ann Arbor, MI, zxuechen@umich.edu; Zijian Huang, University of Michigan, Ann Arbor, MI, zijianh@umich.edu; Ege Onur Taga, University of Michigan, Ann Arbor, MI, egetaga@umich.edu; Carlee Joe-Wong, Carnegie Mellon University, Pittsburgh, PA, cjoewong@andrew.cmu.edu; Samet Oymak, University of Michigan, Ann Arbor, MI, oymak@umich.edu; Jiasi Chen, University of Michigan, Ann Arbor, MI, jiasi@umich.edu
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | Yes | We release our training and evaluation code, along with the data generation code, in a zip file.
Open Datasets | Yes | We use three representative datasets: GSM8K [4], which contains 8.5K high-quality grade school math problems created by human writers; CSQA [11], which consists of 12,102 multiple-choice commonsense reasoning questions encountered in daily life; and LLC [17], where the task is to concatenate the last letters of the words in a name (e.g., "Amy Brown" → "yn"; see the illustration after the table).
Dataset Splits | Yes | We further split the 7.5K training data into 6K training data and 1.5K validation data.
Hardware Specification | Yes | The Llama models are open-source and run locally on our servers (one A40 GPU for Llama-2-7b and two A40 GPUs for Llama-2-13b), while the GPT models rely on commercial APIs.
Software Dependencies | No | We used a Deep Q-Network (DQN) [8] to train the reinforcement learning (RL) policy in TREACLE, consisting of a two-layer neural network.
Experiment Setup | Yes | During training, we used the Adam optimizer with a learning rate of 1 × 10⁻⁴, Huber loss as the loss function, and a batch size of 64. Our DQN has three layers with ReLU and softmax activations, and the size of the hidden layer is 128. We set λ = 5 in the reward function. For re-queries, we set different temperatures for Llama and GPT (0.8 and 1, respectively) because their temperature ranges differ ([0, 1] and [0, 2], respectively). Actions are selected according to an ε-greedy policy: simply put, actions are sometimes chosen by the DQN and sometimes sampled uniformly. The probability of choosing a random action starts at ε_START = 0.9 and decays exponentially towards ε_END = 0.05. For the reward decay, we use γ = 0.99.
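
To make the quoted training configuration concrete, below is a minimal PyTorch sketch of an ε-greedy DQN policy using the stated hyperparameters (hidden size 128, ReLU and softmax activations, Adam with learning rate 1e-4, Huber loss, batch size 64, ε decaying from 0.9 to 0.05, γ = 0.99). The state dimension, the number of (LLM, prompt) actions, the exact layer layout, and the ε decay constant are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch of the DQN setup quoted above (PyTorch).
# STATE_DIM, NUM_ACTIONS, the exact layer layout, and EPS_DECAY are hypothetical
# placeholders; the learning rate, hidden size, loss, batch size, epsilon range,
# and gamma follow the quoted Experiment Setup row.
import math
import random
import torch
import torch.nn as nn

STATE_DIM = 4      # hypothetical state features (e.g., remaining budget, question length)
NUM_ACTIONS = 6    # hypothetical number of (LLM, prompt) choices

class DQN(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # "Three layers with ReLU and softmax activations, hidden size 128"
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

policy_net = DQN(STATE_DIM, NUM_ACTIONS)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)  # learning rate 1e-4
loss_fn = nn.SmoothL1Loss()        # Huber loss, used in the (omitted) DQN update step
BATCH_SIZE, GAMMA = 64, 0.99       # batch size and reward decay from the paper
EPS_START, EPS_END = 0.9, 0.05     # epsilon range from the paper
EPS_DECAY = 1000                   # decay constant: an assumption, not stated in the paper

def select_action(state: torch.Tensor, steps_done: int) -> int:
    """Epsilon-greedy action selection with exponentially decaying epsilon."""
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)
    if random.random() < eps:
        return random.randrange(NUM_ACTIONS)            # explore: uniform random action
    with torch.no_grad():
        return int(policy_net(state).argmax().item())   # exploit: greedy w.r.t. Q output
```

Since argmax over a softmax output selects the same action as argmax over the raw values, this sketch stays consistent with the quoted architecture while remaining a standard ε-greedy DQN policy.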
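
For reference, the LLC (last-letter concatenation) task cited in the Open Datasets row can be illustrated with a one-line helper; the function name is ours, not part of the dataset or the paper.

```python
def last_letter_concat(name: str) -> str:
    # Concatenate the last letter of each word, e.g. "Amy Brown" -> "yn".
    return "".join(word[-1] for word in name.split())

assert last_letter_concat("Amy Brown") == "yn"
```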