Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

Authors: Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy.
Researcher Affiliation | Academia | Xuechen Zhang, University of Michigan, Ann Arbor, MI, zxuechen@umich.edu; Zijian Huang, University of Michigan, Ann Arbor, MI, zijianh@umich.edu; Ege Onur Taga, University of Michigan, Ann Arbor, MI, egetaga@umich.edu; Carlee Joe-Wong, Carnegie Mellon University, Pittsburgh, PA, cjoewong@andrew.cmu.edu; Samet Oymak, University of Michigan, Ann Arbor, MI, oymak@umich.edu; Jiasi Chen, University of Michigan, Ann Arbor, MI, jiasi@umich.edu
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | Yes | We release our training and evaluation code, along with the data generation code, in a zip file.
Open Datasets | Yes | We use three representative datasets: GSM8K [4], which contains 8.5K high-quality grade school math problems created by human writers; CSQA [11], which consists of 12,102 multiple-choice commonsense reasoning questions encountered in daily life; and LLC [17], where the task is to concatenate the last letters of the words in a name (e.g., "Amy Brown" → "yn"; see the illustration after the table).
Dataset Splits | Yes | We further split the 7.5K training data into 6K training data and 1.5K validation data.
Hardware Specification | Yes | The Llama models are open-source and run locally on our servers (one A40 GPU for Llama-2-7b and two A40 GPUs for Llama-2-13b), while the GPT models rely on commercial APIs.
Software Dependencies | No | We used a Deep Q-Network (DQN) [8] to train the reinforcement learning (RL) policy in TREACLE, consisting of a two-layer neural network.
Experiment Setup | Yes | During training, we used the Adam optimizer with a learning rate of 1 × 10⁻⁴, Huber loss as the loss function, and a batch size of 64. Our DQN has three layers with ReLU and softmax activations, and the size of the hidden layer is 128. We set λ = 5 in the reward function. For re-queries, we set different temperatures for Llama and GPT (0.8 and 1, respectively) because their temperature ranges differ ([0, 1] and [0, 2], respectively). Actions are selected according to an ε-greedy policy: simply put, actions are sometimes chosen by the DQN and sometimes sampled uniformly. The probability of choosing a random action starts at ε_START = 0.9 and decays exponentially towards ε_END = 0.05. For the reward decay, we use γ = 0.99.
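
To make the quoted training configuration concrete, below is a minimal PyTorch sketch of an ε-greedy DQN policy using the stated hyperparameters (hidden size 128, ReLU and softmax activations, Adam with learning rate 1e-4, Huber loss, batch size 64, ε decaying from 0.9 to 0.05, γ = 0.99). The state dimension, the number of (LLM, prompt) actions, the exact layer layout, and the ε decay constant are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch of the DQN setup quoted above (PyTorch).
# STATE_DIM, NUM_ACTIONS, the exact layer layout, and EPS_DECAY are hypothetical
# placeholders; the learning rate, hidden size, loss, batch size, epsilon range,
# and gamma follow the quoted Experiment Setup row.
import math
import random
import torch
import torch.nn as nn

STATE_DIM = 4      # hypothetical state features (e.g., remaining budget, question length)
NUM_ACTIONS = 6    # hypothetical number of (LLM, prompt) choices

class DQN(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # "Three layers with ReLU and softmax activations, hidden size 128"
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

policy_net = DQN(STATE_DIM, NUM_ACTIONS)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)  # learning rate 1e-4
loss_fn = nn.SmoothL1Loss()        # Huber loss, used in the (omitted) DQN update step
BATCH_SIZE, GAMMA = 64, 0.99       # batch size and reward decay from the paper
EPS_START, EPS_END = 0.9, 0.05     # epsilon range from the paper
EPS_DECAY = 1000                   # decay constant: an assumption, not stated in the paper

def select_action(state: torch.Tensor, steps_done: int) -> int:
    """Epsilon-greedy action selection with exponentially decaying epsilon."""
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)
    if random.random() < eps:
        return random.randrange(NUM_ACTIONS)            # explore: uniform random action
    with torch.no_grad():
        return int(policy_net(state).argmax().item())   # exploit: greedy w.r.t. Q output
```

Since argmax over a softmax output selects the same action as argmax over the raw values, this sketch stays consistent with the quoted architecture while remaining a standard ε-greedy DQN policy.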
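
For reference, the LLC (last-letter concatenation) task cited in the Open Datasets row can be illustrated with a one-line helper; the function name is ours, not part of the dataset or the paper.

```python
def last_letter_concat(name: str) -> str:
    # Concatenate the last letter of each word, e.g. "Amy Brown" -> "yn".
    return "".join(word[-1] for word in name.split())

assert last_letter_concat("Amy Brown") == "yn"
```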