Smoothie: Label Free Language Model Routing

Authors: Neel Guha, Mayee Chen, Trevor Chow, Ishan Khare, Christopher Ré

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate SMOOTHIE in three stages.
Researcher Affiliation | Academia | Neel Guha, Mayee F. Chen, Trevor Chow, Ishan S. Khare, Christopher Ré (Stanford University, Department of Computer Science; {nguha, mfchen, tmychow, iskhare, chrismre}@stanford.edu)
Pseudocode | Yes | Algorithm 1: ESTIMATE SCORES
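For intuition about what a label-free score-estimation routine of this shape can look like, here is an illustrative numpy sketch (not the paper's exact Algorithm 1). It uses the triplet identity for independent errors: if each model's generation embedding equals a shared latent "true" embedding plus independent zero-mean noise, then the expected squared distance between models i and j is σ_i² + σ_j², so any triplet (i, j, k) identifies σ_i² = (d_ij + d_ik − d_jk) / 2. All names and the synthetic data below are assumptions for illustration.

```python
import numpy as np

def estimate_error_variances(embs):
    """Estimate each model's error variance from pairwise embedding distances.

    embs: array of shape (n_models, n_samples, dim) holding an embedding of
    each model's generation for each input. No labels are used.
    """
    m = embs.shape[0]
    # d[i, j]: mean squared embedding distance between models i and j
    d = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            d[i, j] = np.mean(np.sum((embs[i] - embs[j]) ** 2, axis=-1))
    var = np.zeros(m)
    for i in range(m):
        # Average the triplet estimate over all (j, k) with i, j, k distinct.
        ests = [(d[i, j] + d[i, k] - d[j, k]) / 2
                for j in range(m) for k in range(m)
                if len({i, j, k}) == 3]
        var[i] = np.mean(ests)
    return var  # lower estimated variance ~ higher-quality model

# Synthetic check: three hypothetical models with noise scales 0.1, 0.5, 1.0
rng = np.random.default_rng(0)
true = rng.normal(size=(2000, 16))          # latent "true" embeddings
scales = [0.1, 0.5, 1.0]
embs = np.stack([true + s * rng.normal(size=true.shape) for s in scales])
var = estimate_error_variances(embs)
print(np.argsort(var))  # lowest-variance (best) model listed first
```

With enough samples the estimated variances recover the true noise ordering, so the cleanest model can be selected without any labels.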
Open Source Code | Yes | Code for reproducing our results and using SMOOTHIE is available at https://github.com/HazyResearch/smoothie.
Open Datasets | Yes | Table 5 (dataset name, Hugging Face URL): E2E https://huggingface.co/datasets/e2e_nlg; CNN/Daily Mail https://huggingface.co/datasets/cnn_dailymail; SQuAD https://huggingface.co/datasets/hazyresearch/based-squad; XSum https://huggingface.co/datasets/EdinburghNLP/xsum; TriviaQA https://huggingface.co/datasets/mandarjoshi/trivia_qa; WebNLG https://huggingface.co/datasets/web_nlg; Definition Extraction https://huggingface.co/datasets/nguha/legalbench
Dataset Splits | Yes | We sample a small labeled validation set (50 samples) and select the LLM that performs the best on this set. To account for sampling variation, we repeat this with 10 random draws and report the average performance.
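The best-on-validation baseline described in this row can be sketched as follows. This is a minimal illustrative implementation, assuming a hypothetical 0/1 per-sample quality matrix `correct`; the model count, sample count, and success rates are placeholders, not the paper's data.

```python
import numpy as np

def best_on_validation(correct, n_val=50, n_draws=10, seed=0):
    """Average test performance of the pick-best-on-validation baseline.

    correct: (n_models, n_samples) 0/1 matrix of per-sample quality.
    For each of n_draws random draws, sample n_val labeled validation points,
    select the model with the best validation score, and record that model's
    score on the held-out remainder. Returns the mean over draws.
    """
    rng = np.random.default_rng(seed)
    n_models, n_samples = correct.shape
    test_scores = []
    for _ in range(n_draws):
        idx = rng.permutation(n_samples)
        val, test = idx[:n_val], idx[n_val:]
        best = np.argmax(correct[:, val].mean(axis=1))  # best on validation
        test_scores.append(correct[best, test].mean())  # score on the rest
    return float(np.mean(test_scores))

# Hypothetical two-model ensemble where model 1 is clearly stronger (~0.75)
rng = np.random.default_rng(1)
correct = np.stack([
    (rng.random(1000) < 0.55).astype(float),  # weaker model
    (rng.random(1000) < 0.75).astype(float),  # stronger model
])
print(best_on_validation(correct))
```

Averaging over multiple draws, as the paper does, smooths out the sampling noise of the small validation set.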
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided. The paper thanks the "Microsoft Accelerate Foundation Models Research Program for providing portions of the compute" and lists several supporting organizations, but gives no hardware specifications.
Software Dependencies | Yes | For all datasets, we apply SMOOTHIE-GLOBAL using Sentence-BERT (all-mpnet-base-v2) embeddings of generations [76].
Experiment Setup | Yes | For all tasks other than Definition Extraction, we evaluate SMOOTHIE-GLOBAL on a 1000-sample subset. For these tasks, we consider two ensembles of LLMs at different size points. At the 3B size point, our ensemble consists of Pythia-2.8B [7], Gemma-2B [91], Incite-3B [17], and Dolly-3B [18]. At the 7B size point, our ensemble consists of Llama-2 [92], Mistral [40], Vicuna [107], Gemma-7B [91], and Nous Capybara [19]. We manually write a single prompt template for each task, and all model generations rely on this template.