Smoothie: Label Free Language Model Routing
Authors: Neel Guha, Mayee Chen, Trevor Chow, Ishan Khare, Christopher Ré
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate SMOOTHIE in three stages. |
| Researcher Affiliation | Academia | Neel Guha, Mayee F. Chen, Trevor Chow, Ishan S. Khare, Christopher Ré. Stanford University, Department of Computer Science. {nguha, mfchen, tmychow, iskhare, chrismre}@stanford.edu |
| Pseudocode | Yes | Algorithm 1 ESTIMATE SCORES (a hedged code sketch of this estimator appears after the table) |
| Open Source Code | Yes | Code for reproducing our results and using SMOOTHIE is available at https://github.com/HazyResearch/smoothie. |
| Open Datasets | Yes | Table 5: E2E (https://huggingface.co/datasets/e2e_nlg); CNN/Daily Mail (https://huggingface.co/datasets/cnn_dailymail); SQuAD (https://huggingface.co/datasets/hazyresearch/based-squad); XSum (https://huggingface.co/datasets/EdinburghNLP/xsum); TriviaQA (https://huggingface.co/datasets/mandarjoshi/trivia_qa); WebNLG (https://huggingface.co/datasets/web_nlg); Definition Extraction (https://huggingface.co/datasets/nguha/legalbench). A dataset-loading sketch appears after the table. |
| Dataset Splits | Yes | We sample a small labeled validation set (50 samples) and select the LLM that performs the best on this set. To account for sampling variation, we repeat this with 10 random draws and report the average performance. (A sketch of this baseline appears after the table.) |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments were provided. The paper mentions "Microsoft Accelerate Foundation Models Research Program for providing portions of the compute" and lists several supporting organizations, but without specific hardware specifications. |
| Software Dependencies | Yes | For all datasets, we apply SMOOTHIE-GLOBAL using Sentence-BERT (all-mpnet-base-v2) embeddings of generations [76]. (An embedding sketch appears after the table.) |
| Experiment Setup | Yes | For all tasks other than Definition Extraction, we evaluate SMOOTHIE-GLOBAL on a 1000-sample subset. For these tasks, we consider two ensembles of LLMs at different size points. At the 3B size point, our ensemble consists of Pythia-2.8B [7], Gemma-2B [91], Incite-3B [17], and Dolly-3B [18]. At the 7B size point, our ensemble consists of Llama-2 [92], Mistral [40], Vicuna [107], Gemma-7B [91], and Nous Capybara [19]. We manually write a single prompt template for each task, and all model generations rely on this template. |
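
The Pseudocode row above cites Algorithm 1 (ESTIMATE SCORES). Below is a minimal, unofficial sketch of the label-free scoring idea it describes, assuming the paper's Gaussian graphical-model view in which each LLM's embedded generation is a noisy observation of a shared latent embedding, so per-model noise can be recovered from pairwise distances via the triplet identity. The function and variable names are ours; this is a reading of the method, not the authors' reference implementation.

```python
import numpy as np

def estimate_scores(embeddings: np.ndarray) -> np.ndarray:
    """Label-free quality scores for m models from embedded generations.

    embeddings: shape (n_samples, m_models, dim). Assumes each model's
    embedding x_i = z + eps_i is a noisy view of a latent embedding z with
    independent noise, so E||x_i - x_j||^2 = sigma_i^2 + sigma_j^2.
    """
    n, m, _ = embeddings.shape
    assert m >= 3, "triplet identity needs at least three models"
    # delta[i, j]: mean squared distance between model i and j generations.
    delta = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            diffs = embeddings[:, i, :] - embeddings[:, j, :]
            delta[i, j] = np.mean(np.sum(diffs ** 2, axis=1))
    # Triplet identity: sigma_i^2 = (delta_ij + delta_ik - delta_jk) / 2,
    # averaged over all valid (j, k) pairs for robustness.
    sigma2 = np.zeros(m)
    for i in range(m):
        vals = [
            (delta[i, j] + delta[i, k] - delta[j, k]) / 2
            for j in range(m) for k in range(m)
            if len({i, j, k}) == 3
        ]
        sigma2[i] = max(np.mean(vals), 1e-8)  # clip for numerical safety
    # Higher score = lower estimated noise; routing picks argmax(scores).
    return 1.0 / sigma2
```

Routing then selects, globally or per sample, the model whose score is highest.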
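
The Open Datasets row lists the Hugging Face datasets from Table 5. The snippet below shows how they could be fetched with the `datasets` library; the config and split names here (e.g. CNN/Daily Mail's "3.0.0", TriviaQA's "rc", the WebNLG and LegalBench configs) are our assumptions, not details taken from the paper.

```python
from datasets import load_dataset

# Table 5 datasets; config/split choices below are illustrative guesses.
e2e = load_dataset("e2e_nlg", split="test")
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")
squad = load_dataset("hazyresearch/based-squad", split="validation")
xsum = load_dataset("EdinburghNLP/xsum", split="test")
trivia_qa = load_dataset("mandarjoshi/trivia_qa", "rc", split="validation")
web_nlg = load_dataset("web_nlg", "release_v3.0_en", split="test")
defn_ext = load_dataset("nguha/legalbench", "definition_extraction", split="test")
```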
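
The Dataset Splits row describes the labeled-validation baseline: pick the best LLM on 50 labeled samples, repeated over 10 random draws. A sketch follows, assuming per-sample task metrics (e.g. ROUGE) are precomputed in an (n_samples, n_models) array; the function name and interface are ours.

```python
import numpy as np

def best_on_validation(scores: np.ndarray, n_val: int = 50,
                       n_draws: int = 10, seed: int = 0) -> list[int]:
    """Select the best model on small labeled validation sets.

    scores: shape (n_samples, n_models), per-sample metric for each model.
    Returns the chosen model index for each of n_draws random draws; the
    reported number is the average test performance of these picks.
    """
    rng = np.random.default_rng(seed)
    picks = []
    for _ in range(n_draws):
        idx = rng.choice(len(scores), size=n_val, replace=False)
        picks.append(int(np.argmax(scores[idx].mean(axis=0))))
    return picks
```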
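
The Software Dependencies row names Sentence-BERT (all-mpnet-base-v2). Below is a minimal embedding call with the `sentence-transformers` package; the example strings are placeholders.

```python
from sentence_transformers import SentenceTransformer

# Embed candidate generations with the encoder the paper names.
encoder = SentenceTransformer("all-mpnet-base-v2")
generations = ["candidate output from model A", "candidate output from model B"]
embeddings = encoder.encode(generations)  # shape: (n_texts, 768)
```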