Smoothie: Label Free Language Model Routing

Authors: Neel Guha, Mayee Chen, Trevor Chow, Ishan Khare, Christopher Ré

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate SMOOTHIE in three stages.
Researcher Affiliation | Academia | Neel Guha, Mayee F. Chen, Trevor Chow, Ishan S. Khare, Christopher Ré (Stanford University, Department of Computer Science; {nguha, mfchen, tmychow, iskhare, chrismre}@stanford.edu)
Pseudocode | Yes | Algorithm 1: ESTIMATE SCORES
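For intuition about what a label-free score-estimation routine of this shape can look like, here is an illustrative numpy sketch (not the paper's exact Algorithm 1). It uses the triplet identity for independent errors: if each model's generation embedding equals a shared latent "true" embedding plus independent zero-mean noise, then the expected squared distance between models i and j is σ_i² + σ_j², so any triplet (i, j, k) identifies σ_i² = (d_ij + d_ik − d_jk) / 2. All names and the synthetic data below are assumptions for illustration.

```python
import numpy as np

def estimate_error_variances(embs):
    """Estimate each model's error variance from pairwise embedding distances.

    embs: array of shape (n_models, n_samples, dim) holding an embedding of
    each model's generation for each input. No labels are used.
    """
    m = embs.shape[0]
    # d[i, j]: mean squared embedding distance between models i and j
    d = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            d[i, j] = np.mean(np.sum((embs[i] - embs[j]) ** 2, axis=-1))
    var = np.zeros(m)
    for i in range(m):
        # Average the triplet estimate over all (j, k) with i, j, k distinct.
        ests = [(d[i, j] + d[i, k] - d[j, k]) / 2
                for j in range(m) for k in range(m)
                if len({i, j, k}) == 3]
        var[i] = np.mean(ests)
    return var  # lower estimated variance ~ higher-quality model

# Synthetic check: three hypothetical models with noise scales 0.1, 0.5, 1.0
rng = np.random.default_rng(0)
true = rng.normal(size=(2000, 16))          # latent "true" embeddings
scales = [0.1, 0.5, 1.0]
embs = np.stack([true + s * rng.normal(size=true.shape) for s in scales])
var = estimate_error_variances(embs)
print(np.argsort(var))  # lowest-variance (best) model listed first
```

With enough samples the estimated variances recover the true noise ordering, so the cleanest model can be selected without any labels.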
Open Source Code | Yes | Code for reproducing our results and using SMOOTHIE is available at https://github.com/HazyResearch/smoothie.
Open Datasets | Yes | Table 5 (dataset name, Hugging Face URL): E2E https://huggingface.co/datasets/e2e_nlg; CNN/Daily Mail https://huggingface.co/datasets/cnn_dailymail; SQuAD https://huggingface.co/datasets/hazyresearch/based-squad; XSum https://huggingface.co/datasets/EdinburghNLP/xsum; TriviaQA https://huggingface.co/datasets/mandarjoshi/trivia_qa; WebNLG https://huggingface.co/datasets/web_nlg; Definition Extraction https://huggingface.co/datasets/nguha/legalbench
Dataset Splits | Yes | We sample a small labeled validation set (50 samples) and select the LLM that performs the best on this set. To account for sampling variation, we repeat this with 10 random draws and report the average performance.
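The best-on-validation baseline described in this row can be sketched as follows. This is a minimal illustrative implementation, assuming a hypothetical 0/1 per-sample quality matrix `correct`; the model count, sample count, and success rates are placeholders, not the paper's data.

```python
import numpy as np

def best_on_validation(correct, n_val=50, n_draws=10, seed=0):
    """Average test performance of the pick-best-on-validation baseline.

    correct: (n_models, n_samples) 0/1 matrix of per-sample quality.
    For each of n_draws random draws, sample n_val labeled validation points,
    select the model with the best validation score, and record that model's
    score on the held-out remainder. Returns the mean over draws.
    """
    rng = np.random.default_rng(seed)
    n_models, n_samples = correct.shape
    test_scores = []
    for _ in range(n_draws):
        idx = rng.permutation(n_samples)
        val, test = idx[:n_val], idx[n_val:]
        best = np.argmax(correct[:, val].mean(axis=1))  # best on validation
        test_scores.append(correct[best, test].mean())  # score on the rest
    return float(np.mean(test_scores))

# Hypothetical two-model ensemble where model 1 is clearly stronger (~0.75)
rng = np.random.default_rng(1)
correct = np.stack([
    (rng.random(1000) < 0.55).astype(float),  # weaker model
    (rng.random(1000) < 0.75).astype(float),  # stronger model
])
print(best_on_validation(correct))
```

Averaging over multiple draws, as the paper does, smooths out the sampling noise of the small validation set.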
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided. The paper thanks the "Microsoft Accelerate Foundation Models Research Program for providing portions of the compute" and lists several supporting organizations, but gives no hardware specifications.
Software Dependencies | Yes | For all datasets, we apply SMOOTHIE-GLOBAL using Sentence-BERT (all-mpnet-base-v2) embeddings of generations [76].
Experiment Setup | Yes | For all tasks other than Definition Extraction, we evaluate SMOOTHIE-GLOBAL on a 1000-sample subset. For these tasks, we consider two ensembles of LLMs at different size points. At the 3B size point, our ensemble consists of Pythia-2.8B [7], Gemma-2B [91], Incite-3B [17], and Dolly-3B [18]. At the 7B size point, our ensemble consists of Llama-2 [92], Mistral [40], Vicuna [107], Gemma-7B [91], and Nous Capybara [19]. We manually write a single prompt template for each task, and all model generations rely on this template.