AutoMix: Automatically Mixing Language Models

Authors: Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across five language models and five challenging datasets show that AutoMix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
Researcher Affiliation	Collaboration	Carnegie Mellon University; xAI; Google; Google DeepMind; IIT Delhi; University of Southern California. Contact: automix-models@googlegroups.com
Pseudocode	Yes	Algorithm 1 Unweighted Particle Filtering Update
Open Source Code	Yes	Code available at github.com/automix-llm/automix
Open Datasets	Yes	We experiment with a diverse set of datasets: i) QASPER [Dasigi et al., 2021]: Question answering over research papers; ii) QUALITY [Pang et al., 2022]: Multiple-choice questions (MCQ) on long articles and stories; iii) COQA [Reddy et al., 2019]: Conversational comprehension requiring coreference and pragmatic reasoning; iv) MUTUAL [Cui et al., 2020]: Multi-turn dialogue reasoning (next response prediction); v) DIPLOMAT [Li et al., 2023]: Pragmatic identification and reasoning questions on multi-turn dialogues.
Dataset Splits	Yes	We use the default validation splits and utilize prompts from Shaham et al. [2023] for QASPER and QUALITY, and adapt the QUALITY prompt for other datasets.
Hardware Specification	No	For running our experiments, we use LLAMA2-13B and GPT-4 models from huggingface. We use vllm [Kwon et al., 2023] for hosting models for inference. ... We thank the IIT Delhi HPC facility for its computational resources.
Software Dependencies	No	We use vllm [Kwon et al., 2023] for hosting models for inference.
Experiment Setup	Yes	We use greedy decoding (temperature 0) and draw a single sample for both the SLM and LLM. For verification, we generate eight samples per question (temperature = 1), which has negligible cost owing to the large context.
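
The decoding settings in the Experiment Setup row can be written down as a small configuration, shown here as a minimal Python sketch. The temperatures and sample counts come from that row; everything else is an assumption made for illustration: the names `DecodingConfig`, `verifier_confidence`, and `route_to_llm` are hypothetical, and both the vote-fraction aggregation of the eight verifier samples and the fixed threshold are simple stand-ins, not the paper's routing method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DecodingConfig:
    """Sampling settings for one generation step (hypothetical helper type)."""
    temperature: float
    num_samples: int


# Values quoted in the Experiment Setup row.
ANSWER_DECODING = DecodingConfig(temperature=0.0, num_samples=1)    # greedy, SLM and LLM
VERIFIER_DECODING = DecodingConfig(temperature=1.0, num_samples=8)  # verification step


def verifier_confidence(verdicts: List[bool]) -> float:
    """Fraction of verifier samples that judged the answer correct.

    Assumption: the sampled verdicts are aggregated by a plain vote fraction.
    """
    if not verdicts:
        raise ValueError("at least one verifier verdict is required")
    return sum(verdicts) / len(verdicts)


def route_to_llm(confidence: float, threshold: float = 0.5) -> bool:
    """Escalate to the larger model when confidence falls below the threshold.

    Assumption: a fixed threshold, used here only as an illustrative router.
    """
    return confidence < threshold


# Example: 6 of 8 verifier samples accept the small model's answer.
example_verdicts = [True] * 6 + [False] * 2
example_confidence = verifier_confidence(example_verdicts)
example_escalate = route_to_llm(example_confidence)
```

With six of eight verdicts positive, the confidence is 0.75 and the query stays with the smaller model under the assumed 0.5 threshold.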