AutoMix: Automatically Mixing Language Models

Authors: Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across five language models and five challenging datasets show that AutoMix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
Researcher Affiliation | Collaboration | Carnegie Mellon University, xAI, Google, Google DeepMind, IIT Delhi, University of Southern California; automix-models@googlegroups.com
Pseudocode | Yes | Algorithm 1: Unweighted Particle Filtering Update (a generic sketch of this kind of update appears after this table)
Open Source Code | Yes | Code available at github.com/automix-llm/automix
Open Datasets | Yes | We experiment with a diverse set of datasets: i) QASPER [Dasigi et al., 2021]: Question answering over research papers; ii) QUALITY [Pang et al., 2022]: Multiple-choice questions (MCQ) on long articles and stories; iii) COQA [Reddy et al., 2019]: Conversational comprehension requiring coreference and pragmatic reasoning; iv) MUTUAL [Cui et al., 2020]: Multi-turn dialogue reasoning (next response prediction); v) DIPLOMAT [Li et al., 2023]: Pragmatic identification and reasoning questions on multi-turn dialogues.
Dataset Splits | Yes | We use the default validation splits and utilize prompts from Shaham et al. [2023] for QASPER and QUALITY, and adapt the QUALITY prompt for other datasets.
Hardware Specification | No | For running our experiments, we use LLAMA2-13B and GPT-4 models from huggingface. We use vllm [Kwon et al., 2023] for hosting models for inference. ... We thank the IIT Delhi HPC facility for its computational resources.
Software Dependencies | No | We use vllm [Kwon et al., 2023] for hosting models for inference.
Experiment Setup | Yes | We use greedy decoding (temperature 0) and draw a single sample for both the SLM and LLM. For verification, we generate eight samples per question (temperature = 1), which has negligible cost owing to the large context.
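The pseudocode row above names an unweighted particle filtering update. As a point of reference, the sketch below shows a generic rejection-style (unweighted) particle filter belief update of the kind commonly used in POMDP solvers; the function names, model callbacks, and particle counts are illustrative assumptions and are not taken from the paper's Algorithm 1.

```python
import random

def unweighted_particle_filter_update(particles, action, observation,
                                       transition_fn, observation_fn,
                                       num_particles=1000, max_tries=100_000):
    """Generic rejection-style (unweighted) particle filter belief update.

    particles      : list of hypothesised hidden states (the current belief)
    transition_fn  : (state, action) -> sampled next state        (assumed model)
    observation_fn : (next_state, action) -> sampled observation  (assumed model)

    Each particle is drawn from the current belief and propagated through the
    transition model; it survives only if the observation it simulates matches
    the observation actually received, so the surviving, equally weighted
    particles approximate the updated belief.
    """
    new_particles = []
    tries = 0
    while len(new_particles) < num_particles and tries < max_tries:
        tries += 1
        state = random.choice(particles)              # sample from current belief
        next_state = transition_fn(state, action)     # propagate through model
        simulated_obs = observation_fn(next_state, action)
        if simulated_obs == observation:              # keep only matching particles
            new_particles.append(next_state)
    return new_particles or particles                 # fall back if rejection starves
```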
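The hardware, software, and setup rows report that models are served with vLLM, with greedy decoding and a single sample for answering, and eight samples at temperature 1 for verification. The minimal sketch below illustrates that configuration with the vLLM Python API; the model identifier, prompts, and token limits are assumed placeholders rather than values taken from the paper.

```python
from vllm import LLM, SamplingParams

# Hypothetical model identifier; the paper reports serving LLAMA2-13B with vLLM.
slm = LLM(model="meta-llama/Llama-2-13b-chat-hf")

# Greedy decoding, single sample, for the SLM's answer (temperature 0).
answer_params = SamplingParams(temperature=0.0, max_tokens=256)

# Eight samples at temperature 1 for self-verification.
verify_params = SamplingParams(temperature=1.0, n=8, max_tokens=64)

question_prompt = "..."      # context + question prompt (placeholder)
verification_prompt = "..."  # verification prompt over the SLM's answer (placeholder)

answer = slm.generate([question_prompt], answer_params)[0].outputs[0].text
verifications = [o.text for o in
                 slm.generate([verification_prompt], verify_params)[0].outputs]
```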