AutoMix: Automatically Mixing Language Models
Authors: Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across five language models and five challenging datasets show that AutoMix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance. |
| Researcher Affiliation | Collaboration | Carnegie Mellon University, xAI, Google, Google DeepMind, IIT Delhi, University of Southern California; automix-models@googlegroups.com |
| Pseudocode | Yes | Algorithm 1 Unweighted Particle Filtering Update |
| Open Source Code | Yes | Code available at github.com/automix-llm/automix |
| Open Datasets | Yes | We experiment with a diverse set of datasets: i) QASPER [Dasigi et al., 2021]: Question answering over research papers; ii) QUALITY [Pang et al., 2022]: Multiple-choice questions (MCQ) on long articles and stories; iii) COQA [Reddy et al., 2019]: Conversational comprehension requiring coreference and pragmatic reasoning; iv) MUTUAL [Cui et al., 2020]: Multi-turn dialogue reasoning (next response prediction); v) DIPLOMAT [Li et al., 2023]: Pragmatic identification and reasoning questions on multi-turn dialogues. |
| Dataset Splits | Yes | We use the default validation splits and utilize prompts from Shaham et al. [2023] for QASPER and QUALITY, and adapt the QUALITY prompt for other datasets. |
| Hardware Specification | No | For running our experiments, we use LLAMA2-13B and GPT-4 models from huggingface. We use vllm [Kwon et al., 2023] for hosting models for inference. ... We thank the IIT Delhi HPC facility for its computational resources. |
| Software Dependencies | No | We use vllm [Kwon et al., 2023] for hosting models for inference. |
| Experiment Setup | Yes | We use greedy decoding (temperature 0) and draw a single sample for both the SLM and LLM. For verification, we generate eight samples per question (temperature = 1), which has negligible cost owing to the large context. |
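
To make the Experiment Setup and Software Dependencies rows concrete, below is a minimal sketch of the described inference configuration using vllm [Kwon et al., 2023]: greedy decoding (temperature 0, one sample) for answering, and eight samples at temperature 1 for verification. The model identifier, prompt placeholders, and `max_tokens` values are illustrative assumptions, not values reported in the paper.

```python
# Sketch of the paper's stated inference setup via vllm.
# Assumptions: model id, max_tokens, and prompt contents are placeholders.
from vllm import LLM, SamplingParams

# Host the smaller language model (SLM) locally with vllm.
slm = LLM(model="meta-llama/Llama-2-13b-hf")  # assumed HF model id

# Greedy decoding (temperature 0), a single sample for the answer.
answer_params = SamplingParams(temperature=0.0, n=1, max_tokens=256)

# Self-verification: eight samples per question at temperature 1.
verify_params = SamplingParams(temperature=1.0, n=8, max_tokens=16)

question_prompt = "..."       # placeholder: task prompt over the long context
verification_prompt = "..."   # placeholder: verification prompt for the SLM answer

answer = slm.generate([question_prompt], answer_params)[0].outputs[0].text
verdicts = [o.text for o in slm.generate([verification_prompt], verify_params)[0].outputs]
```

Because the eight verification samples share the already-processed long context, their marginal cost is small, which is why the paper describes this step as having negligible cost.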
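The Pseudocode row names "Algorithm 1 Unweighted Particle Filtering Update" but does not reproduce it. For orientation only, here is a generic bootstrap-style particle filtering update, in which resampling proportional to observation likelihood yields an unweighted particle set; this is a textbook sketch under assumed `transition` and `likelihood` callables, not the paper's exact algorithm.

```python
# Generic unweighted particle filtering update (bootstrap-filter form).
# Assumptions: `transition` and `likelihood` are caller-supplied models;
# this is not a reconstruction of the paper's Algorithm 1.
import random

def particle_filter_update(particles, transition, likelihood, observation):
    """One update step: propagate, weight by the observation likelihood,
    then resample with replacement to return an unweighted particle set."""
    # Propagate each particle through the (stochastic) transition model.
    propagated = [transition(p) for p in particles]
    # Weight each particle by how well it explains the observation.
    weights = [likelihood(observation, p) for p in propagated]
    if sum(weights) == 0:
        return propagated  # degenerate case: keep the propagated set as-is
    # Resampling proportional to weight leaves all particles equally weighted.
    return random.choices(propagated, weights=weights, k=len(particles))
```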