Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AutoMix: Automatically Mixing Language Models
Authors: Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across five language models and five challenging datasets show that AutoMix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance. |
| Researcher Affiliation | Collaboration | Carnegie Mellon University, xAI, Google, Google DeepMind, IIT Delhi, University of Southern California |
| Pseudocode | Yes | Algorithm 1 Unweighted Particle Filtering Update |
| Open Source Code | Yes | Code available at github.com/automix-llm/automix |
| Open Datasets | Yes | We experiment with a diverse set of datasets: i) QASPER [Dasigi et al., 2021]: Question answering over research papers; ii) QUALITY [Pang et al., 2022]: Multiple-choice questions (MCQ) on long articles and stories; iii) COQA [Reddy et al., 2019]: Conversational comprehension requiring coreference and pragmatic reasoning; iv) MUTUAL [Cui et al., 2020]: Multi-turn dialogue reasoning (next response prediction); v) DIPLOMAT [Li et al., 2023]: Pragmatic identification and reasoning questions on multi-turn dialogues. |
| Dataset Splits | Yes | We use the default validation splits and utilize prompts from Shaham et al. [2023] for QASPER and QUALITY, and adapt the QUALITY prompt for other datasets. |
| Hardware Specification | No | For running our experiments, we use LLAMA2-13B and GPT-4 models from huggingface. We use vllm [Kwon et al., 2023] for hosting models for inference. ... We thank the IIT Delhi HPC facility for its computational resources. |
| Software Dependencies | No | We use vllm [Kwon et al., 2023] for hosting models for inference. |
| Experiment Setup | Yes | We use greedy decoding (temperature 0) and draw a single sample for both the SLM and LLM. For verification, we generate eight samples per question (temperature = 1), which has negligible cost owing to the large context. |
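
The Experiment Setup row mentions eight verification samples per question at temperature 1. The sketch below shows one way such samples could be combined into a confidence estimate and a routing decision. It is a minimal illustration under stated assumptions, not the authors' implementation: the majority-fraction aggregation, the `threshold` parameter, and the `sample_verdict` callable (standing in for one sampled verifier call) are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


def verifier_confidence(votes: List[bool]) -> float:
    """Return the fraction of verification samples that judged the answer correct."""
    if not votes:
        raise ValueError("at least one verification sample is required")
    return sum(votes) / len(votes)


def route(
    question: str,
    slm_answer: str,
    sample_verdict: Callable[[str, str], bool],  # hypothetical: one sampled verifier call
    n_samples: int = 8,       # eight samples per question, as stated in the table
    threshold: float = 0.5,   # hypothetical cutoff, not taken from the paper
) -> Tuple[str, float]:
    """Keep the small model's answer if the verifier is confident enough, else escalate."""
    votes = [sample_verdict(question, slm_answer) for _ in range(n_samples)]
    confidence = verifier_confidence(votes)
    decision = "slm" if confidence >= threshold else "llm"
    return decision, confidence
```

In practice `sample_verdict` would wrap a temperature-1 call to the verifier model. A fixed threshold is only the simplest possible routing rule and is used here purely for illustration.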