Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Authors: Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations across multiple benchmarks demonstrate that UNITE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.
Researcher Affiliation | Collaboration | Yuxuan Yao1,2, Han Wu3, Mingyang Liu1,2, Sichun Luo1,2, Xiongwei Han3, Jie Liu4, Zhijiang Guo5, Linqi Song1,2. 1Department of Computer Science, City University of Hong Kong; 2City University of Hong Kong Shenzhen Research Institute; 3Huawei Noah's Ark Lab; 4North China University of Technology; 5Hong Kong University of Science and Technology (Guangzhou).
Pseudocode | Yes | As presented in Algorithm 1, UNITE integrates the Top-k tokens generated by multiple base models into a joint set, updates the tokens' probabilities according to its probability rules, and ultimately uses a greedy strategy to select the next token from this set, repeating the process until a stopping criterion is met.
Open Source Code | Yes | The code is available at https://github.com/starrYYxuan/UniTE
Open Datasets | Yes | Benchmarks: We evaluate six benchmarks, which can be categorized into three main groups. 1) Comprehensive Examination: MMLU (5-shot) (Hendrycks et al., 2021), covering 57 subjects that humans typically learn; ARC-C (0-shot) (Clark et al., 2018), collected from standardized natural science tests. 2) Reasoning Capabilities: GSM8K (4-shot) (Cobbe et al., 2021), a dataset of high-quality problems at the grade-school math level; PIQA (0-shot) (Bisk et al., 2020), a commonsense reasoning dataset. 3) Knowledge Capacities: TriviaQA (5-shot) (Joshi et al., 2017), compiled by trivia enthusiasts; Natural Questions (NQ) (5-shot) (Kwiatkowski et al., 2019), a question-answering corpus consisting of queries issued to the Google search engine.
Dataset Splits | No | The paper lists the six benchmarks and their prompting setups (e.g., MMLU 5-shot, ARC-C 0-shot), but the shot counts refer to few-shot prompting, not explicit train/validation/test splits for reproduction.
Hardware Specification | Yes | All experiments are conducted on 46G NVIDIA L40 GPUs.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | The hyper-parameter k is set to 10 in this work.
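The Top-k union procedure quoted in the Pseudocode row can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the base models already share a vocabulary, represents each model's next-token distribution as a plain dict, and averages probabilities over the joint set; the paper's actual probability-update rules and cross-tokenizer alignment live in Algorithm 1 and the released code.

```python
def unite_step(model_probs, k=10):
    """One hypothetical UNITE-style decoding step.

    model_probs: list of dicts mapping token -> probability, one per base model.
    Returns the token chosen greedily from the union of each model's Top-k set.
    """
    # Form the joint set from each model's Top-k tokens.
    union = set()
    for probs in model_probs:
        topk = sorted(probs, key=probs.get, reverse=True)[:k]
        union.update(topk)
    # Score each candidate by its average probability across models
    # (a token absent from a model's distribution contributes 0).
    scores = {
        tok: sum(p.get(tok, 0.0) for p in model_probs) / len(model_probs)
        for tok in union
    }
    # Greedy selection over the joint set.
    return max(scores, key=scores.get)

model_a = {"Paris": 0.6, "London": 0.2, "Rome": 0.1}
model_b = {"Paris": 0.5, "Berlin": 0.3, "London": 0.1}
print(unite_step([model_a, model_b], k=2))  # "Paris"
```

In generation, this step would be repeated (with k = 10, per the Experiment Setup row), appending the chosen token to each model's context until a stopping criterion is met.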