Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling
Authors: Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across multiple benchmarks demonstrate that UNITE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling. |
| Researcher Affiliation | Collaboration | Yuxuan Yao1,2, Han Wu3, Mingyang Liu1,2, Sichun Luo1,2, Xiongwei Han3, Jie Liu4, Zhijiang Guo5, Linqi Song1,2 — 1Department of Computer Science, City University of Hong Kong; 2City University of Hong Kong Shenzhen Research Institute; 3Huawei Noah's Ark Lab; 4North China University of Technology; 5Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | Yes | As presented in Algorithm 1, UNITE integrates the Top-k tokens generated by multiple base models to create a joint set, updates the tokens based on probability rules, and ultimately uses a greedy strategy to select the next token from this set, repeating the process until a stopping criterion is met. |
| Open Source Code | Yes | The code is available at https://github.com/starrYYxuan/UniTE |
| Open Datasets | Yes | Benchmarks: We evaluate six benchmarks, which can be categorized into three main groups. 1) Comprehensive Examination: MMLU (5-shot) (Hendrycks et al., 2021), covering 57 subjects that humans typically learn; ARC-C (0-shot) (Clark et al., 2018), collected from standardized natural science tests. 2) Reasoning Capabilities: GSM8K (4-shot) (Cobbe et al., 2021), a dataset of high-quality problems at the grade school math level; PIQA (0-shot) (Bisk et al., 2020), a commonsense reasoning dataset. 3) Knowledge Capacities: TriviaQA (5-shot) (Joshi et al., 2017), compiled by trivia enthusiasts; Natural Questions (NQ) (5-shot) (Kwiatkowski et al., 2019), a question-answering corpus consisting of queries issued to the Google search engine. |
| Dataset Splits | No | Benchmarks We evaluate six benchmarks, which can be categorized into three main groups. 1) Comprehensive Examination: MMLU (5-shot) (Hendrycks et al., 2021), covering 57 subjects that humans typically learn; ARC-C (0-shot) (Clark et al., 2018)... (This refers to few-shot prompting, not explicit dataset splits for reproduction). |
| Hardware Specification | Yes | All experiments are conducted on 46G NVIDIA L40 GPUs. |
| Software Dependencies | No | The paper does not specify software dependencies or library version numbers. |
| Experiment Setup | Yes | The hyper-parameter k is set to 10 in this work. |
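
The Top-k union step summarized in the Pseudocode row can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `unite_step` is invented here, and the averaging rule used to combine probabilities is an assumption; the paper's Algorithm 1 may apply different probability-update rules.

```python
def unite_step(model_topk_probs, k=10):
    """Pick the next token from the union of each model's top-k tokens.

    model_topk_probs: list of dicts mapping token -> probability,
    one dict per base model. k defaults to 10, matching the paper's
    hyper-parameter setting.
    """
    # 1) Build the joint candidate set: the union of every model's top-k tokens.
    union = set()
    for probs in model_topk_probs:
        top = sorted(probs, key=probs.get, reverse=True)[:k]
        union.update(top)

    # 2) Score each candidate across models (here: a simple average,
    #    treating a token absent from a model's distribution as probability 0).
    scores = {
        tok: sum(p.get(tok, 0.0) for p in model_topk_probs) / len(model_topk_probs)
        for tok in union
    }

    # 3) Greedy selection: return the highest-scoring token. In the full
    #    algorithm this repeats until a stopping criterion is met.
    return max(scores, key=scores.get)
```

For example, with two models whose top-2 distributions disagree on the best token, the union set lets the shared second choice win if its averaged score is higher:

```python
probs_a = {"cat": 0.6, "dog": 0.3, "fish": 0.1}
probs_b = {"dog": 0.5, "cat": 0.4, "bird": 0.1}
unite_step([probs_a, probs_b], k=2)  # "cat": avg 0.5 beats "dog": avg 0.4
```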