Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Authors: Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations across multiple benchmarks demonstrate that UNITE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.
Researcher Affiliation | Collaboration | Yuxuan Yao1,2, Han Wu3, Mingyang Liu1,2, Sichun Luo1,2, Xiongwei Han3, Jie Liu4, Zhijiang Guo5, Linqi Song1,2. 1Department of Computer Science, City University of Hong Kong; 2City University of Hong Kong Shenzhen Research Institute; 3Huawei Noah's Ark Lab; 4North China University of Technology; 5Hong Kong University of Science and Technology (Guangzhou).
Pseudocode | Yes | As presented in Algorithm 1, UNITE integrates the Top-k tokens generated by multiple base models into a joint set, updates the tokens' probabilities according to its probability rules, and ultimately uses a greedy strategy to select the next token from this set, repeating the process until a stopping criterion is met.
Open Source Code | Yes | The code is available at https://github.com/starrYYxuan/UniTE
Open Datasets | Yes | Benchmarks: We evaluate six benchmarks, which can be categorized into three main groups. 1) Comprehensive Examination: MMLU (5-shot) (Hendrycks et al., 2021), covering 57 subjects that humans typically learn; ARC-C (0-shot) (Clark et al., 2018), collected from standardized natural science tests. 2) Reasoning Capabilities: GSM8K (4-shot) (Cobbe et al., 2021), a dataset of high-quality problems at the grade-school math level; PIQA (0-shot) (Bisk et al., 2020), a commonsense reasoning dataset. 3) Knowledge Capacities: TriviaQA (5-shot) (Joshi et al., 2017), compiled by trivia enthusiasts; Natural Questions (NQ) (5-shot) (Kwiatkowski et al., 2019), a question-answering corpus consisting of queries issued to the Google search engine.
Dataset Splits | No | The paper lists the six benchmarks and their prompting setups (e.g., MMLU 5-shot, ARC-C 0-shot), but the shot counts refer to few-shot prompting, not explicit train/validation/test splits for reproduction.
Hardware Specification | Yes | All experiments are conducted on 46G NVIDIA L40 GPUs.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | The hyper-parameter k is set to 10 in this work.
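The Top-k union procedure quoted in the Pseudocode row can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the base models already share a vocabulary, represents each model's next-token distribution as a plain dict, and averages probabilities over the joint set; the paper's actual probability-update rules and cross-tokenizer alignment live in Algorithm 1 and the released code.

```python
def unite_step(model_probs, k=10):
    """One hypothetical UNITE-style decoding step.

    model_probs: list of dicts mapping token -> probability, one per base model.
    Returns the token chosen greedily from the union of each model's Top-k set.
    """
    # Form the joint set from each model's Top-k tokens.
    union = set()
    for probs in model_probs:
        topk = sorted(probs, key=probs.get, reverse=True)[:k]
        union.update(topk)
    # Score each candidate by its average probability across models
    # (a token absent from a model's distribution contributes 0).
    scores = {
        tok: sum(p.get(tok, 0.0) for p in model_probs) / len(model_probs)
        for tok in union
    }
    # Greedy selection over the joint set.
    return max(scores, key=scores.get)

model_a = {"Paris": 0.6, "London": 0.2, "Rome": 0.1}
model_b = {"Paris": 0.5, "Berlin": 0.3, "London": 0.1}
print(unite_step([model_a, model_b], k=2))  # "Paris"
```

In generation, this step would be repeated (with k = 10, per the Experiment Setup row), appending the chosen token to each model's context until a stopping criterion is met.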