Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Activation-Guided Consensus Merging for Large Language Models

Authors: Yuxuan Yao, Shuqi LIU, Zehua Liu, Qintong Li, Mingyang LIU, Xiongwei Han, Zhijiang Guo, Han Wu, Linqi Song

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on Long-to-Short (L2S) and general merging tasks demonstrate that ACM consistently outperforms all baseline methods. For instance, in the case of Qwen-7B models, TIES-Merging equipped with ACM achieves a 55.3% reduction in response length while simultaneously improving reasoning accuracy by 1.3 points. Our code is available at ACM
Researcher Affiliation	Collaboration	Yuxuan Yao1,2, Shuqi Liu3, Zehua Liu3, Qintong Li4, Mingyang Liu1,2, Xiongwei Han3, Zhijiang Guo5,6, Han Wu3, Linqi Song1,2 1Department of Computer Science, City University of Hong Kong 2City University of Hong Kong Shenzhen Research Institute 3Huawei Noah's Ark Lab, Hong Kong SAR 4University of Hong Kong 5Hong Kong University of Science and Technology (Guangzhou) 6Hong Kong University of Science and Technology
Pseudocode	No	The paper describes the method using prose and mathematical equations in Section 3 and shows an overall framework diagram in Figure 1, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at ACM... Please refer to the submitted code in support materials, and it will be publicly available.
Open Datasets	Yes	Model performance is assessed on established reasoning datasets, including GSM8K [4], MATH500 [17], Minerva Math [15], OlympiadBench [9], College Math [36], and AIME 2024. To ensure reproducibility, we employed the public evaluation toolkit provided by QwenLM, adhering to their recommended versions of dependencies. Code generation capabilities are measured on HumanEval Pro [47] and LiveCodeBench [12]. For activation-based merging, the s1K dataset [25] is used as the calibration data... We have also chosen the recently released LIMO [44] dataset as the calibration dataset for our experiments.
Dataset Splits	No	For activation-based merging, the s1K dataset [25] is used as the calibration data, which provides aligned short- and long-CoT answers for each question. The dataset containing about 1000 pieces of data is first clustered by the K-means algorithm into 20 categories, followed by an even sampling of 10% of the total data.
Hardware Specification	No	Due to constrained computational resources, we did not conduct evaluations on extremely large-scale models, such as LLaMA-3.1-70B. The paper does not specify any particular GPU, CPU models, or detailed computer specifications used for experiments.
Software Dependencies	No	To ensure reproducibility, we employed the public evaluation toolkit provided by QwenLM, adhering to their recommended versions of dependencies.
Experiment Setup	Yes	The maximum sequence lengths for quick and slow-thinking models are set to 8K and 10K, respectively. All models are uploaded and evaluated with the BF16 data type. We report the average scores across five runs with different random seeds. ... When applying Task Arithmetic, we utilize a scaling coefficient of λ = 0.7 by default. ... to obtain stable performance with DARE, we configure its scaling coefficient to λ = 0.7 and its default drop rate to r = 0.3. For AIM, we utilize ω = 0.4 as recommended. ... For our ACM, we set t = 0.7. ... To ensure the merged model retains the capabilities of all individual models, we set the hyperparameter for the TA method to 0.2, while the TIES method uses coefficients of 0.7 and 0.2, respectively. Accordingly, the hyperparameter t for our ACM method is set to -1.8.
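
The Task Arithmetic and DARE settings quoted in the Experiment Setup row (λ = 0.7, r = 0.3) can be illustrated with a minimal sketch. It assumes the standard formulations of these two baselines: Task Arithmetic adds the λ-scaled sum of task vectors (fine-tuned weights minus base weights) to the base model, and DARE zeroes each task-vector entry with probability r and rescales the survivors by 1 / (1 - r). This is not the paper's ACM method or its released code; the function names and the toy parameter dictionaries are hypothetical and exist only for illustration.

```python
import random


def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def dare(delta, drop_rate, rng):
    """Zero each entry with probability drop_rate; rescale the rest by 1/(1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return {
        name: [0.0 if rng.random() < drop_rate else d * scale for d in values]
        for name, values in delta.items()
    }


def task_arithmetic(base, deltas, lam):
    """Add the lam-scaled sum of task vectors to the base parameters."""
    merged = {}
    for name, base_values in base.items():
        summed = [0.0] * len(base_values)
        for delta in deltas:
            summed = [s + d for s, d in zip(summed, delta[name])]
        merged[name] = [b + lam * s for b, s in zip(base_values, summed)]
    return merged


if __name__ == "__main__":
    # Hypothetical toy "models": one parameter tensor flattened to a list.
    base = {"w": [1.0, 2.0]}
    quick = {"w": [2.0, 2.0]}
    slow = {"w": [1.0, 4.0]}

    deltas = [task_vector(quick, base), task_vector(slow, base)]

    # Plain Task Arithmetic with the quoted default lambda = 0.7.
    print(task_arithmetic(base, deltas, lam=0.7))

    # DARE with the quoted drop rate r = 0.3, followed by the same merge.
    rng = random.Random(0)  # fixed seed so the random mask is repeatable
    sparse = [dare(d, drop_rate=0.3, rng=rng) for d in deltas]
    print(task_arithmetic(base, sparse, lam=0.7))
```

Because DARE draws a random mask, its merged output depends on the seed, which is consistent with the row's note that scores are averaged over five runs with different random seeds.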