Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging
Authors: Zongzhen Yang, Binhang Qi, Hailong Sun, Wenrui Long, Ruobing Zhao, Xiang Gao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes. |
| Researcher Affiliation | Academia | 1State Key Laboratory of Complex & Critical Software Environment (CCSE), Beihang University, Beijing, China 2Hangzhou Innovation Institute of Beihang University, Hangzhou, China 3National University of Singapore, Singapore. Correspondence to: Hailong Sun <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 CABS. Input: task vectors τ_A, τ_B; base model W_base; sparsity level n:m; scaling coefficients λ_A, λ_B. Output: parameters of the merged model W_final. 1: Apply n:m pruning to τ_A and compute mask_A # includes BS 2: τ_B_remaining = τ_B ⊙ (1 − mask_A) to eliminate overlap with τ_A # core step of CA 3: Apply n:m pruning to τ_B_remaining to compute mask_B # includes BS 4: Merge the pruned vectors with the base model: W_final = W_base + λ_A · mask_A ⊙ τ_A + λ_B · mask_B ⊙ τ_B 5: Return W_final |
| Open Source Code | Yes | Our code is available at https://github.com/zongzhenyang/CABS. |
| Open Datasets | Yes | For large-scale model evaluation, we utilized the LLM Leaderboard benchmark, encompassing six key tasks: AI2 Reasoning Challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), TruthfulQA (Lin et al., 2022), Winogrande (Sakaguchi et al., 2021), and GSM8K (Cobbe et al., 2021). ... Additionally, major enterprises have employed model merging techniques in the development of pre-training models, such as Llama3 (Dubey et al., 2024) and Qwen2 (Yang et al., 2024a; Lu et al., 2024), to enhance generalization capabilities and improve performance across a range of tasks. ... For evaluating small-scale models, we utilized the GLUE benchmark, which includes four binary classification tasks: CoLA (Warstadt et al., 2019), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), and RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). To increase task difficulty and diversity, we also included the multiple-choice reading comprehension task RACE (Lai et al., 2017) and the question-answering task SQuAD (Rajpurkar, 2016). |
| Dataset Splits | Yes | Due to the unavailability of test labels, the original validation sets were repurposed as test sets. ... For LLM Leaderboard tasks, the following metrics were used: ARC: success rate (25-shot); HellaSwag: accuracy (10-shot); MMLU and Winogrande: accuracy (5-shot); TruthfulQA: factual accuracy (0-shot); GSM8K: success rate (5-shot). |
| Hardware Specification | Yes | The model evaluations were performed on A100-40GB GPUs. |
| Software Dependencies | Yes | Inference was implemented via the lm-evaluation-harness v0.4.0. |
| Experiment Setup | Yes | For small-scale tasks, we performed a fine-grained λ parameter search with an interval of 0.01 (compared to 0.1 used in previous works) to ensure fair comparisons between methods. In contrast, because of the high computational cost of large-scale experiments (e.g., with 7B models), we followed prior work by adopting a coarser grid interval of 0.1, with equal λ values for all vectors. ... As for the hyperparameters of generative LMs, we set the maximum generation token limit to 256, the temperature to 1.0 for sampling, and the maximum context length to 2048 tokens. |
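The four merging steps quoted in the Pseudocode row above can be sketched in NumPy. This is a hypothetical illustration, not the authors' implementation: the magnitude-based top-n selection within each block of m and all function names are assumptions; only the step ordering (prune τ_A, remove overlap from τ_B, prune the remainder, weighted merge) follows the quoted Algorithm 1.

```python
# Hedged sketch of CABS merging (conflict-aware n:m sparsification), assuming
# magnitude-based n:m pruning; not the authors' code.
import numpy as np

def nm_prune_mask(vec, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries per block of m."""
    assert vec.size % m == 0, "vector length must be divisible by m"
    blocks = np.abs(vec).reshape(-1, m)
    # indices of the top-n magnitudes within each block of m
    top = np.argsort(blocks, axis=1)[:, -n:]
    mask = np.zeros_like(blocks)
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(vec.shape)

def cabs_merge(w_base, tau_a, tau_b, n, m, lam_a, lam_b):
    """Merge two task vectors into a base model following the quoted steps."""
    mask_a = nm_prune_mask(tau_a, n, m)            # step 1: n:m prune tau_A
    tau_b_rem = tau_b * (1.0 - mask_a)             # step 2: drop overlap with tau_A
    mask_b = nm_prune_mask(tau_b_rem, n, m)        # step 3: n:m prune the remainder
    # step 4: weighted merge of the pruned vectors with the base model
    return w_base + lam_a * mask_a * tau_a + lam_b * mask_b * tau_b_rem
```

For example, with `n=1, m=2` each pair of weights keeps its larger-magnitude entry, and step 2 guarantees the two masks select disjoint positions, which is the conflict-aware property the algorithm's comments refer to.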