Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging
Authors: Zongzhen Yang, Binhang Qi, Hailong Sun, Wenrui Long, Ruobing Zhao, Xiang Gao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes. |
| Researcher Affiliation | Academia | 1State Key Laboratory of Complex & Critical Software Environment (CCSE), Beihang University, Beijing, China 2Hangzhou Innovation Institute of Beihang University, Hangzhou, China 3National University of Singapore, Singapore. Correspondence to: Hailong Sun <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 CABS. Input: task vectors τ_A, τ_B; base model W_base; sparsity level n:m; scaling coefficients λ_A, λ_B. Output: parameters of the merged model W_final. 1: Apply n:m pruning to τ_A and compute mask_A # includes BS 2: τ_B_remaining = τ_B ⊙ (1 − mask_A) to eliminate overlap with τ_A # core step of CA 3: Apply n:m pruning to τ_B_remaining to compute mask_B # includes BS 4: Merge the pruned vectors with the base model: W_final = W_base + λ_A · mask_A ⊙ τ_A + λ_B · mask_B ⊙ τ_B 5: Return W_final |
| Open Source Code | Yes | Our code is available at https://github.com/zongzhenyang/CABS. |
| Open Datasets | Yes | For large-scale model evaluation, we utilized the LLM Leaderboard benchmark, encompassing six key tasks: AI2 Reasoning Challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), TruthfulQA (Lin et al., 2022), Winogrande (Sakaguchi et al., 2021), and GSM8K (Cobbe et al., 2021). ... Additionally, major enterprises have employed model merging techniques in the development of pre-training models, such as Llama3 (Dubey et al., 2024) and Qwen2 (Yang et al., 2024a; Lu et al., 2024), to enhance generalization capabilities and improve performance across a range of tasks. ... For evaluating small-scale models, we utilized the GLUE benchmark, which includes four binary classification tasks: CoLA (Warstadt et al., 2019), SST-2 (Socher et al., 2013), MRPC (Dolan & Brockett, 2005), and RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). To increase task difficulty and diversity, we also included the multiple-choice reading comprehension task RACE (Lai et al., 2017) and the question-answering task SQuAD (Rajpurkar, 2016). |
| Dataset Splits | Yes | Due to the unavailability of test labels, the original validation sets were repurposed as test sets. ... For LLM Leaderboard tasks, the following metrics were used: ARC: success rate (25-shot); HellaSwag: accuracy (10-shot); MMLU and Winogrande: accuracy (5-shot); TruthfulQA: factual accuracy (0-shot); GSM8K: success rate (5-shot). |
| Hardware Specification | Yes | The model evaluations were performed on A100-40GB GPUs. |
| Software Dependencies | Yes | Inference was implemented via the lm-evaluation-harness v0.4.0. |
| Experiment Setup | Yes | For small-scale tasks, we performed a fine-grained λ parameter search with an interval of 0.01 (compared to 0.1 used in previous works) to ensure fair comparisons between methods. In contrast, because of the high computational cost of large-scale experiments (e.g., with 7B models), we followed prior work by adopting a coarser grid interval of 0.1, with equal λ values for all vectors. ... As for the hyperparameters of generative LMs, we set the maximum generation token limit to 256, the temperature to 1.0 for sampling, and the maximum context length to 2048 tokens. |
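The four merging steps quoted in the Pseudocode row above can be sketched in NumPy. This is a hypothetical illustration, not the authors' implementation: the magnitude-based top-n selection within each block of m and all function names are assumptions; only the step ordering (prune τ_A, remove overlap from τ_B, prune the remainder, weighted merge) follows the quoted Algorithm 1.

```python
# Hedged sketch of CABS merging (conflict-aware n:m sparsification), assuming
# magnitude-based n:m pruning; not the authors' code.
import numpy as np

def nm_prune_mask(vec, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries per block of m."""
    assert vec.size % m == 0, "vector length must be divisible by m"
    blocks = np.abs(vec).reshape(-1, m)
    # indices of the top-n magnitudes within each block of m
    top = np.argsort(blocks, axis=1)[:, -n:]
    mask = np.zeros_like(blocks)
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(vec.shape)

def cabs_merge(w_base, tau_a, tau_b, n, m, lam_a, lam_b):
    """Merge two task vectors into a base model following the quoted steps."""
    mask_a = nm_prune_mask(tau_a, n, m)            # step 1: n:m prune tau_A
    tau_b_rem = tau_b * (1.0 - mask_a)             # step 2: drop overlap with tau_A
    mask_b = nm_prune_mask(tau_b_rem, n, m)        # step 3: n:m prune the remainder
    # step 4: weighted merge of the pruned vectors with the base model
    return w_base + lam_a * mask_a * tau_a + lam_b * mask_b * tau_b_rem
```

For example, with `n=1, m=2` each pair of weights keeps its larger-magnitude entry, and step 2 guarantees the two masks select disjoint positions, which is the conflict-aware property the algorithm's comments refer to.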