Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Activation-Informed Merging of Large Language Models
Authors: Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs with up to 40% increase in benchmark performance. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology 2Stony Brook University 3MIT-IBM Watson AI Lab & Red Hat AI Innovation |
| Pseudocode | No | The paper describes the methodology using mathematical formulas (e.g., Equation 3, 4, 5) and textual descriptions in Section 3, but does not contain a dedicated 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/ahnobari/ActivationInformedMerging. |
| Open Datasets | Yes | We choose the calibration dataset to be a subset of the validation data from the pile dataset [13]... calibration data can be found at https://huggingface.co/datasets/mit-han-lab/pile-val-backup. |
| Dataset Splits | Yes | We choose the calibration dataset to be a subset of the validation data from the pile dataset [13]... we use the same 256 total sequences (approximately 524K tokens) that the authors of both studies use in their main experiments... For all benchmarks, we use the latest versions and up-to-date implementations developed by Gao et al. [14] except for mathematical reasoning, for which we use the chain of thought prompting used by Luo et al. [24] to replicate the results of the original model as closely as possible. |
| Hardware Specification | Yes | All experiments are run using 4 H100 GPUs |
| Software Dependencies | No | The paper mentions using 'MergeKit implementations developed by Goddard et al. [15]' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For this experiment, we used ω = 0.4, which we found to be the best balance of performance among the various merging methods we use. ... In many of the merging algorithms, many hyperparameters can be adjusted. In these cases, we use the author-recommended values where available and the default parameters recommended by Goddard et al. [15]. |
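
The calibration-set size quoted in the Dataset Splits row (256 sequences, approximately 524K tokens) can be sanity-checked with a small sketch. It assumes a fixed sequence length of 2048 tokens, which is inferred from 524K / 256 and is not stated in the table; the helper name `pack_calibration_sequences` and the stand-in token stream are hypothetical and are not taken from the paper's code.

```python
from typing import Iterable, List

NUM_SEQUENCES = 256  # stated in the Dataset Splits row
SEQ_LEN = 2048       # assumption: inferred from ~524K tokens / 256 sequences


def pack_calibration_sequences(token_stream: Iterable[int],
                               num_sequences: int = NUM_SEQUENCES,
                               seq_len: int = SEQ_LEN) -> List[List[int]]:
    """Cut a flat token stream into fixed-length calibration sequences.

    Hypothetical helper for illustration; stops once `num_sequences`
    full sequences have been collected and drops any partial remainder.
    """
    sequences: List[List[int]] = []
    current: List[int] = []
    for token in token_stream:
        current.append(token)
        if len(current) == seq_len:
            sequences.append(current)
            current = []
            if len(sequences) == num_sequences:
                break
    return sequences


# Stand-in token ids; a real run would tokenize the pile validation text.
fake_stream = (i % 32000 for i in range(600_000))
calibration = pack_calibration_sequences(fake_stream)
total_tokens = sum(len(seq) for seq in calibration)
print(len(calibration), total_tokens)  # 256 524288
```

Under that assumption the total comes to 524,288 tokens, which matches the "approximately 524K" figure quoted above.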