Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Activation-Informed Merging of Large Language Models

Authors: Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs with up to 40% increase in benchmark performance.
Researcher Affiliation	Collaboration	1Massachusetts Institute of Technology 2Stony Brook University 3MIT-IBM Watson AI Lab & Red Hat AI Innovation
Pseudocode	No	The paper describes the methodology using mathematical formulas (e.g., Equation 3, 4, 5) and textual descriptions in Section 3, but does not contain a dedicated 'Pseudocode' or 'Algorithm' block.
Open Source Code	Yes	Our code is publicly available at https://github.com/ahnobari/ActivationInformedMerging.
Open Datasets	Yes	We choose the calibration dataset to be a subset of the validation data from the pile dataset [13]... calibration data can be found at https://huggingface.co/datasets/mit-han-lab/pile-val-backup.
Dataset Splits	Yes	We choose the calibration dataset to be a subset of the validation data from the pile dataset [13]... we use the same 256 total sequences (approximately 524K tokens) that the authors of both studies use in their main experiments... For all benchmarks, we use the latest versions and up-to-date implementations developed by Gao et al. [14] except for mathematical reasoning, for which we use the chain of thought prompting used by Luo et al. [24] to replicate the results of the original model as closely as possible.
Hardware Specification	Yes	All experiments are run using 4 H100 GPUs
Software Dependencies	No	The paper mentions using 'Merge Kit implementations developed by Goddard et al. [15]' but does not provide specific version numbers for any software dependencies.
Experiment Setup	Yes	For this experiment, we used ω = 0.4, which we found to be the best balance of performance among the various merging methods we use. ... In many of the merging algorithms, many hyperparameters can be adjusted. In these cases, we use the author-recommended values where available and the default parameters recommended by Goddard et al. [15].
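
As a quick sanity check on the calibration figures quoted in the Dataset Splits row, the short sketch below works out the sequence length implied by "256 total sequences (approximately 524K tokens)". Two things here are assumptions rather than statements from this page: that all sequences share one length, and that this length is 2048 tokens. The names `NUM_SEQUENCES`, `APPROX_TOTAL_TOKENS`, and `ASSUMED_SEQ_LEN` are illustrative only.

```python
# Figures quoted in the Dataset Splits row above.
NUM_SEQUENCES = 256
APPROX_TOTAL_TOKENS = 524_000  # "approximately 524K tokens"

# Assumption (not stated on this page): every sequence has the same length.
implied_len = APPROX_TOTAL_TOKENS / NUM_SEQUENCES

# Assumption: the per-sequence length is the nearby power of two, 2048.
ASSUMED_SEQ_LEN = 2048
exact_total = NUM_SEQUENCES * ASSUMED_SEQ_LEN

print(f"implied tokens per sequence: {implied_len:.1f}")
print(f"total at {ASSUMED_SEQ_LEN} tokens per sequence: {exact_total}")
```

Under these assumptions the total is 524,288 tokens, which agrees with the quoted approximate figure. Anyone reproducing the calibration step should still confirm the actual sequence length against the released code.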