Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models

Authors: Yu Zhou, Xingyu Wu, Jibin Wu, Liang Feng, KC Tan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on language and vision tasks demonstrate that HM3 outperforms methods focusing solely on the parameter or architecture space.
Researcher Affiliation	Academia	1 Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR; 2 Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR; 3 College of Computer Science, Chongqing University, Chongqing, China
Pseudocode	Yes	The overall algorithm of HM3 is summarized as Algorithm 1.
Open Source Code	Yes	The implementation of HM3 is available at this page.
Open Datasets	Yes	For language tasks, we used LLAMA-2-7B [60], Qwen-2.5-1.5B [75], and LLAMA-2-13B [60] as backbones across four subtasks: generative task, text translation, math reasoning, and code generation. For generative tasks, we used GLUE benchmark [63] to evaluate the general capability of large pretrained models. For translation, we used WMT14, WMT16 [50], and IWSLT2017 [7] (WMT&ISWT), evaluated by the chrf metric as well as Xnli [15] evaluated by the accuracy metric. For math reasoning, we used GSM8K [12] with the flexible match metric, and used MathQA [3] with the accuracy metric. For code generation, HumanEval [9] and MBPP [5] was used with the pass@1 and pass@100 metric. For vision tasks, we adopted ViT-B/32 and ViT-L/14 from CLIP [46] as backbones, and evaluated on eight datasets: DTD [11], GTSRB [52], RESISC45 [10], SUN397 [70], SVHN [45], MNIST [32], Cars [30], and EuroSAT [23], using classification accuracy.
Dataset Splits	Yes	We split the dataset where 70% is used for RL inference evaluation, while the 30% is reserved for the evaluation of the obtained merged model.
Hardware Specification	Yes	Additionally, Qwen-2.5-1.5B was evaluated on four 3090 GPUs (24GB each), while LLaMA-2-7B and LLaMA-2-13B were evaluated on four A6000 GPUs (48GB each). All models can also be deployed on a single GPU. In this paper, we used A6000 or 3090 GPUs for RL training of HM3, whereas full fine-tuning typically requires A100-level GPU clusters.
Software Dependencies	No	The paper mentions software tools such as 'lm-evaluation-harness' and 'mergekit' but does not specify version numbers for these or for other software libraries (e.g., Python, PyTorch).
Experiment Setup	Yes	Additionally, for HM3, Maxiter is 1000 and the discount factor γ is configured to 0.990.
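
The 70/30 partition quoted in the Dataset Splits row can be illustrated with a minimal sketch. The quoted text does not say whether the split is random, seeded, or stratified, so the shuffling, the seed, and the names `split_70_30`, `search_set`, and `holdout_set` are assumptions made for illustration only, not the authors' implementation.

```python
import random


def split_70_30(examples, seed=0):
    """Return (search_set, holdout_set) as a 70/30 partition of `examples`.

    Assumption: a seeded random shuffle is applied before cutting. The
    source only states the 70%/30% proportions, not how items are assigned.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(items) * 7 // 10
    return items[:cut], items[cut:]


# Hypothetical usage on 100 placeholder example IDs.
search_set, holdout_set = split_70_30(range(100))
print(len(search_set), len(holdout_set))
```

Under these assumptions, the first part would serve the evaluations run during the RL search, and the second part would be held back for evaluating the final merged model, matching the roles described in the quoted sentence.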