Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. We conduct extensive experiments by merging commonly used VLMs with LLMs trained on diverse math reasoning datasets (4). Our findings demonstrate that model merging consistently improves the reasoning capabilities of VLMs across all math benchmarks, with minimal impact on the perception-dominant tasks. We evaluate the performance on a series of VLM benchmarks. |
| Researcher Affiliation | Academia | 1City University of Hong Kong 2Hong Kong University of Science and Technology 3National University of Singapore 4Northwestern University. |
| Pseudocode | No | The paper describes methods and equations but does not contain any clearly labeled pseudocode or algorithm blocks. For example, equation (1) defines `τ_task = θ_ft − θ_base`, equation (2) defines `τ_vlm = θ_vlm − θ_base`, and equation (3) defines `θ′_vlm = θ_base + λτ_vlm + (1 − λ)τ_reason`, but these are mathematical expressions, not pseudocode. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/shiqichen17/VLM_Merging. |
| Open Datasets | Yes | We evaluate the performance on a series of VLM benchmarks. We apply five benchmarks: MathVista (Lu et al., 2024), MathVerse (Zhang et al., 2025), MathVision (Wang et al., 2024a), DynaMath (Zou et al., 2024) and MMStar (Chen et al., 2024a). |
| Dataset Splits | Yes | We evaluate the performance on a series of VLM benchmarks. We apply five benchmarks: MathVista (Lu et al., 2024), MathVerse (Zhang et al., 2025), MathVision (Wang et al., 2024a), DynaMath (Zou et al., 2024) and MMStar (Chen et al., 2024a). Among these benchmarks, MathVista is a diverse benchmark that includes both math-related reasoning tasks and general visual question answering tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | Hyperparameters: In our main analysis and experimental sections, we employ a linear merging strategy for all task vectors under the same hyperparameter settings to ensure a fair comparison. This approach assigns a weight of 0.9 to the textual component of LLaVA-Next-LLaMA3-8B and 0.1 to the reasoning task vector, i.e., λ = 0.9. This parameter is tuned on MathVista based on Dart-Prop (Tong et al., 2024). We choose the best value from the range (0.8, 0.85, 0.9). |
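The linear merging described by equations (1)–(3) can be sketched as below. This is a minimal illustration over plain parameter dictionaries, not the authors' released implementation (see their repository for that); the function names and the toy tensors are hypothetical, and real checkpoints would use framework state dicts instead of floats.

```python
def task_vector(theta_ft, theta_base):
    """Eq. (1)/(2): tau = theta_ft - theta_base, per parameter."""
    return {k: theta_ft[k] - theta_base[k] for k in theta_base}

def linear_merge(theta_base, tau_vlm, tau_reason, lam=0.9):
    """Eq. (3): theta'_vlm = theta_base + lam*tau_vlm + (1-lam)*tau_reason.

    lam weights the VLM's own (textual) task vector; (1 - lam)
    weights the reasoning task vector from the merged-in LLM.
    """
    return {
        k: theta_base[k] + lam * tau_vlm[k] + (1 - lam) * tau_reason[k]
        for k in theta_base
    }

# Toy example with scalar "parameters" (illustrative values only).
theta_base = {"w": 1.0}
tau_vlm = task_vector({"w": 3.0}, theta_base)       # {"w": 2.0}
tau_reason = task_vector({"w": 0.0}, theta_base)    # {"w": -1.0}
merged = linear_merge(theta_base, tau_vlm, tau_reason, lam=0.9)
# merged["w"] = 1.0 + 0.9*2.0 + 0.1*(-1.0) = 2.7
```

With λ = 0.9, as in the paper's main setup, the merged model stays close to the original VLM weights while mixing in a small fraction of the reasoning task vector.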