Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?

Authors: Yijie Hu, Zihao Zhou, Kaizhu Huang, Xiaowei Huang, Qiufeng Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities. We demonstrate the effectiveness of our method by merging MLLMs (e.g., the LLaVA series and Qwen series) and different math reasoning LLMs. We validate the performance of our method on MathVista, MathVerse, DynaMath and MathVision for evaluating math reasoning abilities. We further show our method does not interfere with the model's other abilities by evaluating our method on general knowledge datasets, i.e., MMMU [50], TextVQA [31] and MMBench [22].
Researcher Affiliation	Academia	Yijie Hu (1,2), Zihao Zhou (1,2), Kaizhu Huang (3), Xiaowei Huang (2), Qiufeng Wang (1). (1) Xi'an Jiaotong-Liverpool University; (2) University of Liverpool; (3) Duke Kunshan University
Pseudocode	Yes	Algorithm 1 IP-Merging
Open Source Code	Yes	Code Repository: https://github.com/tambourine666/MergeVLM
Open Datasets	Yes	We test our models on the math reasoning benchmarks MathVista [23], MathVerse [53], DynaMath [59] and MathVision [36], and on the general QA benchmarks MMMU [50], TextVQA [31] and MMBench [22].
Dataset Splits	Yes	We test our models on six benchmarks, i.e., MathVista [23], MathVerse [53], DynaMath (DM) [59], MathVision [36] and three general QA benchmarks MMMU [50], TextVQA [31] and MMBench [22]. MathVista... For evaluation, following [23, 30], we first employ GPT-4 to extract the predicted choices or answers from responses, then report the answer accuracy... MMMU includes 900 evaluation samples and covers six core disciplines...
Hardware Specification	Yes	We use RTX 3090 GPUs for all of our experiments. We conduct this experiment of merging 7B MLLM and one math reasoning LLM on a GPU server with 8-card Nvidia RTX 3090.
Software Dependencies	No	The paper mentions using the LLaVA series, Qwen2 series, and InternVL3 series as base models, and fine-tuned Math LLMs such as the ToRA series models, MetaMath models, and Qwen2-Math models. It also states that GPT-4 is used for extracting predicted choices or answers from responses during evaluation. However, it does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Task Arithmetic [15] involves the scaling coefficients for merged task vectors, which are chosen from [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]. Ties Merging [41] involves the scaling coefficient and the ratio of large parameters to retain; the scaling coefficients are chosen from [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], and the ratio of parameters with the largest-magnitude values to retain from [0.1, 0.2, 0.3]. EMR Merging [14] does not involve specific hyperparameters. IP-Merging involves the similarity threshold to determine whether the layer should be selected for merging.
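
The baseline search grids quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of generic task arithmetic on flat lists of floats, not the authors' IP-Merging code or the contents of their repository. The function names (`task_vector`, `task_arithmetic_merge`, `ties_grid`) and the toy weights are hypothetical, and the assumption that the target and fine-tuned models share a common base is made here only for illustration.

```python
from itertools import product

# Grids quoted in the Experiment Setup row.
SCALING_COEFFICIENTS = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 ... 0.9
TIES_RETAIN_RATIOS = [0.1, 0.2, 0.3]


def task_vector(finetuned, base):
    """Element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic_merge(target, base, finetuned, coeff):
    """Add a scaled task vector to the target weights (generic task arithmetic).

    Assumption: target, base and finetuned are flat, equally sized lists of floats.
    """
    delta = task_vector(finetuned, base)
    return [t + coeff * d for t, d in zip(target, delta)]


def ties_grid():
    """All (scaling coefficient, retain ratio) pairs implied by the two grids."""
    return list(product(SCALING_COEFFICIENTS, TIES_RETAIN_RATIOS))


if __name__ == "__main__":
    # Toy weights, purely illustrative.
    merged = task_arithmetic_merge([1.0, 2.0], [0.0, 0.0], [2.0, 4.0], coeff=0.5)
    print(merged)
    print(len(SCALING_COEFFICIENTS), len(ties_grid()))
```

Under these assumptions, Task Arithmetic is searched over 9 settings and Ties Merging over 9 × 3 = 27 combinations. The row gives no value for the IP-Merging similarity threshold, so none is sketched here.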