Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We systematically evaluate ThinkLite-VL on several commonly used multimodal benchmark datasets and perform comprehensive comparisons with existing reasoning models. Through these experiments, we demonstrate the effectiveness and advantages of our model in multimodal reasoning tasks.
Researcher Affiliation	Collaboration	Xiyao Wang (1,2), Zhengyuan Yang (2), Chao Feng (3,4), Hongjin Lu (1), Linjie Li (2), Chung-Ching Lin (2), Kevin Lin (2), Furong Huang (1), Lijuan Wang (2). Affiliations: (1) University of Maryland, College Park; (2) Microsoft; (3) University of Michigan; (4) Cornell University. EMAIL. Equal advise.
Pseudocode	No	The paper describes the MCTS procedure through detailed textual explanations and formulas (e.g., selection formula s_{t+1} = arg max_{s_t} N(s_t) / (1 + N(s_{t+1}))), but it does not present a clearly labeled pseudocode block or algorithm box. The steps are integrated into the main text.
Open Source Code	Yes	Our code, data, and model are available at https://github.com/si0wang/ThinkLite-VL.
Open Datasets	Yes	We collect a total of 70k data samples from widely used open-source training datasets as our initial training set, covering three categories: multimodal mathematical reasoning (Geometry3K [42], GeoQA [6], Geos [60]), natural image understanding (FigureQA [25], ScienceQA [43], OK-VQA [48]), and chart understanding (IconQA [45], TabMWP [44]).
Dataset Splits	Yes	Ultimately, we select all samples with K greater than 5, as well as those that remained unsolved after 50 iterations, resulting in a final training set of 11k samples with the 7B model and 7.5k samples with the 72B model. The data difficulty distribution of the 11k training set of the 7B model is shown in Figure 4 as an example. We select eight widely used VLM benchmarks for evaluation, namely MathVista [41], MathVision [69], MathVerse [96], MMMU [93], MMStar [8], MMBench [40], MMVet [91], and AI2D [26].
Hardware Specification	Yes	For all models, we use 8 80G A100 GPUs for model training and evaluation.
Software Dependencies	No	The paper mentions using "Easy-R1 [101] code base" and specific Qwen2.5-VL models, but does not provide version numbers for general ancillary software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	We conduct training using the Easy-R1 [101] code base and set the GRPO rollout number to 32. Our pipeline begins with 70k open-source samples ... For each example, we simulate an MCTS-based inference trace using the base VLM, and rank samples by the number of reasoning steps required to reach a correct solution. From this pool, we extract two difficulty-filtered subsets: 11k samples for Qwen2.5-VL-7B-Instruct and 7.5k samples for Qwen2.5-VL-72B-Instruct. The diversity among these actions is regulated by the temperature parameter, which is set to 0.5 in our experiments, with k configured as 3. Ultimately, we select all samples with K greater than 5, as well as those that remained unsolved after 50 iterations. We provide the training prompt template during RFT in Appendix A Table 8.
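
The selection rule quoted under Dataset Splits and Experiment Setup (keep samples with K greater than 5, plus those still unsolved after 50 iterations) can be illustrated with a short sketch. This is not the authors' code. The names SampleTrace, is_selected, and select_hard_samples are hypothetical, and reading K as the number of MCTS iterations the base model needed to reach a correct answer is an assumption drawn from the quoted text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the quoted setup.
MAX_ITERATIONS = 50  # search budget per sample
MIN_ITERATIONS = 5   # samples solved with K > 5 are kept


@dataclass
class SampleTrace:
    """Hypothetical record of one MCTS run on one training sample."""
    sample_id: str
    # Assumed meaning of K: iterations needed to reach a correct answer.
    # None marks a sample still unsolved after MAX_ITERATIONS.
    iterations_to_solve: Optional[int]


def is_selected(trace: SampleTrace) -> bool:
    """Keep a sample if it is unsolved or needed more than MIN_ITERATIONS."""
    if trace.iterations_to_solve is None:
        return True
    return trace.iterations_to_solve > MIN_ITERATIONS


def select_hard_samples(traces: List[SampleTrace]) -> List[SampleTrace]:
    """Filter a pool of traces down to the difficulty-selected subset."""
    return [t for t in traces if is_selected(t)]


if __name__ == "__main__":
    pool = [
        SampleTrace("easy", 2),
        SampleTrace("boundary", 5),
        SampleTrace("hard", 6),
        SampleTrace("unsolved", None),
    ]
    print([t.sample_id for t in select_hard_samples(pool)])
```

With the four illustrative traces above, only "hard" and "unsolved" pass the filter; a sample solved in exactly 5 iterations is dropped because the quoted rule is a strict "greater than 5".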