Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Multi-step Visual Reasoning with Visual Tokens Scaling and Verification

Authors: Tianyi Bai, Zengjie Hu, Fupeng Sun, Qiu Jiantao, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs. ... Comprehensive evaluation and state-of-the-art results. Our experiments span a variety of vision-language tasks demanding multi-step reasoning. Across these scenarios, our approach significantly outperforms strong baselines, including models augmented with limited tool use and those employing chain-of-thought prompting. ... Finally, in Section 5.3, we conduct ablation studies to analyze the individual contributions of visual token scaling and verifier integration.
Researcher Affiliation Academia (1) The Hong Kong University of Science and Technology, (2) Peking University, (3) Shanghai Artificial Intelligence Laboratory, (4) Imperial College London EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 characterizes the full training and test procedure of our method. ... Algorithm 1 Visual Reasoning with Visual Token Scaling and Verification
Open Source Code Yes Code and datasets are publicly released at https://vts-v.github.io/. ... Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data are in supplementary materials.
Open Datasets Yes Code and datasets are publicly released at https://vts-v.github.io/. ... To address this, we construct a dedicated dataset by building upon the single-image dataset of the LLaVA-OneVision dataset (LLaVA-OV) [14], which contains 3.2M vision-language examples covering a broad range of tasks.
Dataset Splits No To address this, we construct a dedicated dataset by building upon the single-image dataset of the LLaVA-OneVision dataset (LLaVA-OV) [14], which contains 3.2M vision-language examples covering a broad range of tasks. ... The resulting supervised dataset is defined as: D_SFT = {(s_1, τ) | t_{H_τ+1} = t, llm_as_a_judge(τ) = correct}. Each trajectory in D_SFT is tool-grounded and preserves all intermediate reasoning states, providing rich supervision for training. After processing, the dataset contains approximately 315K high-quality examples. ... After filtering, the final dataset D_DPO comprises 301K preference pairs for training the verifier to assess and guide visual token scaling quality.
Hardware Specification Yes Throughput on 80GB A800 GPUs (float16):
Software Dependencies No All SFT experiments are conducted using LLaMA Factory under unified settings, and DPO training is carried out with TRL.
Experiment Setup Yes Table 4: Hyperparameters for training Qwen2-VL-7B-Instruct & Qwen2.5-VL-7B-Instruct & LLaMA-3.2-11B-Vision-Instruct models. Hyperparameter: Value. LoRA Rank: 8; LoRA α: 16; LoRA Dropout: 0; LoRA Target: all; GPU: 8 NVIDIA A800; Per Device Train Batch Size: 1; Gradient Accumulation Steps: 1; Warmup Ratio: 0.03; Learning Rate: 3e-5; Learning Rate Scheduler: Cosine; Unfreeze Vision Tower: False; Number Train Epoch: 1; Max Gradient Norm: 1.0; bf16: True; Cut Off Length: 65536.
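
The hyperparameters quoted in the Experiment Setup row imply a few derived quantities that may help when trying to reproduce the run. The following is a minimal sketch, not the authors' configuration file: the class name `LoraSftConfig`, its field names, and the helper `steps_per_epoch` are hypothetical, and the code assumes plain data-parallel training (one model replica per GPU) and the standard LoRA scaling factor α / rank.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSftConfig:
    """Values copied from the quoted Table 4; field names are illustrative."""
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.0
    num_gpus: int = 8
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 1
    warmup_ratio: float = 0.03
    learning_rate: float = 3e-5
    num_train_epochs: int = 1
    max_grad_norm: float = 1.0
    cutoff_len: int = 65536

    @property
    def effective_batch_size(self) -> int:
        # Assumption: data parallelism, so each optimizer step consumes
        # one micro-batch per GPU per accumulation step.
        return (
            self.num_gpus
            * self.per_device_train_batch_size
            * self.gradient_accumulation_steps
        )

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def steps_per_epoch(self, num_examples: int) -> int:
        # Optimizer steps needed to see every example once (last batch may be partial).
        return math.ceil(num_examples / self.effective_batch_size)


config = LoraSftConfig()
print("effective batch size:", config.effective_batch_size)
print("LoRA scaling:", config.lora_scaling)
```

Under these assumptions the effective batch size is 8 sequences per optimizer step and the LoRA update is scaled by 2.0. Calling `config.steps_per_epoch(n)` with a dataset size `n` gives a rough step count for one epoch; whether the quoted dataset sizes map directly onto these settings is not stated in the excerpt.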