Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Authors: Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, Lijuan Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.
Researcher Affiliation	Collaboration	Minheng Ni1,2 , Zhengyuan Yang3 , Linjie Li3, Chung-Ching Lin3, Kevin Lin3, Wangmeng Zuo2B, Lijuan Wang3B 1Hong Kong Polytechnic University 2Harbin Institute of Technology 3Microsoft
Pseudocode	No	The paper describes methods and processes in paragraph form and through mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our model, code, and dataset can be found at this link.
Open Datasets	Yes	We curate a 71K-example dataset where every reasoning step is aligned with point-level visual references, enabling supervised format finetuning that teaches the model to think while pointing. ... Our model, code, and dataset can be found at this link. ... We evaluate our approach on six multimodal reasoning benchmark datasets spanning diverse domains: ChartQA (Masry et al., 2022): ... CharXiv (Wang et al., 2024): ... PlotQA (Methani et al., 2020): ... IconQA (Lu et al., 2021): ... TabMWP (Lu et al., 2022b): ... Counting (Li et al., 2023):
Dataset Splits	Yes	On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%. ... We use the official test split for evaluation. ... We use the official validation split for evaluation. ... We randomly sampled 2,000 examples as the test split. ... We use the official 1,000-sample mini-test split. ... we sampled 200 examples for evaluation.
Hardware Specification	Yes	We implement our two-stage training pipeline using PyTorch with 8 A100 GPUs based on Easy-R1 (Zheng et al., 2025).
Software Dependencies	No	The paper mentions PyTorch and Easy-R1, but does not provide specific version numbers for these software components.
Experiment Setup	Yes	All models use AdamW optimizer with 5 × 10^-5 learning rate, and 512 batch size for SFT and 56 batch size for RL. Training converges in 500 steps for SFT and 100 RL steps with β = 0.00. We use soft matching for numeric answers (tolerating 5% relative error) and exact matching for other responses in GRPO.
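
The answer-matching rule quoted in the Experiment Setup row (soft matching for numeric answers within 5% relative error, exact matching otherwise) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `answers_match` and `match_reward`, the number-parsing rules (stripping commas and a trailing percent sign), the handling of a zero reference, and the binary 1.0/0.0 reward are all assumptions made for illustration.

```python
import math
from typing import Optional


def _to_number(text: str) -> Optional[float]:
    """Hypothetical parser: read a finite float, ignoring commas and a trailing '%'."""
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    # Treat 'nan' and 'inf' as non-numeric so they fall back to exact matching.
    return value if math.isfinite(value) else None


def answers_match(prediction: str, reference: str, rel_tol: float = 0.05) -> bool:
    """Soft match for numeric answers (relative error <= rel_tol), exact match otherwise."""
    pred_num = _to_number(prediction)
    ref_num = _to_number(reference)
    if pred_num is not None and ref_num is not None:
        if ref_num == 0.0:
            # Relative error is undefined for a zero reference (assumption: require equality).
            return pred_num == 0.0
        return abs(pred_num - ref_num) <= rel_tol * abs(ref_num)
    # Non-numeric answers: exact string match after trimming whitespace.
    return prediction.strip() == reference.strip()


def match_reward(prediction: str, reference: str) -> float:
    """Assumed binary reward built on the matching rule; the paper's reward may differ."""
    return 1.0 if answers_match(prediction, reference) else 0.0
```

For example, `answers_match("104", "100")` accepts the prediction because the relative error is 4%, while `answers_match("106", "100")` rejects it, and a non-numeric pair such as `"paris"` and `"Paris"` fails because exact matching is case-sensitive in this sketch.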