Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Authors: Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, Lijuan Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios. |
| Researcher Affiliation | Collaboration | Minheng Ni1,2, Zhengyuan Yang3, Linjie Li3, Chung-Ching Lin3, Kevin Lin3, Wangmeng Zuo2B, Lijuan Wang3B; 1 Hong Kong Polytechnic University; 2 Harbin Institute of Technology; 3 Microsoft |
| Pseudocode | No | The paper describes methods and processes in paragraph form and through mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our model, code, and dataset can be found at this link. |
| Open Datasets | Yes | We curate a 71K-example dataset where every reasoning step is aligned with point-level visual references, enabling supervised format finetuning that teaches the model to think while pointing. ... Our model, code, and dataset can be found at this link. ... We evaluate our approach on six multimodal reasoning benchmark datasets spanning diverse domains: ChartQA (Masry et al., 2022): ... CharXiv (Wang et al., 2024): ... PlotQA (Methani et al., 2020): ... IconQA (Lu et al., 2021): ... TabMWP (Lu et al., 2022b): ... Counting (Li et al., 2023): |
| Dataset Splits | Yes | On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%. ... We use the official test split for evaluation. ... We use the official validation split for evaluation. ... We randomly sampled 2,000 examples as the test split. ... We use the official 1,000-sample mini-test split. ... we sampled 200 examples for evaluation. |
| Hardware Specification | Yes | We implement our two-stage training pipeline using PyTorch with 8 A100 GPUs based on Easy-R1 (Zheng et al., 2025). |
| Software Dependencies | No | The paper mentions PyTorch and Easy-R1, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All models use AdamW optimizer with 5 × 10⁻⁵ learning rate, and 512 batch size for SFT and 56 batch size for RL. Training converges in 500 steps for SFT and 100 RL steps with β = 0.00. We use soft matching for numeric answers (tolerating 5% relative error) and exact matching for other responses in GRPO. |
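
The answer-matching rule quoted in the Experiment Setup row (soft matching for numeric answers within 5% relative error, exact matching otherwise) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `answer_reward`, the number-parsing rules (stripping thousands separators and a trailing percent sign), the handling of a zero target, and the whitespace-only normalization for exact matching are all assumptions made for illustration.

```python
import math
from typing import Optional


def _to_number(text: str) -> Optional[float]:
    """Parse a numeric answer; return None if the text is not a finite number.

    Assumption: thousands separators and a trailing '%' are ignored.
    """
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def answer_reward(prediction: str, target: str, rel_tol: float = 0.05) -> float:
    """Return 1.0 for a matching answer and 0.0 otherwise (hypothetical helper).

    Numeric answers match when the relative error is within rel_tol
    (5% by default, as stated in the table); other answers must match exactly.
    """
    pred_num = _to_number(prediction)
    target_num = _to_number(target)

    if pred_num is not None and target_num is not None:
        if target_num == 0.0:
            # Relative error is undefined for a zero target; assume exact equality.
            return 1.0 if pred_num == 0.0 else 0.0
        rel_error = abs(pred_num - target_num) / abs(target_num)
        return 1.0 if rel_error <= rel_tol else 0.0

    # Non-numeric answers: exact string match after trimming whitespace (assumption).
    return 1.0 if prediction.strip() == target.strip() else 0.0
```

For example, under these assumptions `answer_reward("104", "100")` is accepted because the relative error is 4%, while `answer_reward("106", "100")` is rejected. How the paper normalizes strings or treats units before matching is not stated in the table, so those choices may differ from the original pipeline.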