Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
GRIT: Teaching MLLMs to Think with Images
Authors: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Xinze Guan, Xin Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities. All code, data, and checkpoints will be released. ... We first evaluate the grounded reasoning performance of models trained using the GRIT method in both grounding and reasoning perspectives. ... The results are summarized in Table 1. |
| Researcher Affiliation | Collaboration | 1 UC Santa Cruz, 2 UC Santa Barbara, 3 eBay |
| Pseudocode | No | The paper describes the GRIT method and GRPO-GR algorithm in natural language text and uses diagrams (e.g., Figure 2 for Model update via GRPO-GR), but no structured pseudocode or algorithm blocks are explicitly presented. |
| Open Source Code | No | All code, data, and checkpoints will be released. |
| Open Datasets | Yes | We evaluate models trained with GRIT on curated testing sets derived by sampling from six public datasets: Visual Spatial Reasoning (VSR) [14] focusing on spatial relation verification, TallyQA [15] on object counting, GQA [27] on compositional object spatial questions, MME [28] on diverse visual tasks including counting and position, MathVista-mini [29] on mathematical reasoning in visual contexts, and position subset of OVDEval [30] on open-vocabulary object grounding. |
| Dataset Splits | Yes | Demonstrating the data efficiency of our GRIT method, we train on a dataset of only 20 unique image-query-answer triplets. This small training set is drawn from the Visual-Spatial Reasoning (VSR) [14] and TallyQA [15] datasets. ... We use only the counting, position, and existence subsets to broaden our evaluation scope. MathVista [29] evaluates mathematical reasoning in visual contexts. Following prior works, we adopt its Test Mini split. The statistic for the testing data is shown in Table 2. |
| Hardware Specification | Yes | All training is conducted on 8 NVIDIA A100 (80GB) GPUs with Deepspeed Zero2 and the time for training each model is approximately 12 hours. |
| Software Dependencies | No | All training is conducted on 8 NVIDIA A100 (80GB) GPUs with Deepspeed Zero2 and the time for training each model is approximately 12 hours. The paper mentions "Deepspeed Zero2" but does not provide specific version numbers for it or any other key software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | We train the models for 200 steps with a total batch size of 128. During GRPO-GR training, we generate 4 candidate reasoning traces per input sample during training with a learning rate of 2 × 10e-6. The optimizer for the training is AdamW and a Cosine scheduler is adopted. |
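
The quoted setup can be collected into a small configuration sketch, which may help when checking a re-run against the reported numbers. This is not the authors' code: the names `ReportedSetup` and `cosine_lr` are hypothetical, the schedule assumes plain cosine decay to zero with no warmup (the excerpt names only a "Cosine scheduler"), and the peak learning rate is left as an argument because the extracted value is ambiguous.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup and Hardware Specification rows."""
    train_steps: int = 200
    total_batch_size: int = 128
    candidates_per_sample: int = 4
    num_gpus: int = 8
    optimizer: str = "AdamW"
    scheduler: str = "cosine"

    def samples_per_gpu_per_step(self) -> int:
        # Per-device batch size times gradient-accumulation steps;
        # how the two are split is not reported in the excerpt.
        return self.total_batch_size // self.num_gpus


def cosine_lr(step: int, peak_lr: float, total_steps: int) -> float:
    """Cosine decay from peak_lr to 0 (assumption: no warmup, floor of 0)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


setup = ReportedSetup()
print(setup.samples_per_gpu_per_step())
# Schedule shape as a fraction of the peak rate at a few checkpoints.
print([round(cosine_lr(s, 1.0, setup.train_steps), 3) for s in (0, 50, 100, 150, 200)])
```

Passing a peak rate of 1.0 prints the schedule as a fraction of the peak, so the shape can be inspected without committing to a specific learning-rate value.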