Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GRIT: Teaching MLLMs to Think with Images

Authors: Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Xinze Guan, Xin Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities. All code, data, and checkpoints will be released. ... We first evaluate the grounded reasoning performance of models trained using the GRIT method in both grounding and reasoning perspectives. ... The results are summarized in Table 1.
Researcher Affiliation | Collaboration | ¹UC Santa Cruz, ²UC Santa Barbara, ³eBay. EMAIL, EMAIL
Pseudocode | No | The paper describes the GRIT method and GRPO-GR algorithm in natural language text and uses diagrams (e.g., Figure 2 for Model update via GRPO-GR), but no structured pseudocode or algorithm blocks are explicitly presented.
Open Source Code | No | All code, data, and checkpoints will be released.
Open Datasets | Yes | We evaluate models trained with GRIT on curated testing sets derived by sampling from six public datasets: Visual Spatial Reasoning (VSR) [14] focusing on spatial relation verification, TallyQA [15] on object counting, GQA [27] on compositional object spatial questions, MME [28] on diverse visual tasks including counting and position, MathVista-mini [29] on mathematical reasoning in visual contexts, and position subset of OVDEval [30] on open-vocabulary object grounding.
Dataset Splits | Yes | Demonstrating the data efficiency of our GRIT method, we train on a dataset of only 20 unique image-query-answer triplets. This small training set is drawn from the Visual-Spatial Reasoning (VSR) [14] and TallyQA [15] datasets. ... We use only the counting, position, and existence subsets to broaden our evaluation scope. MathVista [29] evaluates mathematical reasoning in visual contexts. Following prior works, we adopt its TestMini split. The statistic for the testing data is shown in Table 2.
Hardware Specification | Yes | All training is conducted on 8 NVIDIA A100 (80GB) GPUs with Deepspeed Zero2 and the time for training each model is approximately 12 hours.
Software Dependencies | No | All training is conducted on 8 NVIDIA A100 (80GB) GPUs with Deepspeed Zero2 and the time for training each model is approximately 12 hours. The paper mentions "Deepspeed Zero2" but does not provide specific version numbers for it or any other key software dependencies like programming languages or deep learning frameworks.
Experiment Setup | Yes | We train the models for 200 steps with a total batch size of 128. During GRPO-GR training, we generate 4 candidate reasoning traces per input sample during training with a learning rate of 2 10 e-6. The optimizer for the training is AdamW and a Cosine scheduler is adopted.
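
The quoted setup can be collected into a small configuration sketch. This is not the authors' code, which is not yet released. The class `TrainSetup` and the function `cosine_lr` are hypothetical names introduced here for illustration. The sketch assumes the total batch is split evenly across the 8 GPUs, and that the cosine schedule has no warmup and decays to zero; the report states neither. The learning rate is left as a parameter because its value is not cleanly recoverable from the quoted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Hyperparameters quoted in the Experiment Setup and Hardware rows."""

    total_steps: int = 200          # "200 steps"
    total_batch_size: int = 128     # "total batch size of 128"
    candidates_per_sample: int = 4  # "4 candidate reasoning traces per input sample"
    num_gpus: int = 8               # "8 NVIDIA A100 (80GB) GPUs"
    optimizer: str = "AdamW"
    scheduler: str = "cosine"

    def per_gpu_batch_size(self, grad_accum_steps: int = 1) -> int:
        """Per-device micro-batch size, ASSUMING an even split across GPUs.

        grad_accum_steps is a hypothetical knob; the report does not say
        whether gradient accumulation was used.
        """
        per_step = self.num_gpus * grad_accum_steps
        if self.total_batch_size % per_step != 0:
            raise ValueError("total batch size is not evenly divisible")
        return self.total_batch_size // per_step


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps.

    ASSUMPTION: no warmup and a final learning rate of zero.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

With these assumptions, `TrainSetup().per_gpu_batch_size()` gives 16 samples per GPU per step, and `cosine_lr(100, 200, peak_lr)` is half of `peak_lr` at the midpoint of training. Anyone attempting to reproduce the results would still need the released code to confirm the warmup, accumulation, and learning-rate details.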