Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Authors: Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James M. Rehg, Ismini Lourentzou

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-BENCH, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
Researcher Affiliation | Collaboration | ¹University of Illinois Urbana-Champaign, ²University of Pennsylvania, ³Shanghai Jiao Tong University, ⁴Google
Pseudocode | No | The reasoning process is defined as a sequence of reasoning states: $S = \{s_0, \ldots, s_t, \ldots, s_T\}$, where $s_t \in S$ represents a partial reasoning state, $s_0$ is the initial state derived from T, and $s_T$ is a terminal state corresponding to a fully reasoned path. M3CTS operates through four key stages: Expand, Simulate, Backprop, and Select.
Open Source Code | No | We will open source our code at a later date.
Open Datasets | Yes | For SFT, we convert samples from the OPEN SPATIAL dataset [13] to reasoning chains using the M3CTS pipeline. ... We curate the OPEN SPATIAL REASONING dataset, a collection of 400K Vision Question Answering (VQA) preference pairs $(y_p, y_l)$, to support training of preference-based spatial reasoning models. This dataset is derived from the OPEN SPATIAL dataset [13]...
Dataset Splits | No | For SFT, we convert samples from the OPEN SPATIAL dataset [13] to reasoning chains using the M3CTS pipeline. While the original OPEN SPATIAL dataset provides single-sentence answers, we transform 400K samples, grounded in distinct images, into structured Long CoT reasoning chains, where examples are used to teach the model to generate high-quality, step-by-step spatial reasoning responses. For Direct Preference Optimization (DPO) training, we similarly use AdamW with learning rate of $1 \times 10^{-7}$, weight decay of 0.05, and a 5% warm-up, training with a batch size of 1 per device.
Hardware Specification | Yes | We train the 8B-parameter model in two stages on two NVIDIA H100 GPUs, each stage taking approximately 2.5 days. We gratefully acknowledge the cloud TPU credits from the Google TPU Research Cloud (TRC) program and the Google Tunix (Tune-in-JAX) team for their feedback.
Software Dependencies | No | For supervised fine-tuning, we employ AdamW optimizer with a learning rate of $4 \times 10^{-5}$, weight decay of 0.05, and a 5% linear warm-up schedule, using a batch size of 2 per device with gradient accumulation over 4 steps. For Direct Preference Optimization, we similarly use AdamW with learning rate of $1 \times 10^{-7}$, weight decay of 0.05, and a 5% warm-up, training with a batch size of 1 per device.
Experiment Setup | Yes | For supervised fine-tuning, we employ AdamW optimizer with a learning rate of $4 \times 10^{-5}$, weight decay of 0.05, and a 5% linear warm-up schedule, using a batch size of 2 per device with gradient accumulation over 4 steps. For Direct Preference Optimization, we similarly use AdamW with learning rate of $1 \times 10^{-7}$, weight decay of 0.05, and a 5% warm-up, training with a batch size of 1 per device.
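
The hyperparameters quoted in the Experiment Setup and Hardware Specification rows can be collected into a small configuration sketch. It is an illustration only, since the authors' code is not released. The names StageConfig, effective_batch_size, and lr_at_step are hypothetical. Assumptions beyond the quoted text: the DPO stage uses no gradient accumulation (the source does not state a value), the DPO warm-up is linear like the SFT one, and the learning rate stays constant after warm-up (no decay schedule is quoted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage, as quoted in the table."""
    name: str
    peak_lr: float
    weight_decay: float
    warmup_fraction: float
    per_device_batch: int
    grad_accum_steps: int


# Values taken from the Experiment Setup row.
SFT = StageConfig("sft", 4e-5, 0.05, 0.05, per_device_batch=2, grad_accum_steps=4)
# ASSUMPTION: grad_accum_steps=1 for DPO; the source gives no value.
DPO = StageConfig("dpo", 1e-7, 0.05, 0.05, per_device_batch=1, grad_accum_steps=1)

NUM_GPUS = 2  # "two NVIDIA H100 GPUs" in the Hardware Specification row


def effective_batch_size(cfg: StageConfig, num_devices: int = NUM_GPUS) -> int:
    """Samples contributing to one optimizer update across all devices."""
    return cfg.per_device_batch * cfg.grad_accum_steps * num_devices


def lr_at_step(cfg: StageConfig, step: int, total_steps: int) -> float:
    """Linear warm-up to peak_lr over the first warmup_fraction of steps.

    ASSUMPTION: the rate is held constant after warm-up.
    total_steps is a hypothetical input; the source does not report it.
    """
    warmup_steps = max(1, round(cfg.warmup_fraction * total_steps))
    if step < warmup_steps:
        return cfg.peak_lr * (step + 1) / warmup_steps
    return cfg.peak_lr


if __name__ == "__main__":
    for cfg in (SFT, DPO):
        print(cfg.name, "effective batch size:", effective_batch_size(cfg))
```

Under these assumptions the SFT stage updates on 2 × 4 × 2 = 16 samples per optimizer step. The DPO figure depends on the unstated accumulation setting, so it should be read as a lower bound rather than a reported value.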