Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
Authors: Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. |
| Researcher Affiliation | Collaboration | 1 Northeastern University 2 Microsoft Research 3 University of Southern California 4 University of California, Santa Cruz |
| Pseudocode | Yes | Algorithm 1 outlines the core procedure for constructing the Struct2D prompt. Given an input video V, depth frames D, a reconstructed 3D scene P, and a set of target objects O, we begin by rendering a BEV image v and projecting each object o_i ∈ O into the view using the RGB camera parameters C_rgb. |
| Open Source Code | Yes | https://github.com/neu-vi/struct2d |
| Open Datasets | Yes | Struct2D-Set consists of 200K QA pairs generated from over 6K richly annotated indoor scenes, sourced from large-scale 3D reconstruction datasets ARKitScenes [4], ScanNet [19], and ScanNet++ [88]. |
| Dataset Splits | Yes | We construct a subset of 422 QA pairs for evaluation, selected due to API call budgets. Table 4: 3D Question Answering Evaluation on ScanQA [3] and SQA3D [53] datasets. Methods ScanQA (val) SQA3D (val) |
| Hardware Specification | Yes | Training with the whole Struct2D-Set takes approximately 8 hours on 8 H200 GPUs. |
| Software Dependencies | No | We adopt Qwen2.5VL [72] as our base MLLM for instruction tuning. During training, the model receives BEV images with filtered object marks and object-centric metadata as core inputs. ... At evaluation time, we follow standard practices from prior work [31, 63], reconstructing point clouds offline using BundleFusion [18], detecting 3D objects using Mask3D [66] and UniDet [37], and projecting the results to produce BEV images and 2D object marks. For object-level grounding, we apply a rule-based method to identify the relevant objects mentioned in each question. |
| Experiment Setup | Yes | All visual inputs are resized to 480 × 480, and object marks are adaptively scaled based on their original resolution. ... The model is trained for one epoch using a base learning rate of 2e-6 with cosine annealing, taking approximately 8 hours on 8 H200 GPUs. ... The BEV images are resized to 640 × 640. Keyframes are resized to 256 × 246 and stitched into compact 1 × 2 or 2 × 4 grids, enabling efficient batch loading and reducing GPU memory consumption. |
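
The learning-rate setting quoted in the Experiment Setup row (base rate of 2e-6 with cosine annealing) can be illustrated with a small schedule function. This is a minimal sketch, not the authors' training code. The function name `cosine_annealed_lr`, the total step count, the zero floor `min_lr`, and the absence of warmup are all assumptions, since the quoted text does not state them.

```python
import math

BASE_LR = 2e-6  # base learning rate quoted in the Experiment Setup row


def cosine_annealed_lr(step: int, total_steps: int,
                       base_lr: float = BASE_LR, min_lr: float = 0.0) -> float:
    """Return the learning rate at `step` under plain cosine annealing.

    Assumptions (not stated in the source): no warmup, and decay from
    `base_lr` to `min_lr` over exactly `total_steps` optimizer steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the end values.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of steps in the single training epoch
    for s in (0, 250, 500, 750, 1000):
        print(s, cosine_annealed_lr(s, total))
```

Under these assumptions the rate starts at the base value, passes through half of it at the midpoint of the epoch, and reaches the floor at the final step. A real run may differ if the training framework adds warmup or a non-zero minimum rate.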