Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Authors: Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding. ... 5 Experiments
Researcher Affiliation | Academia | 1 Harbin Institute of Technology (Shenzhen), 2 Pengcheng Laboratory, 3 Shandong Jianzhu University, 4 Zhongguancun Academy
Pseudocode | No | The paper illustrates a 'SpatialMind prompting strategy' (Figure 1) and 'Scene Decomposition' and 'Question Decomposition' steps (Figure 7), which are structured processes. It also describes reasoning steps in text (Section 3.2) and refers to code files for reasoning steps (Appendix D.2). However, none of these are explicitly labeled as pseudocode or an algorithm block within the paper's main content or appendices.
Open Source Code | Yes | Code files for prompting can be found in the Supplementary Material. The data pipeline, data, and model weights will be publicly available upon paper publication. ... codes/nav_script.py is the script that creates the navigation scan, presenting the exact implementation. ... All reasoning steps for the various question types are included in the codes/reason_steps.py file of Supplementary Material for reference and reproducibility.
Open Datasets | Yes | We develop a scalable dataset generation pipeline to construct ScanForgeQA, a synthetic spatial question-answering dataset that enables VLMs to acquire spatial commonsense through fine-tuning. ... The full ScanForgeQA dataset includes 34,276 single-room scenes, 103K simulated video scans, and 925K question-answering pairs for training. ... All datasets used in our work are commonly used datasets with open access.
Dataset Splits | Yes | The full ScanForgeQA dataset includes 34,276 single-room scenes, 103K simulated video scans, and 925K question-answering pairs for training. ... Note that the statistics for the ScanQA and SQA3D datasets are reported on their respective validation sets.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA H20 GPUs.
Software Dependencies | No | The paper mentions using specific models like 'GPT-4o', 'Qwen2.5-VL-7B', and the 'Unity engine' (for simulation) but does not provide specific version numbers for these or other ancillary software dependencies (e.g., programming languages, libraries like PyTorch, CUDA).
Experiment Setup | Yes | For open-source models, we adopted each model's default parameter settings, including learning rate, number of frames, and input resolution. For closed-source models, GPT-4o processes 16 frames per video, while all Gemini models operate at a fixed sampling rate of 1 frame per second (1 FPS).