Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In a comprehensive evaluation on the SAT benchmark, Spatial Navigator achieves substantial gains on multiple spatial reasoning tasks. Our method improves top-1 accuracy performance across four very different VLM back-ends with two distinct world models by an average 7.7%, with the largest single gain over 10%. We present MindJourney, the first test-time scaling framework that couples a VLM with a controllable video world model, enabling searching through imagined 3D space to improve 3D spatial reasoning without finetuning. We empirically demonstrate that our model offers significant improvement on multiple spatial reasoning tasks. We demonstrate that our method is model-agnostic: it boosts four different VLMs with two distinct world models.
Researcher Affiliation	Collaboration	Yuncong Yang¹, Jiageng Liu¹, Zheyuan Zhang², Siyuan Zhou³, Reuben Tan⁴, Jianwei Yang⁴, Yilun Du⁵, Chuang Gan¹; ¹UMass Amherst, ²JHU, ³HKUST, ⁴Microsoft Research, ⁵Harvard
Pseudocode	Yes	Algorithm 1 Spatial Beam Search for Action Space Exploration
Open Source Code	No	We will release SWM checkpoints, inference scripts, and evaluation notebooks under an MIT license on acceptance, with a ready-made Dockerfile for environment setup.
Open Datasets	Yes	All datasets used (RealEstate10K, DL3DV-10K, Habitat scenes, SAT) are publicly available. The training set for our Search World Model (SWM) comprises three components: HM3D, DL3DV-10K, and RealEstate10K [Szot et al., 2022, Ling et al., 2024, Zhou et al., 2018]. Our main benchmark is the Spatial Aptitude Training (SAT) benchmark.
Dataset Splits	No	SAT is split into SAT-Synthesized, 4000 synthetic questions rendered in AI2-THOR [Kolve et al., 2017] indoor scenes, and SAT-Real, real images spanning indoor and outdoor environments. The SAT-Real split comprises 150 real-image queries spanning indoor and outdoor scenes. Because the synthetic split contains 4 000 questions, we evaluate on a random 500-question subset to keep the o1 runs tractable.
Hardware Specification	Yes	All inference experiments were run on high-performance NVIDIA GPUs: when using Search World Model as the world model, we employed A40 GPUs with 40GB of VRAM; when using Stable-Virtual-Camera as the world model, we ran on H100 GPUs with 80GB of VRAM; and for all experiments combining the InternVL3-14B VLM with the Search World Model, we also used H100 GPUs to accommodate the larger memory footprint of the vision language model.
Software Dependencies	No	The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, etc.). It mentions using `Wan2.2-TI2V-5B` as a backbone and following `ReCamMaster`'s implementation, which refer to models or frameworks rather than general software dependencies with versions.
Experiment Setup	Yes	Unless noted otherwise, we use the same search configuration for every experiment: search depth n = 3 steps; up to k = 3 consecutive repetitions per primitive action during each expansion; exploration and helpfulness thresholds γ_exp = 8, γ_help = 8. Optimization is performed with Adam and a linear warmup schedule to a peak learning rate of 3e-5, using bfloat16 precision for efficiency and clipping gradients to a maximum norm of 1.0 for stability.
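
The values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the class and function names are hypothetical, the number of warmup steps is not given in the row and is left as a parameter, and holding the rate constant after warmup is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    """Search settings as quoted in the Experiment Setup row (names are hypothetical)."""
    depth: int = 3           # search depth n
    max_repeats: int = 3     # up to k consecutive repetitions per primitive action
    gamma_exp: float = 8.0   # exploration threshold
    gamma_help: float = 8.0  # helpfulness threshold


PEAK_LR = 3e-5        # peak learning rate quoted in the row
MAX_GRAD_NORM = 1.0   # gradient clipping norm quoted in the row


def warmup_lr(step: int, warmup_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Linear warmup from 0 to peak_lr over warmup_steps.

    warmup_steps is not stated in the row; staying at peak_lr afterwards
    is an assumption made for this sketch.
    """
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)


def clip_scale(grad_norm: float, max_norm: float = MAX_GRAD_NORM) -> float:
    """Factor to multiply gradients by so their global norm is at most max_norm."""
    if grad_norm <= max_norm:
        return 1.0
    return max_norm / grad_norm
```

For example, with a hypothetical 100-step warmup, `warmup_lr(50, 100)` gives half the peak rate, and `clip_scale(2.0)` scales a gradient of norm 2.0 by one half.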