Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Video Perception Models for 3D Scene Synthesis

Authors: Rui Huang, Guangyao Zhai, Zuria Bauer, Marc Pollefeys, Federico Tombari, Leonidas Guibas, Gao Huang, Francis Engelmann

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that VIPSCENE significantly outperforms existing methods and generalizes well across diverse scenarios.
Researcher Affiliation | Collaboration | Rui Huang (1), Guangyao Zhai (2, 4), Zuria Bauer (3), Marc Pollefeys (3, 5), Federico Tombari (2), Leonidas Guibas (6), Gao Huang (1), Francis Engelmann (6). (1) Tsinghua University, (2) Technical University of Munich, (3) ETH Zurich, (4) Munich Center for Machine Learning, (5) Microsoft, (6) Stanford University
Pseudocode | Yes | Algorithm 1: VIPSCENE: Prompt-to-Scene Generation
    Input: Prompt π (text / image), video generator G, reconstructor R (Fast3R), monocular depth D (UniDepth), detector S (Grounded-SAM), tracker T (MASt3R), assets A, weights λ_o, λ_b
    Output: Final metrically scaled collision-free scene S = {o_i}_{i=1}^N, where each object o_i = (c_i, s_i, l_i, θ_i) is represented by its category c_i ∈ C, size s_i ∈ ℝ^3, position l_i ∈ ℝ^3, and orientation θ_i ∈ ℝ around the gravity axis.
    (A) Prompt → Video frames
        V ← G(π)
        {I_t}_{t=1}^T ← Sample frames from V with fps = 2
    (B) Video → Metric 3D reconstruction
        R ← R({I_t}_{t=1}^T)    // Globally consistent 3D (unposed inputs)
        {D_t}_{t=1}^T ← D({I_t}_{t=1}^T)    // Metric depths
        R ← Rescale R with metric {D_t}_{t=1}^T    // Enforce metric scale
    (C) Scene decomposition & object extraction
        for t = 1 to T do
            {(c, M_t^(k))}_k ← S(I_t)    // Per-frame 2D instance masks with categories
            {M_t^(k)}_k ← AdaptiveErode({M_t^(k)}_k)    // Size-aware morphological denoising
        {{M_t^i}_{t=1}^T}_{i=1}^N ← T({I_t, {M_t^(k)}_k}_{t=1}^T)    // Temporal association / IDs
        for i = 1 to N do
            P_i ← Segment points from R by {M_t^i}_{t=1}^T    // Per-object point cloud
            c_i ← MajorityLabel({M_t^i}_{t=1}^T)    // Object category from detections
    (D) 3D asset retrieval & alignment
        for i = 1 to N do
            (s_i, l_i^init, θ_i^init) ← PCAInit(P_i)
            C_i ← Retrieve candidates from A according to c_i
            best ← ∅, rmse_min ← +∞
            foreach Q ∈ C_i do
                foreach θ ∈ {θ_i^init, θ_i^init + π} do
                    (R*, t*), rmse ← ICPAlign(P_i, Q; l_i^init, θ)    // Eq. (1), R ∈ SO(3)
                    if rmse < rmse_min then
                        rmse_min ← rmse
                        best ← (Q, R*, t*, θ)
            (Q_i, R_i, t_i, θ_i) ← best, l_i ← t_i
    (E) Final scene refinement
        l_i^orig ← l_i ∀i    // l_i denotes position variables
        repeat    // Gradient-based optimization
            L_p = Σ_{i=1}^N ||l_i − l_i^orig||_2^2
            L_o = Σ_{i≠j} Area(BBox_i(l_i, s_i) ∩ BBox_j(l_j, s_j))
            L_b = Σ_{i=1}^N Area(BBox_i(l_i, s_i) \ Room)
            L_total ← L_p + λ_o L_o + λ_b L_b
            {l_i} ← Update({l_i}, η ∇_{l_i} L_total)
            if Overlap({BBox_i}) = 0 and L_total < ε then
                break
        until converged
        return S = {(c_i, s_i, l_i, θ_i)}_{i=1}^N
Open Source Code | No | Answer: [Yes] Justification: Code will be released.
Open Datasets | Yes | Holodeck [69], we retrieve 3D models from a high-quality subset of Objaverse [7] to ensure realistic and diverse object representations in the scene.
Dataset Splits | No | For comparison with baselines, we follow prior work [69] and evaluate on four types of scenes: living room, bedroom, bathroom, and kitchen. We ask GPT-4o [18] to produce 25 text prompts for each room type. Each prompt consists of a description of a room type and the desired items. Based on these prompts, we generate 100 rooms using each method under evaluation.
Hardware Specification | Yes | Among all stages, video generation is the most computationally expensive, requiring around 380s per video and 74GB GPU memory on an H100.
Software Dependencies | No | We utilize Cosmos [33] for video generation and adopt Fast3R [64] for 3D reconstruction. For open-vocabulary segmentation, we employ Grounded-SAM [41], and UniDepth [36] is applied for monocular depth estimation.
Experiment Setup | Yes | We set λo = λb = 10.
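
Illustrative sketch (not the authors' code): the final scene refinement stage (E) of Algorithm 1, quoted in the Pseudocode row, minimizes a position term, a pairwise overlap term, and a room-boundary term, with the weights from the Experiment Setup row (λo = λb = 10). The Python below is a minimal sketch of that loss under several assumptions that are not in the source: boxes are 2D and axis-aligned (orientation is ignored), the room is a rectangle, gradients come from central finite differences rather than whatever optimizer the paper uses, and the names `inter_area`, `loss_terms`, `refine` and the values of `lr`, `eps`, `tol`, and `max_steps` are hypothetical.

```python
import numpy as np

def inter_area(c_a, s_a, c_b, s_b):
    """Intersection area of two axis-aligned boxes (center c, size s)."""
    lo = np.maximum(c_a - s_a / 2, c_b - s_b / 2)
    hi = np.minimum(c_a + s_a / 2, c_b + s_b / 2)
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def loss_terms(pos, pos_orig, sizes, room_c, room_s):
    """Return (L_p, L_o, L_b) for the current object positions."""
    n = len(pos)
    # L_p: squared displacement from the initial placement.
    l_p = float(np.sum((pos - pos_orig) ** 2))
    # L_o: overlap area summed over ordered pairs i != j.
    l_o = sum(inter_area(pos[i], sizes[i], pos[j], sizes[j])
              for i in range(n) for j in range(n) if i != j)
    # L_b: box area outside the room = box area minus the part inside it.
    l_b = sum(float(np.prod(sizes[i]))
              - inter_area(pos[i], sizes[i], room_c, room_s)
              for i in range(n))
    return l_p, l_o, l_b

def total_loss(pos, pos_orig, sizes, room_c, room_s, lam_o=10.0, lam_b=10.0):
    """L_total = L_p + lam_o * L_o + lam_b * L_b (weights from the source)."""
    l_p, l_o, l_b = loss_terms(pos, pos_orig, sizes, room_c, room_s)
    return l_p + lam_o * l_o + lam_b * l_b

def numerical_grad(f, x, h=1e-4):
    """Central-difference gradient; a stand-in for autodiff in this sketch."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = h
        grad[idx] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

def refine(pos, sizes, room_c, room_s, lr=0.01, eps=0.5, tol=1e-9, max_steps=500):
    """Gradient descent on L_total with an early exit once the layout is
    collision-free, inside the room, and the loss is below eps.
    lr, eps, tol, and max_steps are assumed values, not from the source."""
    pos_orig = pos.copy()
    f = lambda p: total_loss(p, pos_orig, sizes, room_c, room_s)
    for _ in range(max_steps):
        pos = pos - lr * numerical_grad(f, pos)
        _, l_o, l_b = loss_terms(pos, pos_orig, sizes, room_c, room_s)
        if l_o < tol and l_b < tol and f(pos) < eps:
            break
    return pos

if __name__ == "__main__":
    # Two unit boxes that start out overlapping inside a 4 x 4 room.
    pos = np.array([[1.0, 1.0], [1.5, 1.0]])
    sizes = np.array([[1.0, 1.0], [1.0, 1.0]])
    room_c, room_s = np.array([2.0, 2.0]), np.array([4.0, 4.0])
    refined = refine(pos, sizes, room_c, room_s)
    print("overlap before:", inter_area(pos[0], sizes[0], pos[1], sizes[1]))
    print("overlap after: ", inter_area(refined[0], sizes[0], refined[1], sizes[1]))
```

In this toy setup the two boxes are pushed apart along the axis on which they overlap, while the position term keeps them close to where they started. With a fixed step size the objects can overshoot the minimal separation, so a real implementation would likely use a smaller step or an adaptive optimizer.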