Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Authors: Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Jun Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods. We evaluate DynamicGen through three benchmarks: video depth estimation, camera pose and intrinsics estimation. In this section, we present experimental results to evaluate the robustness of our DynamicGen pipeline.
Researcher Affiliation Collaboration Kairun Wen1, Yuzhi Huang1, Runyu Chen1, Hui Zheng1, Yunlong Lin1, Panwang Pan1, Chenxin Li2, Wenyan Cong3, Jian Zhang1, Junbin Lu4, Chenguo Lin5, Dilin Wang6, Zhicheng Yan6, Hongyu Xu6, Justin Theiss6, Yue Huang1, Xinghao Ding1B, Rakesh Ranjan6, Zhiwen Fan3. Affiliations: 1 XMU, 2 CUHK, 3 UT Austin, 4 UW, 5 PKU, 6 Meta.
Pseudocode No The paper describes its methodology in text and provides pipeline diagrams (e.g., Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes The code, which will be made publicly available, is uploaded as a zip file.
Open Datasets Yes To address the scarcity of available 4D scene data, DynamicGen unifies video data from various real-world video datasets, including DAVIS2017 [24], Youtube-VIS [25], UVO-dense [26], VOST [27], BURST [28], MOSE [29] and SA-V [30], alongside existing synthetic 4D datasets from PointOdyssey [15], Spring [16], Dynamic Replica [17], MVS-Synth [18], RealCam-Vid [19] and DynPose-100K [20]. The inclusion of these datasets is mainly motivated by their potential as scalable data sources for 4D scene understanding. All utilized data are sourced from open-access platforms.
Dataset Splits Yes Evaluations are conducted on the Sintel [33] and KITTI [75] datasets, following standard protocols [37] by applying global shift and scale alignment to the predicted depth maps. Experiments are conducted on the Sintel [33] and TUM-dynamics [78] datasets, following LEAP-VO's split for Sintel and subsampling the first 270 frames of TUM-dynamics, as done in MonST3R. To assess caption quality, we sampled 100 videos from the SA-V dataset [30]. For this evaluation, we trained the model on the "americano" scene from the HyperNeRF dataset and benchmarked it against a re-implemented 4D-LangSplat* baseline. On a random sample of 100 videos from SA-V data, our generated captions demonstrated high performance across all four criteria, as detailed in Tab. 8. On a sub-sample of 88 videos from our dataset (i.e., filtered DAVIS), our captions performed excellently.
Hardware Specification Yes For a reproducible analysis of computational performance, we processed the entire Sintel training set (23 videos) on NVIDIA H20 GPUs. [Table excerpt; columns: Module, Hardware Used, Avg. Time / Sintel Video (mins), Peak VRAM (GB), Notes. Hardware entries listed: 1x H20 GPU, 2x H20 GPU, 1x CPU Core.]
Software Dependencies Yes Specifically, our pipeline first employs Qwen2.5-VL [61] to identify moving objects and determine their semantic categories. These categories are then used to prompt SA2VA [54] for generating corresponding object masks. For video depth estimation, we use UniDepthV2 [52], a monocular depth estimation network, to estimate initial depth maps D and initial camera intrinsics Kinit. For dense pixel motion estimation, we utilize Co-TrackerV3 [51] for its robustness.
Experiment Setup Yes To address this, we developed a filtering strategy incorporating several distinct criteria: proximal depth, focal-length stability, video blur, camera motion smoothness, and non-perspective distortion. Each of these aspects is quantified by a normalized score. We combine these scores as features and employ a Random Forest model to predict a video quality score ranging from 0 to 5. For model training, we manually annotated approximately 1,000 videos, assigning scores between 0 (indicating largely unsuitable, poor quality or insufficient dynamics) and 5 (indicating highly suitable, good quality and sufficient dynamics). We perform this over all pairs within a temporal sliding window of 5 frames.
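
The Dataset Splits entry above mentions "global shift and scale alignment" of predicted depth maps before evaluation. As a rough illustration of what such an alignment typically involves, the sketch below fits a single scale s and shift t by ordinary least squares so that s * pred + t best matches the ground truth. This is an assumption about the general technique, not the paper's evaluation code: the function names `fit_scale_shift` and `align_depth` are hypothetical, depth maps are represented as flat lists of valid pixel values, and the cited protocol may differ in detail (for example, by aligning in disparity space or masking invalid pixels first).

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], target: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of (s, t) minimizing sum((s * p + t - g) ** 2).

    Hypothetical helper for illustration; inputs are flattened depth values
    at pixels assumed to be valid in both maps.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two equally sized inputs with at least 2 values")

    n = float(len(pred))
    sum_p = sum(pred)
    sum_g = sum(target)
    sum_pp = sum(p * p for p in pred)
    sum_pg = sum(p * g for p, g in zip(pred, target))

    # Determinant of the 2x2 normal equations; zero when pred is constant.
    denom = n * sum_pp - sum_p * sum_p
    if denom == 0.0:
        raise ValueError("prediction is constant; scale is not identifiable")

    scale = (n * sum_pg - sum_p * sum_g) / denom
    shift = (sum_g - scale * sum_p) / n
    return scale, shift


def align_depth(pred: Sequence[float], target: Sequence[float]) -> List[float]:
    """Apply one global scale and shift to every predicted value."""
    scale, shift = fit_scale_shift(pred, target)
    return [scale * p + shift for p in pred]


if __name__ == "__main__":
    # Toy example: the ground truth is an exact affine function of the prediction.
    predicted = [1.0, 2.0, 3.0, 4.0]
    ground_truth = [2.5, 4.5, 6.5, 8.5]
    print(fit_scale_shift(predicted, ground_truth))
    print(align_depth(predicted, ground_truth))
```

Because the same two parameters are applied to the whole map (or sequence), this kind of alignment removes only a global affine ambiguity; errors in relative structure still show up in the metrics computed afterwards.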