Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

RoboScape: Physics-informed Embodied World Model

Authors: Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. We conduct comprehensive experiments to evaluate our world model from three aspects: video generation quality, robotic policy learning using synthetic data, and robotic policy evaluation.
Researcher Affiliation | Collaboration | Yu Shang¹, Xin Zhang², Yinzhou Tang¹, Lei Jin¹, Chen Gao¹, Wei Wu², Yong Li¹; ¹Tsinghua University, ²Manifold AI
Pseudocode | No | The paper describes the methodology using text and diagrams (e.g., Figure 2: 'Overview of the physics-informed world model'), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and demos are available at: https://github.com/tsinghua-fib-lab/RoboScape.
Open Datasets | Yes | In our experiment, we use 50,000 videos extracted from the AgiBot World-Beta dataset [40], covering 147 tasks and 72 skills. ... We further validated our approach using the π0 [32] model on the challenging LIBERO [44] task suite. ... In the experiments on the Robomimic Lift task [43]...
Dataset Splits | Yes | Our dataset comprises approximately 6.5M training clips and 1.2K test clips.
Hardware Specification | Yes | Training completes in approximately 24 hours on a cluster of 32 NVIDIA A800-SXM4-80GB GPUs.
Software Dependencies | No | The paper mentions several tools and models like MAGVIT-2, Video Depth Anything, SpatialTracker, TransNetV2, Intern-VL, FlowNet, Diffusion Policy (DP), and π0. However, it does not specify version numbers for these software components or any programming languages/libraries (e.g., Python, PyTorch) used.
Experiment Setup | Yes | We preprocess videos by extracting 16-frame clips sampled at 2 Hz, yielding approximately 6.5 million training clips. The model is trained for 5 epochs using the following hyperparameters: λ1 = 1, λ2 = 0.01, λ3 = 1, and γ = 5. During inference, we use the first frame as a conditional input to autoregressively predict the subsequent 15 frames.
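
The clip extraction described under Experiment Setup (16-frame clips sampled at 2 Hz, with the first frame used as the condition and the remaining 15 as prediction targets) can be sketched as follows. This is a minimal illustration and not the authors' preprocessing code: the function names are hypothetical, and the source frame rate, the non-overlapping clip layout, and the dropping of incomplete trailing clips are all assumptions that the summary above does not state.

```python
from typing import List, Tuple


def clip_frame_indices(
    num_frames: int,
    source_fps: float,
    sample_hz: float = 2.0,  # sampling rate quoted in the Experiment Setup row
    clip_len: int = 16,      # frames per clip quoted in the Experiment Setup row
) -> List[List[int]]:
    """Return source-frame indices for clips sampled at `sample_hz`.

    Assumptions (not from the paper): clips do not overlap, and a trailing
    group with fewer than `clip_len` frames is dropped.
    """
    step = source_fps / sample_hz  # source frames between two sampled frames

    # Collect every frame index that falls on the 2 Hz sampling grid.
    sampled: List[int] = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= num_frames:
            break
        sampled.append(idx)
        k += 1

    # Group the sampled frames into consecutive, non-overlapping clips.
    return [
        sampled[start:start + clip_len]
        for start in range(0, len(sampled) - clip_len + 1, clip_len)
    ]


def split_condition_and_targets(clip: List[int]) -> Tuple[int, List[int]]:
    """Split a clip into the conditioning first frame and the frames to predict."""
    return clip[0], clip[1:]
```

At 2 Hz, consecutive sampled frames are 0.5 s apart, so one 16-frame clip spans 7.5 s of video. For a hypothetical 30 fps source, `clip_frame_indices(900, 30.0)` samples every 15th frame, and `split_condition_and_targets` then separates each clip into one conditioning frame and 15 target frames, matching the inference setup quoted above.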