Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
Authors: Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Yue Liao, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of proposed method, we evaluate ENERVERSE in two different domains, e.g. video generation quality and robotic policy performance. ... Tab. 1 illustrates that our method substantially outperforms DynamiCrafter (FN) in both quantitative and qualitative evaluations. In terms of quantitative metrics, our approach achieves a higher PSNR and a lower FVD. |
| Researcher Affiliation | Collaboration | ¹SJTU, ²AgiBot, ³Shanghai AI Lab, ⁴CUHK MMLab, ⁵LV-NUS Lab |
| Pseudocode | No | The paper describes processes like chunk-wise autoregressive generation and multi-view diffusion generator block in text and figures, but does not present any formal pseudocode or algorithm blocks. Figure 12 shows a block diagram for the action policy head, not pseudocode. |
| Open Source Code | No | This may be temporary, and we are working hard to promote the process of open source. (Justification for Question 5 in NeurIPS Checklist) |
| Open Datasets | Yes | Training Data: We selected several public datasets characterized by well-defined task logic, including RT-1 [4], Taco-Play [39], ManiSkill [14], Bridge V1 [46], Language Table [27], and RoboTurk [28] for pretraining. |
| Dataset Splits | Yes | We evaluate robotic policies using the LIBERO [26] benchmark, which consists of four distinct task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long. Each suite contains 10 tasks, each with 50 human demonstrations. ... For evaluation, all models are tested across tasks using 50 rollouts per task, with results averaged over three random seeds. ... CALVIN [29] is an open-source simulated benchmark designed for learning long-horizon tasks. It consists of four distinct scenes (A, B, C, and D) and introduces the ABC→D evaluation protocol, where models are trained on environments A, B, and C and evaluated on environment D. |
| Hardware Specification | Yes | For efficiency, ENERVERSE-A reuses features from the first denoising step and predicts action chunks, achieving about 280 ms per 8-step action chunk on a single RTX 4090. ... During the action-related fine-tuning training stage, using LIBERO-Spatial as an example, the single S-RGB setting requires 8 A100 GPUs for approximately 20 hours during the video generation adaptation stage and an additional 12 hours for the action learning stage. |
| Software Dependencies | No | Our model is conducted based on UNet-based Video Diffusion Models (VDM) [53], and can be easily adapted to DiT [32] architectures. ... For ENERVERSE-D, we integrate 4D Gaussian Splatting using the official implementation [50]. ... The action head adopts the Diffusion Policy (DP) architecture [10]. The paper mentions general software frameworks and models but does not specify version numbers for libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | Table 10: Training details and hyperparameters used in our experiments. Diffusion Setup: Diffusion steps: 1000; Noise schedule: Linear; β0 = 0.00085; βT = 0.0120. Sampling Parameters: Sampler: DDIM; Steps: 500. Input: Video resolution: 320 × 512; Chunk size: 8; Encoded with VAE ... Video Training: Learning rate: 5 × 10⁻⁵; Optimizer: Adam; Batch/GPU (single-view): 8; Batch/GPU (multi-view): 1; Parameterization: v-prediction; Max steps: 100,000; Gradient clipping: 0.5 (norm). Policy Training: Same as video training, but with sample-prediction parameterization. Number of Parameters: Base model (DynamiCrafter): 1.4B; Policy head (DiT blocks): 190M; VAE (frozen): 83.7M |
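
The diffusion settings quoted in the Experiment Setup row (1000 diffusion steps, a linear noise schedule from β0 = 0.00085 to βT = 0.0120, and 500 DDIM sampling steps) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that "Linear" means plain linear interpolation of β and that the DDIM steps are uniformly spaced. Some latent-diffusion codebases interpolate linearly in sqrt(β) instead, and the quoted table does not say which variant was used. The function names are hypothetical.

```python
def linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.0120):
    """Plain linear interpolation of beta (assumed reading of 'Linear')."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alphas_cumprod(betas):
    """Running product of (1 - beta_t): fraction of signal variance kept at step t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def ddim_timesteps(num_train_steps=1000, num_sample_steps=500):
    """Evenly spaced subset of training timesteps (uniform spacing is an assumption)."""
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))


betas = linear_betas()
abar = alphas_cumprod(betas)
steps = ddim_timesteps()

print("number of betas:", len(betas))
print("first DDIM timesteps:", steps[:3], "last:", steps[-1])
print("signal retained at final step:", abar[-1])
```

With 500 sampling steps over 1000 training steps, the sampler under these assumptions visits every second training timestep.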