RoboDreamer: Learning Compositional World Models for Robot Imagination

Authors: Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, Chuang Gan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments: In this section, we evaluate the proposed RoboDreamer model in terms of its ability to enable generalizable compositional generation.
Researcher Affiliation | Academia | (1) Hong Kong University of Science and Technology, (2) Massachusetts Institute of Technology, (3) University of California, San Diego, (4) University of Central Florida, (5) University of Massachusetts Amherst.
Pseudocode | Yes | Algorithm 1 (Training) and Algorithm 2 (Inference)
Open Source Code | No | The paper mentions using a third-party open-source codebase from Ko et al. (2023) but does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | We take the real-world robotics dataset RT-1 (Brohan et al., 2022) to evaluate video generation. The dataset consists of various robotic manipulation tasks, e.g., "pick brown chip bag from middle drawer", where the robot is required to detect the middle drawer, pick a brown chip bag, and then place it on the table. Specifically, we train RoboDreamer on about 70k demonstrations and 500 different tasks.
Dataset Splits | No | The paper uses the RT-1 dataset and mentions training on about 70k demonstrations, but does not provide specific train/validation/test splits with percentages or sample counts.
Hardware Specification | Yes | We train our video diffusion models with 256 batch size and 5e-5 learning rate on about 100 V100 GPUs.
Software Dependencies | No | The paper mentions pre-trained models such as T5-XXL and components from Stable Diffusion (the VQVAE) but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | The video diffusion model of RoboDreamer is built upon AVDC (Ko et al., 2023) and Imagen (Ho et al., 2022), and we utilize a three-stage cascaded diffusion model for super-resolution. For the video diffusion models, we use 4 ResNet blocks within the U-Net; each block is composed of spatial-temporal convolution layers (for efficiency) and cross-attention layers conditioned on instructions. We introduce temporal-attention layers in the last block within the encoder of the U-Net and the first block within the decoder. The base channel is 128 and the channel multiplier is [1, 2, 4, 8]. We train our video diffusion models with 256 batch size and 5e-5 learning rate on about 100 V100 GPUs.
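
The quoted setup pins down the U-Net's shape (4 ResNet blocks, spatial-temporal convolutions, instruction cross-attention, base channel 128, channel multiplier [1, 2, 4, 8]) but not the code itself. The sketch below is a minimal, hypothetical PyTorch rendering of one such block, not the authors' implementation (which builds on the AVDC and Imagen codebases): the factorized (1,3,3)/(3,1,1) convolution split, the class and variable names, and the assumed T5-XXL embedding width of 4096 are illustrative assumptions.

# Minimal sketch (assumed, not the authors' code) of a spatial-temporal ResNet block
# with cross-attention over text-instruction embeddings, sized per the reported config.
import torch
import torch.nn as nn

BASE_CHANNELS = 128            # "base channel is 128"
CHANNEL_MULT = [1, 2, 4, 8]    # "channel multiplier is [1, 2, 4, 8]"
TEXT_DIM = 4096                # assumed T5-XXL encoder embedding width

class SpatioTemporalResBlock(nn.Module):
    """ResNet block: spatial conv over (H, W), temporal conv over frames,
    then cross-attention conditioned on encoded language instructions."""
    def __init__(self, in_ch, out_ch, text_dim=TEXT_DIM, heads=8):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, in_ch)
        # Factorized 3D convs: (1, 3, 3) mixes space only, (3, 1, 1) mixes time only.
        self.spatial_conv = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal_conv = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.norm2 = nn.GroupNorm(32, out_ch)
        self.attn = nn.MultiheadAttention(out_ch, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x, text_emb):
        # x: (B, C, T, H, W) video latents; text_emb: (B, L, text_dim) instruction tokens.
        h = self.temporal_conv(self.spatial_conv(self.act(self.norm1(x))))
        h = h + self.skip(x)
        # Cross-attention: flatten space-time into a token axis and attend to the text.
        b, c, t, hh, ww = h.shape
        tokens = self.act(self.norm2(h)).flatten(2).transpose(1, 2)   # (B, T*H*W, C)
        attn_out, _ = self.attn(tokens, text_emb, text_emb)
        return h + attn_out.transpose(1, 2).reshape(b, c, t, hh, ww)

# Channel widths of the 4 blocks implied by base channel 128 and multiplier [1, 2, 4, 8].
widths = [BASE_CHANNELS * m for m in CHANNEL_MULT]   # [128, 256, 512, 1024]

if __name__ == "__main__":
    block = SpatioTemporalResBlock(widths[0], widths[1])
    video = torch.randn(1, widths[0], 8, 16, 16)      # (B, C, T, H, W), toy resolution
    text = torch.randn(1, 12, TEXT_DIM)               # 12 instruction tokens
    print(block(video, text).shape)                   # torch.Size([1, 256, 8, 16, 16])

Running the toy example prints torch.Size([1, 256, 8, 16, 16]): the block maps 128-channel video latents to 256 channels while preserving the 8-frame, 16x16 spatial layout, which is the kind of per-block behavior the reported architecture implies.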