RoboDreamer: Learning Compositional World Models for Robot Imagination
Authors: Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, Chuang Gan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments: In this section, we evaluate the proposed RoboDreamer model in terms of its ability to enable generalizable compositional generation. |
| Researcher Affiliation | Academia | Hong Kong University of Science and Technology; Massachusetts Institute of Technology; University of California, San Diego; University of Central Florida; University of Massachusetts Amherst. |
| Pseudocode | Yes | Algorithm 1 Training and Algorithm 2 Inference |
| Open Source Code | No | The paper mentions using a third-party open-source codebase (Ko et al., 2023) but does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | Yes | We take the real-world robotics dataset RT-1 (Brohan et al., 2022) to evaluate video generation. The dataset consists of various robotic manipulation tasks, e.g., "pick brown chip bag from middle drawer", where the robot is required to detect the middle drawer, pick a brown chip bag, and then place it on the table. Specifically, we train RoboDreamer on about 70k demonstrations and 500 different tasks. |
| Dataset Splits | No | The paper uses the RT-1 dataset and mentions training on 70k demonstrations, but does not provide specific train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | We train our video diffusion models with a batch size of 256 and a learning rate of 5e-5 on about 100 V100 GPUs. |
| Software Dependencies | No | The paper mentions pre-trained models like T5-XXL and components from Stable Diffusion (VQVAE) but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | The video diffusion model of RoboDreamer is built upon AVDC (Ko et al., 2023) and Imagen (Ho et al., 2022), with a three-stage cascaded diffusion model for super-resolution. The U-Net contains 4 ResNet blocks, each composed of spatial-temporal convolution layers (used for efficiency) and cross-attention layers conditioned on instructions; temporal-attention layers are introduced in the last block of the U-Net encoder and the first block of the decoder. The base channel is 128 and the channel multiplier is [1, 2, 4, 8]. We train our video diffusion models with a batch size of 256 and a learning rate of 5e-5 on about 100 V100 GPUs. |
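
For reference, the architecture and training hyperparameters reported in the Experiment Setup and Hardware Specification rows can be gathered into a minimal configuration sketch. The class and field names below are hypothetical and do not come from the authors' code; only the numeric values are taken from the quoted paper text above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoUNetConfig:
    """Hypothetical container for the U-Net hyperparameters quoted above."""
    base_channels: int = 128                       # "The base channel is 128"
    channel_multipliers: List[int] = field(default_factory=lambda: [1, 2, 4, 8])
    num_resnet_blocks: int = 4                     # 4 ResNet blocks within the U-Net
    spatial_temporal_conv: bool = True             # spatial-temporal convolutions for efficiency
    cross_attention_on_instructions: bool = True   # cross-attention conditioned on instructions
    # Temporal attention only in the last encoder block and the first decoder block.
    temporal_attention_blocks: Tuple[str, str] = ("encoder_last", "decoder_first")
    cascade_stages: int = 3                        # three-stage cascaded super-resolution


@dataclass
class TrainingConfig:
    """Hypothetical container for the reported training setup."""
    batch_size: int = 256
    learning_rate: float = 5e-5
    gpus: int = 100                                # "about 100 V100 GPUs"


if __name__ == "__main__":
    # Quick sanity check that the reported values round-trip.
    print(VideoUNetConfig())
    print(TrainingConfig())
```
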