Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models
Authors: Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, Sergey Levine
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTAL EVALUATION |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Stanford University 3Google DeepMind |
| Pseudocode | Yes | Algorithm 1 SuSIE: Zero-Shot, Test-Time Execution |
| Open Source Code | No | The project website can be found at http://rail-berkeley.github.io/susie. |
| Open Datasets | Yes | Our dataset is Bridge Data V2 [59], a large and diverse dataset of robotic manipulation behaviors designed for evaluating open-vocabulary instructions. ... Our video-only dataset Dl is the Something-Something dataset [19], a dataset consisting of short video clips of humans manipulating various objects. |
| Dataset Splits | No | No explicit percentages or absolute sample counts for training, validation, and test splits were provided within the paper for general datasets, although environmental splits were noted for CALVIN. |
| Hardware Specification | Yes | We train for 40k steps with a batch size of 1024 on a single v4-64 TPU pod, which takes 17 hours. ... We train with a batch size of 256 for 445k steps on a single v4-8 TPU VM, which takes 15 hours. |
| Software Dependencies | No | The paper mentions software components like Instruct Pix2Pix, OWL-ViT, Flan-T5-Base, CLIP, MUSE, and DDIM sampler, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We finetune Instruct Pix2Pix [9] using similar hyperparameters to the initial Instruct Pix2Pix training. We use the Adam W optimizer [40] with a learning rate of 1e-4, a linear warmup of 800 steps, and weight decay of 0.01. ... At test time, we use an image guidance weight of 2.5 and a text guidance weight of 7.5. We use the DDIM sampler [56] with 50 sampling steps. ... We use the Adam optimizer [36] with a learning rate of 3e-4 and a linear warmup of 2000 steps. We train with a batch size of 256 for 445k steps ... We augment the observation and goal with random crops, random resizing, and color jitter. |
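The Experiment Setup row reports a learning rate of 1e-4 with a linear warmup of 800 steps (and, for the policy, 3e-4 with 2000 warmup steps). A minimal sketch of such a linear-warmup schedule, assuming the rate stays constant after warmup since the quoted excerpt does not state a decay schedule:

```python
def lr_schedule(step, base_lr=1e-4, warmup_steps=800):
    """Linear warmup to base_lr over warmup_steps, then constant.

    Mirrors the reported diffusion-model setup (AdamW, lr 1e-4, 800
    warmup steps). Constant post-warmup behavior is an assumption;
    the paper excerpt does not specify a decay schedule.
    """
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Substituting `base_lr=3e-4, warmup_steps=2000` gives the policy-training variant quoted in the same row.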
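The test-time settings quoted above (image guidance 2.5, text guidance 7.5) follow the InstructPix2Pix two-way classifier-free guidance scheme, which combines three noise predictions per sampling step. A sketch of that combination, with the three predictions supplied as NumPy arrays (the function and argument names here are illustrative, not from the paper):

```python
import numpy as np

def combine_guidance(eps_uncond, eps_image, eps_full,
                     image_scale=2.5, text_scale=7.5):
    """InstructPix2Pix-style two-way classifier-free guidance.

    eps_uncond: noise prediction with no conditioning
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on both image and text
    Default scales match the test-time weights quoted in the table.
    """
    return (eps_uncond
            + image_scale * (eps_image - eps_uncond)
            + text_scale * (eps_full - eps_image))
```

At each of the 50 DDIM steps, the guided prediction returned here would replace the raw model output before the DDIM update.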