Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models

Authors: Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, Sergey Levine

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "5 EXPERIMENTAL EVALUATION" |
| Researcher Affiliation | Collaboration | University of California, Berkeley; Stanford University; Google DeepMind |
| Pseudocode | Yes | "Algorithm 1 SuSIE: Zero-Shot, Test-Time Execution" |
| Open Source Code | No | "The project website can be found at http://rail-berkeley.github.io/susie." |
| Open Datasets | Yes | "Our dataset is Bridge Data V2 [59], a large and diverse dataset of robotic manipulation behaviors designed for evaluating open-vocabulary instructions. ... Our video-only dataset D_l is the Something-Something dataset [19], a dataset consisting of short video clips of humans manipulating various objects." |
| Dataset Splits | No | No explicit percentages or absolute sample counts are given for training, validation, and test splits of the main datasets, although environment splits are noted for CALVIN. |
| Hardware Specification | Yes | "We train for 40k steps with a batch size of 1024 on a single v4-64 TPU pod, which takes 17 hours. ... We train with a batch size of 256 for 445k steps on a single v4-8 TPU VM, which takes 15 hours." |
| Software Dependencies | No | The paper names software components such as Instruct Pix2Pix, OWL-ViT, Flan-T5-Base, CLIP, MUSE, and the DDIM sampler, but provides no version numbers for these or other key dependencies. |
| Experiment Setup | Yes | "We finetune Instruct Pix2Pix [9] using similar hyperparameters to the initial Instruct Pix2Pix training. We use the AdamW optimizer [40] with a learning rate of 1e-4, a linear warmup of 800 steps, and weight decay of 0.01. ... At test time, we use an image guidance weight of 2.5 and a text guidance weight of 7.5. We use the DDIM sampler [56] with 50 sampling steps. ... We use the Adam optimizer [36] with a learning rate of 3e-4 and a linear warmup of 2000 steps. We train with a batch size of 256 for 445k steps ... We augment the observation and goal with random crops, random resizing, and color jitter." |
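The image guidance weight of 2.5 and text guidance weight of 7.5 quoted in the setup follow the two-way classifier-free guidance rule from Instruct Pix2Pix, which combines three U-Net noise predictions at each sampling step. Below is a minimal sketch of that combination; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def combine_guidance(eps_uncond, eps_image, eps_full,
                     s_image=2.5, s_text=7.5):
    """Two-way classifier-free guidance (Instruct Pix2Pix style).

    eps_uncond: noise prediction with neither image nor text conditioning
    eps_image:  noise prediction with image conditioning only
    eps_full:   noise prediction with both image and text conditioning
    s_image:    image guidance weight (2.5 in the quoted setup)
    s_text:     text guidance weight (7.5 in the quoted setup)
    """
    return (eps_uncond
            + s_image * (eps_image - eps_uncond)
            + s_text * (eps_full - eps_image))

# Toy usage with dummy 2x2 "noise maps" in place of real U-Net outputs:
eps_uncond = np.zeros((2, 2))
eps_image = np.ones((2, 2))
eps_full = np.full((2, 2), 2.0)
guided = combine_guidance(eps_uncond, eps_image, eps_full)
print(guided)  # each element: 0 + 2.5*(1-0) + 7.5*(2-1) = 10.0
```

In practice this combined prediction would feed the DDIM update at each of the 50 sampling steps mentioned above.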
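The two linear warmups in the quoted setup (800 steps to reach lr 1e-4 for diffusion finetuning; 2000 steps to reach lr 3e-4 for policy training) can be sketched as a simple schedule helper. This is an illustrative reconstruction of a standard linear-warmup schedule, not the authors' implementation.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to base_lr over
    warmup_steps optimizer steps, then hold it constant."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step / warmup_steps)

# Diffusion finetuning schedule from the quoted setup: lr 1e-4, 800-step warmup.
print(warmup_lr(400, 1e-4, 800))    # halfway through warmup -> ~5e-05
# Policy training schedule: lr 3e-4, 2000-step warmup, held after warmup ends.
print(warmup_lr(10_000, 3e-4, 2000))
```

The paper does not say whether the rate decays after warmup, so this sketch holds it constant.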