Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation

Authors: Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, Hongyang Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in simulation and on real-world robots verify the effectiveness of CLOVER. It surpasses the prior state of the art by a notable margin (+8%) on CALVIN.
Researcher Affiliation | Collaboration | Qingwen Bu (1,2), Jia Zeng (1), Li Chen (1,3), Yanchao Yang (3), Guyue Zhou (4), Junchi Yan (2), Ping Luo (3), Heming Cui (3), Yi Ma (3), Hongyang Li (1). Affiliations: (1) Shanghai AI Lab, (2) Shanghai Jiao Tong University, (3) HKU, (4) Tsinghua University. Equal contribution and corresponding authors are indicated in the paper.
Pseudocode | Yes | Algorithm 1: CLOVER: Test-time Execution
Open Source Code | Yes | Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.
Open Datasets | Yes | We conduct the majority of our experiments using CALVIN [54], an evaluation benchmark designed for long-horizon, language-conditioned manipulation.
Dataset Splits | Yes | We train policy models on demonstrations collected from environments A, B, and C, and conduct zero-shot evaluations in environment D.
Hardware Specification | Yes | Models are trained on a system equipped with 8 A100 GPUs with the batch size set to 32.
Software Dependencies | No | The paper mentions software components such as Imagen, CLIP, and RAFT, but does not specify their versions, nor the versions of ancillary dependencies such as Python or PyTorch.
Experiment Setup | Yes | Our diffusion model-based planner can be factorized as $p_\phi(O_{1:K} \mid O_0, c_l)$, where $c_l$ denotes the language condition. During training it acts as a denoising function $\epsilon_\phi$ that predicts the noise applied to future video frames $O_{1:K}$ [68]. Given the noise schedule $\beta_t$, the training objective of the diffusion model is

$$\mathcal{L}_{\text{diff}} = \sum_{k=1}^{K} \big\| \epsilon - \epsilon_\phi\big(\sqrt{1-\beta_t}\, O_k + \sqrt{\beta_t}\, \epsilon \,\big|\, t, c_l\big) \big\|^2, \tag{3}$$

where the noise $\epsilon \in \{\epsilon_{\text{RGB}}, \epsilon_{\text{Depth}}\}$ is drawn from a multivariate standard Gaussian distribution and $t$ is a randomly selected diffusion step. Noises for RGB and depth are sampled separately from two independent distributions. We further adopt the min-SNR weighting strategy [69] to speed up convergence. Combining the flow-based regularization term from Section 3.1, the final optimization objective of the visual planner is

$$\mathcal{L}_{\text{planner}} = \mathcal{L}_{\text{diff}} + \lambda \mathcal{L}_{\text{reg}}, \tag{4}$$

where $\lambda$ is a balancing factor set to 0.1 by default.

In our experiments on the CALVIN [54] benchmark, we train the diffusion model for 300k iterations with a learning rate of 1e-4. Models are trained on a system equipped with 8 A100 GPUs with the batch size set to 32. We adopt the AdamW optimizer without weight decay. In addition, we track an exponential moving average (EMA) of the model parameters with a decay rate of 0.999 and use the EMA parameters at test time. For real-world experiments, we fine-tune the diffusion model for 50,000 iterations on 50 collected demonstrations. Due to hardware limitations, we were unable to collect depth data in the real environment, so the model generates RGB images only.

For test-time execution, the DDIM sampler [70] is employed with 20 sampling steps to strike a balance between efficiency and quality. The text guidance weight is set to 4 to generate visual plans that align with the linguistic descriptions.

Feedback-driven policy. To optimize the policy model, we use mean squared error and binary cross-entropy losses to supervise the end-effector position $a_{\text{EE}} \in \mathbb{R}^6$ and the gripper state $a_{\text{gripper}} \in \mathbb{R}^1$, respectively. In each training episode, two frames with an interval ranging from 1 to $k_{\max} = 5$ are sampled as inputs to enhance the model's robustness. We train the policy on the ABC training split of CALVIN for 10 epochs with a batch size of 128. Only the relative Cartesian action of a single timestep is used for training. Training takes around 10 hours on 8 A100 GPUs.
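As a concrete illustration of the training objective in Eq. (3) combined with min-SNR weighting, here is a minimal PyTorch sketch. It uses the standard cumulative-noise ($\bar\alpha_t$) parameterization of the forward process; `eps_model`, `alphas_cumprod`, and `snr_gamma` are assumed names for illustration, not CLOVER's actual implementation.

```python
import torch
import torch.nn.functional as F

def planner_diffusion_loss(eps_model, frames, lang_cond, alphas_cumprod, snr_gamma=5.0):
    """frames: (B, K, C, H, W) clean future frames O_{1:K}; lang_cond: text condition c_l."""
    B = frames.shape[0]
    t = torch.randint(0, alphas_cumprod.numel(), (B,), device=frames.device)
    a_bar = alphas_cumprod[t]                                    # (B,) cumulative noise level
    # RGB and depth noise come from independent Gaussians; randn_like already
    # draws i.i.d. noise across all channels, so one call suffices here.
    eps = torch.randn_like(frames)
    x_t = (a_bar.sqrt().view(B, 1, 1, 1, 1) * frames
           + (1 - a_bar).sqrt().view(B, 1, 1, 1, 1) * eps)       # forward (noising) process
    pred = eps_model(x_t, t, lang_cond)                          # predict the applied noise
    # min-SNR weighting for epsilon-prediction: min(SNR, gamma) / SNR per sample
    snr = a_bar / (1 - a_bar)
    w = torch.minimum(snr, torch.full_like(snr, snr_gamma)) / snr
    per_sample = F.mse_loss(pred, eps, reduction="none").mean(dim=(1, 2, 3, 4))
    return (w * per_sample).mean()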
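The surrounding optimization setup described above (Eq. (4)'s weighted sum with λ = 0.1, AdamW without weight decay, EMA with decay 0.999) might be wired up as in the following sketch; the `nn.Linear` stand-in, the dummy losses, and the loop length are placeholders.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

planner = nn.Linear(8, 8)  # stand-in for the denoising network
opt = torch.optim.AdamW(planner.parameters(), lr=1e-4, weight_decay=0.0)
# EMA of parameters with decay 0.999; the EMA weights are used at test time
ema = AveragedModel(planner, avg_fn=lambda avg, cur, n: 0.999 * avg + 0.001 * cur)

lam = 0.1  # balancing factor lambda from Eq. (4)
for step in range(100):  # placeholder; the paper trains for 300k iterations
    x = torch.randn(32, 8)
    l_diff = planner(x).pow(2).mean()   # stand-in for the diffusion loss (Eq. 3)
    l_reg = planner(x).abs().mean()     # stand-in for the flow-based regularizer
    loss = l_diff + lam * l_reg         # L_planner = L_diff + lambda * L_reg
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema.update_parameters(planner)
```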
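The test-time settings (20 DDIM steps, text guidance weight 4) could be realized as below. This sketch assumes a diffusers-style `DDIMScheduler` and the standard classifier-free-guidance blend of conditional and unconditional predictions, which the excerpt itself does not spell out.

```python
import torch
from diffusers import DDIMScheduler

@torch.no_grad()
def sample_plan(eps_model, cond, uncond, shape, guidance=4.0, steps=20):
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(steps)           # 20 steps: efficiency/quality trade-off
    x = torch.randn(shape)                   # start from pure Gaussian noise
    for t in scheduler.timesteps:
        # classifier-free guidance with text weight w = 4
        e_cond = eps_model(x, t, cond)
        e_uncond = eps_model(x, t, uncond)
        e = e_uncond + guidance * (e_cond - e_uncond)
        x = scheduler.step(e, t, x).prev_sample
    return x                                 # generated visual plan O_{1:K}
```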
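Finally, the feedback-driven policy's supervision (MSE on the 6-DoF end-effector action, BCE on the gripper state) and the 1-to-k_max frame-pair sampling admit a short sketch; `policy`, `obs`, and `goal` are hypothetical names standing in for the actual interfaces.

```python
import random
import torch
import torch.nn.functional as F

def policy_loss(policy, obs, goal, ee_target, grip_target):
    """ee_target: (B, 6) relative Cartesian action; grip_target: (B, 1) in {0, 1}."""
    ee_pred, grip_logit = policy(obs, goal)
    l_ee = F.mse_loss(ee_pred, ee_target)                           # end-effector action
    l_grip = F.binary_cross_entropy_with_logits(grip_logit, grip_target)  # gripper state
    return l_ee + l_grip

def sample_frame_pair(episode_len, k_max=5):
    """Pick a current frame and a goal frame 1..k_max steps ahead."""
    k = random.randint(1, k_max)
    t = random.randint(0, episode_len - 1 - k)
    return t, t + k
```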