Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation
Authors: Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, Hongyang Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in simulation and on real-world robots verify the effectiveness of CLOVER. It surpasses the prior state of the art by a notable margin (+8%) on CALVIN. |
| Researcher Affiliation | Collaboration | Qingwen Bu (1,2), Jia Zeng (1), Li Chen (1,3), Yanchao Yang (3), Guyue Zhou (4), Junchi Yan (2), Ping Luo (3), Heming Cui (3), Yi Ma (3), Hongyang Li (1); (1) Shanghai AI Lab, (2) Shanghai Jiao Tong University, (3) HKU, (4) Tsinghua University. Equal contribution and corresponding authors are marked in the paper. |
| Pseudocode | Yes | Algorithm 1: CLOVER: Test-time Execution |
| Open Source Code | Yes | Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER. |
| Open Datasets | Yes | We conduct the majority of our experiments using CALVIN [54], an evaluation benchmark designed for long-horizon, language-conditioned manipulation. |
| Dataset Splits | Yes | We train policy models on demonstrations collected from environments A, B, and C, and conduct zero-shot evaluations in environment D. |
| Hardware Specification | Yes | Models are trained on a system equipped with 8 A100 GPUs with the batch size set as 32. |
| Software Dependencies | No | The paper mentions software components like Imagen, CLIP, and RAFT, but does not specify their version numbers or the versions of other ancillary software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | Our diffusion model-based planner can be factorized as p_ϕ(O_{1:K} \| O_0, c_l), with c_l denoting the language condition. During training it acts as a denoising function ε_ϕ predicting the noise applied to future video frames O_{1:K} [68]. Given the noise scheduling β_t, the training objective of the diffusion model is L_diff = Σ_{k=1}^{K} ‖ε − ε_ϕ(√(1−β_t)·O_k + √(β_t)·ε \| t, c_l)‖² (3), where the noise ε ∈ {ε_RGB, ε_Depth} is drawn from a multivariate standard Gaussian distribution and t is a randomly selected diffusion step. Noise is sampled separately for RGB and depth from two independent distributions. We further adopt the min-SNR weighting strategy [69] to speed up convergence. Combined with the flow-based regularization term of Section 3.1, the final optimization objective of the visual planner is L_planner = L_diff + λ·L_reg (4), where the balancing factor λ is set to 0.1 by default. On the CALVIN [54] benchmark we train the diffusion model for 300k iterations with a learning rate of 1e-4. Models are trained on a system equipped with 8 A100 GPUs with the batch size set to 32. We adopt the AdamW optimizer without weight decay, track an exponential moving average (EMA) of the model parameters with a decay rate of 0.999, and use the EMA parameters at test time. For real-world experiments, we tune the diffusion model for 50,000 iterations on 50 collected demonstrations; due to hardware limitations we could not collect depth data in the real environment, so the model generates RGB images only. For test-time execution, the DDIM sampler [70] is employed with 20 sampling steps to balance efficiency and quality, and the text guidance weight is set to 4 so that generated visual plans align with the linguistic descriptions. Feedback-driven policy: to optimize the policy model, we use mean squared error and binary cross-entropy losses to supervise the end-effector's position a_EE ∈ ℝ⁶ and gripper state a_gripper ∈ ℝ¹, respectively. In each training episode, two frames with an interval ranging from 1 to k_max = 5 are sampled as inputs to enhance the model's robustness. We train the policy on the ABC training split of CALVIN for 10 epochs with a batch size of 128; only the relative Cartesian action of a single timestep is used for training. Training takes around 10 hours on 8 A100 GPUs. |
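The noise-prediction objective quoted in the Experiment Setup row, Eq. (3), and the combined planner loss, Eq. (4), can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: `denoiser` is a hypothetical stand-in for ε_ϕ, and the diffusion step t and language condition c_l are omitted for brevity.

```python
import numpy as np

def diffusion_loss(frames, beta_t, rng, denoiser):
    """Noise-prediction objective of Eq. (3), sketched per frame.

    frames:   array of K future frames O_1..O_K (RGB and depth would each
              get their own independently sampled noise, per the paper).
    beta_t:   noise-schedule value for the sampled diffusion step t.
    denoiser: hypothetical stand-in for eps_phi -- any callable mapping a
              noisy frame to a noise estimate of the same shape.
    """
    loss = 0.0
    for frame in frames:
        eps = rng.standard_normal(frame.shape)                 # eps ~ N(0, I)
        noisy = np.sqrt(1.0 - beta_t) * frame + np.sqrt(beta_t) * eps
        loss += np.mean((eps - denoiser(noisy)) ** 2)          # ||eps - eps_phi(.)||^2
    return loss

def planner_loss(l_diff, l_reg, lam=0.1):
    """Combined planner objective of Eq. (4): L_diff + lambda * L_reg."""
    return l_diff + lam * l_reg
```

With the paper's default λ = 0.1, `planner_loss(l_diff, l_reg)` reproduces the stated weighting; the min-SNR weighting [69] is not reproduced here.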
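The text guidance weight of 4 mentioned for test-time execution is presumably applied via standard classifier-free guidance; the blending rule below is the common formulation, assumed here rather than taken from the paper.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, w=4.0):
    """Standard classifier-free guidance blend (assumed, not from the paper):
    push the conditional noise estimate away from the unconditional one
    with text-guidance weight w (the paper sets w = 4)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At w = 1 this reduces to the plain conditional estimate; larger w trades sample diversity for tighter alignment with the language condition.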
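The feedback-driven policy's supervision (MSE on the 6-DoF end-effector action in ℝ⁶, binary cross-entropy on the gripper state in ℝ¹) can be sketched as follows; function and argument names are hypothetical, and the gripper prediction is taken as a pre-sigmoid logit.

```python
import numpy as np

def policy_loss(pred_ee, target_ee, pred_grip_logit, target_grip):
    """MSE on the end-effector action a_EE (R^6) plus binary cross-entropy
    on the gripper open/close state a_gripper (R^1)."""
    mse = np.mean((pred_ee - target_ee) ** 2)
    p = 1.0 / (1.0 + np.exp(-pred_grip_logit))                    # sigmoid
    bce = -(target_grip * np.log(p) + (1 - target_grip) * np.log(1 - p))
    return float(mse + bce)
```

For a perfect end-effector prediction and an uninformative gripper logit of 0, the loss reduces to the BCE term, log 2.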