VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions
Authors: Guangyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Tianxing Zhou, Zicai Peng, Mengxiao Hu, Haizhou Li, Li Yuan, Yi Yang, Yufeng Yue
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments exhibit that our VLMimic, using only 5 human videos, yields significant improvements of over 27% and 21% in RLBench and real-world manipulation tasks, and surpasses baselines by over 37% in long-horizon tasks. |
| Researcher Affiliation | Academia | Guangyan Chen¹, Meiling Wang¹, Te Cui¹, Yao Mu², Haoyang Lu¹, Tianxing Zhou¹, Zicai Peng¹, Mengxiao Hu¹, Haizhou Li¹, Li Yuan³, Yi Yang¹, Yufeng Yue¹ (¹Beijing Institute of Technology, ²The University of Hong Kong, ³Peking University) |
| Pseudocode | No | Not found. The paper describes processes and uses figures to illustrate the pipeline, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and videos are available at our home page. |
| Open Datasets | Yes | Our evaluation encompasses 12 manipulation tasks, as detailed in Table 1, demonstrating that our method surpasses all other methods in 11 out of these tasks. Our method, learned with only 5 human videos, obviously outperforms R3M-DP and DP by over 61% in overall performance, despite both being trained on 100 robot demonstrations. |
| Dataset Splits | No | The real-world testing environment (E) is divided into "seen" (SE) and "unseen" (UE) categories. The "seen" category allows for testing in the environment where demonstrations were collected, whereas the "unseen" category involves testing in a distinct environment characterized by different objects and layouts. |
| Hardware Specification | Yes | Experiments are conducted on Franka Emika [71], employing three RGB-D cameras (ORBBEC Femto Bolt)... All experiments are evaluated on an Intel i7-10700 CPU with an RTX 3090 graphics card. |
| Software Dependencies | No | In the human-object interaction grounding module, the Tokenize Anything [44] model is employed... SAM-Track [45; 46; 47; 48; 49] predicts hand and task-related object masks... FrankMocap [50] and the Iterative Closest Point (ICP) algorithm [51; 52] are employed... BundleSDF [53] is employed for object reconstruction, and FoundationPose [54] is leveraged... The robotic arm's motion planning is facilitated by the integration of the MoveIt module... and the OMPL [58] (Open Motion Planning Library)... the pretrained Grounded-segment-any-parts model [69; 70] is used... |
| Experiment Setup | Yes | The videos are segmented using a threshold ϵ of 2 cm. Segments with hand motion trajectory lengths below γ = 10 cm are discarded. During the grasping constraint learning phase, the number of regions N_c is automatically determined by the VLMs. In manipulation constraint learning, keypoints are obtained by uniformly sampling 10 points. For the skill adapter, the maximum number of iterations is set to N_I = 4. During grasping constraint adaptation, the visualized grasping position space is discretized into a 5×5 grid, with K = 4 outputs sampled per iteration. |
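
The parameter values reported in the Experiment Setup row translate directly into a small configuration. The sketch below collects them and shows one plausible reading of the segmentation rule (split the hand trajectory where frame-to-frame motion falls below ϵ, then discard segments whose total path length is shorter than γ). The function and variable names (`segment_by_hand_motion`, `traj`, `eps`, `gamma`) are illustrative assumptions, not the authors' released code.

```python
"""Sketch of the reported VLMimic hyper-parameters and a possible
segmentation rule. Constants follow the paper's setup; everything else
(names, exact segmentation criterion) is an assumption for illustration."""
import numpy as np

# Reported hyper-parameters
EPS_SEG = 0.02        # segmentation threshold epsilon: 2 cm
GAMMA_MIN_LEN = 0.10  # minimum hand-trajectory length gamma: 10 cm
N_KEYPOINTS = 10      # uniformly sampled keypoints for manipulation constraints
N_ITERS = 4           # maximum skill-adapter iterations N_I
GRID_SIZE = 5         # grasping position space discretised into a 5x5 grid
K_SAMPLES = 4         # outputs sampled per adaptation iteration

def segment_by_hand_motion(traj, eps=EPS_SEG, gamma=GAMMA_MIN_LEN):
    """Split a 3D hand trajectory of shape (N, 3) where frame-to-frame
    displacement drops below eps, and drop segments whose total path
    length is below gamma (assumed interpretation of the thresholds)."""
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    cut_points = np.where(steps < eps)[0] + 1      # candidate segment boundaries
    segments = np.split(traj, cut_points)
    return [s for s in segments
            if np.linalg.norm(np.diff(s, axis=0), axis=1).sum() >= gamma]

# Example: segment a synthetic 200-frame hand trajectory.
segments = segment_by_hand_motion(np.cumsum(np.random.randn(200, 3) * 0.01, axis=0))
```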
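
The Software Dependencies row cites the Iterative Closest Point (ICP) algorithm [51; 52] alongside FrankMocap for recovering hand poses. For reference, a minimal point-to-point ICP in NumPy/SciPy looks like the sketch below; this is a generic textbook formulation, not the paper's implementation, and the names (`icp`, `best_fit_transform`, `src`, `tgt`) are assumptions.

```python
"""Minimal point-to-point ICP sketch (NumPy + SciPy). Generic formulation
for illustration only; not the authors' code."""
import numpy as np
from scipy.spatial import cKDTree

def best_fit_transform(A, B):
    # Least-squares rigid transform (R, t) mapping point set A onto B (Kabsch).
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # correct an improper rotation (reflection)
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cB - R @ cA
    return R, t

def icp(src, tgt, max_iters=50, tol=1e-6):
    # Iteratively align source points `src` (N, 3) to target points `tgt` (M, 3).
    tree = cKDTree(tgt)
    cur = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(max_iters):
        dists, idx = tree.query(cur)               # nearest-neighbour correspondences
        R, t = best_fit_transform(cur, tgt[idx])   # rigid update for this iteration
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = dists.mean()
        if abs(prev_err - err) < tol:              # stop when the error plateaus
            break
        prev_err = err
    return R_total, t_total
```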