VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions

Authors: Guangyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Tianxing Zhou, Zicai Peng, Mengxiao Hu, Haizhou Li, Li Yuan, Yi Yang, Yufeng Yue

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments exhibit that our VLMimic, using only 5 human videos, yields significant improvements of over 27% and 21% in RLBench and real-world manipulation tasks, and surpasses baselines by over 37% in long-horizon tasks.
Researcher Affiliation | Academia | Guangyan Chen (1), Meiling Wang (1), Te Cui (1), Yao Mu (2), Haoyang Lu (1), Tianxing Zhou (1), Zicai Peng (1), Mengxiao Hu (1), Haizhou Li (1), Li Yuan (3), Yi Yang (1), Yufeng Yue (1); affiliations: 1 Beijing Institute of Technology, 2 The University of Hong Kong, 3 Peking University
Pseudocode | No | Not found. The paper describes processes and uses figures to illustrate the pipeline, but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code and videos are available at our home page.
Open Datasets | Yes | Our evaluation encompasses 12 manipulation tasks, as detailed in Table 1, demonstrating that our method surpasses all other methods in 11 out of these tasks. Our method, learned with only 5 human videos, obviously outperforms R3M-DP and DP by over 61% in overall performance, despite both being trained on 100 robot demonstrations.
Dataset Splits | No | The real-world testing environment (E) is divided into "seen" (SE) and "unseen" (UE) categories. The "seen" category allows for testing in the environment where demonstrations were collected, whereas the "unseen" category involves testing in a distinct environment characterized by different objects and layouts.
Hardware Specification | Yes | Experiments are conducted on Franka Emika [71], employing three RGB-D cameras (ORBBEC Femto Bolt)... All experiments are evaluated on an Intel i7-10700 CPU with an RTX 3090 graphics card.
Software Dependencies | No | In the human-object interaction grounding module, the Tokenize Anything [44] model is employed... SAM-Track [45; 46; 47; 48; 49] predicts hand and task-related object masks... FrankMocap [50] and the Iterative Closest Point (ICP) algorithm [51; 52] are employed... BundleSDF [53] is employed for object reconstruction, and FoundationPose [54] is leveraged... The robotic arm's motion planning is facilitated by the integration of the MoveIt module... and OMPL [58] (the Open Motion Planning Library)... the pretrained Grounded-segment-any-parts model [69; 70] is used... (A hypothetical sketch of how these components compose appears after the table.)
Experiment Setup | Yes | The videos are segmented using a threshold ϵ of 2 cm. Segments with hand motion trajectory lengths below γ = 10 cm are discarded. During the grasping constraint learning phase, the number of regions Nc is automatically determined by the VLMs. In manipulation constraint learning, keypoints are obtained by uniformly sampling 10 points. For the skill adapter, the maximum number of iterations is set to NI = 4. During grasping constraint adaptation, the visualized grasping position space is discretized into a 5 × 5 grid, with K = 4 outputs sampled per iteration.
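
The hyperparameters quoted in the Experiment Setup row can be gathered into one configuration object for quick reference. The Python sketch below is a minimal illustration under that assumption: the class and field names are hypothetical, only the numeric values (ϵ = 2 cm, γ = 10 cm, 10 keypoints, NI = 4, a 5 × 5 grid, K = 4) come from the quoted excerpt, and Nc is left unset because the paper states it is chosen automatically by the VLMs.

from dataclasses import dataclass
from typing import Optional

@dataclass
class VLMimicSetup:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    segment_threshold_m: float = 0.02        # epsilon: video segmentation threshold (2 cm)
    min_trajectory_len_m: float = 0.10       # gamma: discard segments with hand motion under 10 cm
    num_grasp_regions: Optional[int] = None  # Nc: determined automatically by the VLMs
    num_keypoints: int = 10                  # uniformly sampled keypoints for manipulation constraints
    max_adapter_iters: int = 4               # NI: maximum skill-adapter iterations
    grasp_grid_size: int = 5                 # grasping position space discretized into a 5 x 5 grid
    samples_per_iter: int = 4                # K: outputs sampled per adaptation iteration

# Instantiate the defaults reported in the paper
config = VLMimicSetup()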
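
The Software Dependencies row names the perception and planning components but gives no versions. The sketch below is a hypothetical composition of those stages: every function is a placeholder stub introduced here for illustration, not the real API of Tokenize Anything, SAM-Track, FrankMocap, BundleSDF, FoundationPose, MoveIt, or OMPL, and only the stage ordering follows the quoted excerpt.

# Hypothetical pipeline sketch; all functions are placeholder stubs, not library APIs.

def ground_interactions(frames):
    # Stage 1: human-object interaction grounding (Tokenize Anything in the paper).
    return [{"hand": None, "object": None} for _ in frames]

def track_masks(frames, interactions):
    # Stage 2: hand and task-related object masks (SAM-Track in the paper).
    return [None for _ in frames]

def estimate_hand_poses(frames, masks):
    # Stage 3: hand pose estimation refined with ICP (FrankMocap + ICP in the paper).
    return [None for _ in frames]

def track_object_poses(frames, masks):
    # Stage 4: object reconstruction and pose tracking (BundleSDF + FoundationPose in the paper).
    return [None for _ in frames]

def plan_and_execute(waypoints):
    # Stage 5: motion planning for the robot arm (MoveIt + OMPL in the paper).
    print(f"planning over {len(waypoints)} waypoints")

def run_pipeline(frames):
    interactions = ground_interactions(frames)
    masks = track_masks(frames, interactions)
    hand_poses = estimate_hand_poses(frames, masks)
    object_poses = track_object_poses(frames, masks)
    # Downstream, hand and object poses would be turned into constraints and waypoints.
    plan_and_execute(list(zip(hand_poses, object_poses)))

run_pipeline(frames=[0, 1, 2])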