PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Authors: Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, Xiaodan Liang

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
Researcher Affiliation Collaboration Kaidong Zhang (Sun Yat-sen University), Pengzhen Ren (Peng Cheng Laboratory), Bingqian Lin (Sun Yat-sen University), Junfan Lin (Peng Cheng Laboratory), Shikui Ma (Dataa Robotics), Hang Xu (Huawei Noah's Ark Lab), Xiaodan Liang (Sun Yat-sen University, Peng Cheng Laboratory)
Pseudocode No The paper describes its architecture and processes using text and diagrams (Figure 1, Figure 2), but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes https://abliao.github.io/PIVOT-R and from the NeurIPS checklist: "Our work will be open source after acceptance."
Open Datasets Yes We choose SeaWave [42], an open-source benchmark to learn multi-level instruction tasks, as our experimental platform, and use the corresponding data as demonstration data for imitation learning. ... The SeaWave dataset contains a total of 13K data covering four different levels of language instructions.
Dataset Splits No The paper states 'We train on this dataset and test on a specially divided test set' but does not explicitly mention or detail a validation dataset split.
Hardware Specification Yes All experiments involved in this paper are conducted on a single GPU server with 6 NVIDIA RTX4090 GPUs.
Software Dependencies No The paper mentions software such as LLaVA and CLIP, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup Yes The hyperparameter settings for PIVOT-R are shown in Table 6: LS = 12, LA = 3, image encoder CLIP-ViT-B/32, text encoder CLIP-ViT-B/32, Transformer heads 8, embedded dims 512, learning rate 3e-5, dropout 0.1.
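For reference, the Table 6 settings above can be collected into a single configuration object. This is a minimal sketch, not code released with the paper; the class and field names (e.g. `scene_layers` for LS, `action_layers` for LA) are assumptions based only on the hyperparameter list quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PivotRConfig:
    """Hyperparameters reported in Table 6 of PIVOT-R (field names assumed)."""
    scene_layers: int = 12             # LS in Table 6
    action_layers: int = 3             # LA in Table 6
    image_encoder: str = "CLIP-ViT-B/32"
    text_encoder: str = "CLIP-ViT-B/32"
    transformer_heads: int = 8
    embed_dims: int = 512
    learning_rate: float = 3e-5
    dropout: float = 0.1


cfg = PivotRConfig()
# Sanity check: embedding width must divide evenly across attention heads
# (512 / 8 = 64 dims per head).
assert cfg.embed_dims % cfg.transformer_heads == 0
print(cfg.embed_dims // cfg.transformer_heads)  # → 64
```

A frozen dataclass keeps the reported settings immutable and easy to log alongside any reproduction attempt.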