RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation
Authors: Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, Shanghang Zhang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models. Our project web page: https://sites.google.com/view/robomamba-web |
| Researcher Affiliation | Collaboration | Jiaming Liu1, Mengzhen Liu1, Zhenyu Wang1, Pengju An1, Xiaoqi Li1, Kaichen Zhou1, Senqiao Yang1, Renrui Zhang, Yandong Guo2, Shanghang Zhang1,3. 1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2AI2Robotics; 3Beijing Academy of Artificial Intelligence (BAAI) |
| Pseudocode | No | The paper contains architectural diagrams (Figure 2) and detailed descriptions of the training pipeline, but no explicit pseudocode or algorithm blocks labeled as such. |
| Open Source Code | No | We will open source the code as soon as it is ready. |
| Open Datasets | Yes | In the alignment pre-training stage, we utilize the LLaVA-LCS 558K dataset [67], which is a curated subset of the LAION-CC-SBU dataset, supplemented with captions. During the instruction co-training stage, we combine general instruction datasets with the robotic instruction datasets. Specifically, for the general instruction dataset, we selectively adopt the LLaVA mixed instruction dataset [4], the ShareGPT4V-SFT dataset [68], or the LLaVA-Next dataset [69]. For the robotic instruction dataset, we randomly sample some image-text paired training samples from the RoboVQA [27] dataset. In our main experiments, a mixture of the LLaVA 1.5 instruction dataset and the 300K RoboVQA dataset is used during the co-training stage. For the dataset used in the robot manipulation fine-tuning stage, we follow the data collection process of previous works [61, 15], adopting the SAPIEN engine [28] to set up an interactive simulation environment with articulated objects from PartNet-Mobility [58]. |
| Dataset Splits | Yes | For the training set, we collect 10K images across 20 tasks. For evaluation, we generate 1.1K examples for the test set, comprising 20 training (seen) and 10 testing (unseen) tasks. The unseen tasks are used to evaluate the generalization capability of our model. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" and discusses "pre-trained CLIP/SigLIP ViT-Large" and "Mamba" models, but does not specify version numbers for Python, PyTorch, or other key software libraries used for implementation. |
| Experiment Setup | Yes | During the alignment pre-training and instruction co-training, we conduct training for 1 epoch and 2 epochs, respectively. We utilize the AdamW optimizer with (β1, β2) = (0.9, 0.999) and a learning rate (LR) of 4e-5. The precision of floating-point calculations is set to 16-bit. For manipulation fine-tuning, we train the model for 8 epochs, setting the LR to 1e-5 and applying a weight decay of 0.1. The floating-point precision is set to 32-bit. (A minimal optimizer sketch based on these values follows the table.) |
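
The reported optimizer settings map directly onto standard PyTorch components. Below is a minimal sketch of that configuration, assuming a PyTorch training setup; the `model` placeholder, its dimensions, and the omission of a full training loop are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the reported optimization settings (not the authors' code).
import torch

model = torch.nn.Linear(1024, 1024)  # placeholder for the RoboMamba backbone

# Alignment pre-training (1 epoch) and instruction co-training (2 epochs):
# AdamW, (beta1, beta2) = (0.9, 0.999), LR 4e-5, 16-bit floating point.
cotrain_optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, betas=(0.9, 0.999)
)
scaler = torch.cuda.amp.GradScaler()  # gradient scaling for fp16 mixed precision

# Manipulation fine-tuning (8 epochs): LR 1e-5, weight decay 0.1, 32-bit precision.
finetune_optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-5, betas=(0.9, 0.999), weight_decay=0.1
)
```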