RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation
Authors: Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, Shanghang Zhang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models. Our project web page: https://sites.google.com/view/robomamba-web |
| Researcher Affiliation | Collaboration | Jiaming Liu1, Mengzhen Liu1, Zhenyu Wang1, Pengju An1, Xiaoqi Li1, Kaichen Zhou1, Senqiao Yang1, Renrui Zhang, Yandong Guo2, Shanghang Zhang1,3. 1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2AI2Robotics; 3Beijing Academy of Artificial Intelligence (BAAI) |
| Pseudocode | No | The paper contains architectural diagrams (Figure 2) and detailed descriptions of the training pipeline, but no explicit pseudocode or algorithm blocks labeled as such. |
| Open Source Code | No | We will open source the code as soon as it is ready. |
| Open Datasets | Yes | In the alignment pre-training stage, we utilize the LLaVA-LCS 558K dataset [67], which is a curated subset of the LAION-CC-SBU dataset, supplemented with captions. During the instruction co-training stage, we combine general instruction datasets with the robotic instruction datasets. Specifically, for the general instruction dataset, we selectively adopt the LLaVA mixed instruction dataset [4], the ShareGPT4V-SFT dataset [68], or the LLaVA-Next dataset [69]. For the robotic instruction dataset, we randomly sample some image-text paired training samples from the RoboVQA [27] dataset. In our main experiments, a mixture of the LLaVA 1.5 instruction dataset and the 300K RoboVQA dataset is used during the co-training stage. For the dataset used in the robot manipulation fine-tuning stage, we follow the data collection process of previous works [61, 15], adopting the SAPIEN engine [28] to set up an interactive simulation environment with articulated objects from PartNet-Mobility [58]. |
| Dataset Splits | Yes | For the training set, we collect 10K images across 20 tasks. For evaluation, we generate 1.1K examples for the test set, comprising 20 training (seen) and 10 testing (unseen) tasks. The unseen tasks are used to evaluate the generalization capability of our model. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" and discusses "pre-trained CLIP/SigLIP ViT-Large" and "Mamba" models, but does not specify version numbers for Python, PyTorch, or other key software libraries used for implementation. |
| Experiment Setup | Yes | During the alignment pre-training and instruction co-training, we conduct training for 1 epoch and 2 epochs, respectively. We utilize the AdamW optimizer with (β1, β2) = (0.9, 0.999) and a learning rate (LR) of 4e-5. The precision of floating-point calculations is set to 16-bit. For manipulation fine-tuning, we train the model for 8 epochs, setting the LR to 1e-5 and applying a weight decay of 0.1. The floating-point precision is set to 32-bit. (A minimal optimizer sketch based on these values follows the table.) |
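
The reported optimizer settings map directly onto standard PyTorch components. Below is a minimal sketch of that configuration, assuming a PyTorch training setup; the `model` placeholder, its dimensions, and the omission of a full training loop are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the reported optimization settings (not the authors' code).
import torch

model = torch.nn.Linear(1024, 1024)  # placeholder for the RoboMamba backbone

# Alignment pre-training (1 epoch) and instruction co-training (2 epochs):
# AdamW, (beta1, beta2) = (0.9, 0.999), LR 4e-5, 16-bit floating point.
cotrain_optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, betas=(0.9, 0.999)
)
scaler = torch.cuda.amp.GradScaler()  # gradient scaling for fp16 mixed precision

# Manipulation fine-tuning (8 epochs): LR 1e-5, weight decay 0.1, 32-bit precision.
finetune_optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-5, betas=(0.9, 0.999), weight_decay=0.1
)
```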