Any2Policy: Learning Visuomotor Policy with Any-Modality

Authors: Yichen Zhu, Zhicai Ou, Feifei Feng, Jian Tang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted extensive validation of our proposed unified-modality embodied agent using several simulation benchmarks, including Franka Kitchen and ManiSkill2, as well as in our real-world settings. Our experiments showcase the promising capability of building embodied agents that can adapt to diverse multi-modal inputs in a unified framework.
Researcher Affiliation | Industry | Yichen Zhu, Zhicai Ou, Feifei Feng, Jian Tang (Midea Group)
Pseudocode | No | The paper contains architectural diagrams (Figure 1, Figure 2) but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured code-like procedures.
Open Source Code | Yes | Our project is at any2policy.github.io/. The data will be posted on the webpage.
Open Datasets | Yes | In support of this project, we are releasing a substantial real-world dataset consisting of 30 tasks, where each task includes 30 trajectories, all annotated with multi-modal instructions and observations, mirroring the setup used in our experiments. The purpose of this dataset is to foster and encourage future research in the area of multi-modal embodied agents. Our dataset, RoboAny, stands out as the first to support a comprehensive range of modalities in robotics. Specifically, Franka Kitchen [92] uses text-image and ManiSkill2 [94] uses text-image and text-{image, point cloud}. The data will be posted on the webpage.
Dataset Splits | Yes | The dataset is divided into training, validation, and testing subsets with a 7/1/2 split (see the split sketch after the table).
Hardware Specification | Yes | All models are trained on A100 GPUs, implemented in PyTorch [111]. We report the compute resources.
Software Dependencies | No | The paper states only that all models are trained on A100 GPUs and implemented in PyTorch [111]; no library versions or a full dependency list are provided.
Experiment Setup | Yes | We use an initial learning rate of 3e-5 with the AdamW [107] optimizer, a weight decay of 1e-6, and a linearly decaying learning rate scheduler with a warm-up covering the initial 2% of the total training time [108]. We apply gradient clipping of 1.0. The Franka Kitchen models are trained for 40K steps. We use a weight decay of 1e-6 and a cosine learning rate scheduler with warm-up over 2% of total steps; gradient clipping of 1.0 is also applied. We use the Adam optimizer with initial learning rates of 1e-3 and 3e-4 for Franka Kitchen and ManiSkill2, respectively (see the training-loop sketch after the table).
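
The 7/1/2 split reported in the Dataset Splits row is straightforward to mirror in code. Below is a minimal sketch, assuming the released trajectories can be enumerated as a flat list per task; the function name split_7_1_2 and the per-task list of 30 trajectories are illustrative, not taken from the paper's released code.

import random

def split_7_1_2(trajectories, seed=0):
    """Shuffle a list of trajectories and split it 70/10/20."""
    items = list(trajectories)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Example: 30 trajectories per task, as in the released dataset.
train, val, test = split_7_1_2(range(30))
print(len(train), len(val), len(test))  # 21 3 6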
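
The Experiment Setup row quotes two scheduler descriptions (linear decay and cosine); the sketch below instantiates only the first configuration in PyTorch: AdamW at learning rate 3e-5, weight decay 1e-6, linear warm-up over the initial 2% of steps followed by linear decay, and gradient clipping at 1.0. The policy network, batch, and loss are placeholders, since the paper's actual training code is not reproduced in this review.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

policy = torch.nn.Linear(512, 7)   # stand-in for the actual policy network
total_steps = 40_000               # Franka Kitchen is trained for 40K steps
warmup_steps = int(0.02 * total_steps)

optimizer = AdamW(policy.parameters(), lr=3e-5, weight_decay=1e-6)

def lr_lambda(step: int) -> float:
    # Linear warm-up over the first 2% of steps, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = policy(torch.randn(8, 512)).pow(2).mean()  # dummy batch and loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping of 1.0, as quoted in the Experiment Setup row.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()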