Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Any2Policy: Learning Visuomotor Policy with Any-Modality
Authors: Yichen Zhu, Zhicai Ou, Feifei Feng, Jian Tang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive validation of our proposed unified modality embodied agent using several simulation benchmarks, including Franka Kitchen and Maniskill2, as well as in our real-world settings. Our experiments showcase the promising capability of building embodied agents that can adapt to diverse multi-modal in a unified framework. |
| Researcher Affiliation | Industry | Yichen Zhu, Zhicai Ou, Feifei Feng, Jian Tang Midea Group |
| Pseudocode | No | The paper contains architectural diagrams (Figure 1, Figure 2) but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured code-like procedures. |
| Open Source Code | Yes | Our project is at any2policy.github.io/. The data will be attached in the webpage. |
| Open Datasets | Yes | In support of this project, we are releasing a substantial real-world dataset consisting of 30 tasks, where each task includes 30 trajectories, all annotated with multi-modal instructions and observations, mirroring the setup used in our experiments. The purpose of this dataset is to foster and encourage future research in the area of multi-modal embodied agents. Our dataset, RoboAny, stands out as the first to support a comprehensive range of modalities in robotics. Specifically, Franka Kitchen [92] uses text-image and ManiSkill2 [94] uses text-image and text-{image, point cloud}. The data will be attached in the webpage. |
| Dataset Splits | Yes | The dataset is divided into training, validation, and testing subsets, with a split of 7/1/2, respectively. |
| Hardware Specification | Yes | All models are trained on A100 GPUs, implemented in PyTorch [111]. We report the computer resources. |
| Software Dependencies | No | All models are trained on A100 GPUs, implemented in PyTorch [111]. |
| Experiment Setup | Yes | We use an initial learning rate of 3e-5 with the AdamW [107] optimizer, a weight decay of 1e-6, and a linearly decaying learning rate scheduler with a warm-up covering the initial 2% of the total training time [108]. We apply a gradient clipping of 1.0. The Franka-Kitchen are trained for 40K steps. We use weight decay of 1e-6, cosine learning rate scheduler with warmup steps of 2% total steps. The gradient clip of 1.0 is also applied. We use Adam optimizer with initial learning of 1e-3 and 3e-4 for Franka Kitchen and Maniskill-2. |
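
The quoted setup and split can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper's code: the function names `lr_at_step` and `split_counts` are hypothetical, the decay-to-zero endpoint is an assumption, and pairing the 3e-5 learning rate with the 40K-step budget is also an assumption, since the quoted text lists more than one optimizer configuration. Applying the 7/1/2 split to the 30 trajectories of a single task is likewise assumed for the sake of the arithmetic.

```python
def lr_at_step(step, total_steps=40_000, peak_lr=3e-5, warmup_frac=0.02):
    """Linear warm-up over the first 2% of steps, then linear decay.

    Assumption: the rate decays to zero at the final step; the quoted
    text only says the schedule is "linearly decaying".
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - progress)


def split_counts(n_items, ratios=(7, 1, 2)):
    """Turn a train/val/test ratio such as 7/1/2 into integer counts.

    Any rounding remainder is assigned to the test subset (an assumption).
    """
    total = sum(ratios)
    n_train = n_items * ratios[0] // total
    n_val = n_items * ratios[1] // total
    n_test = n_items - n_train - n_val
    return n_train, n_val, n_test


if __name__ == "__main__":
    for step in (0, 799, 800, 20_400, 40_000):
        print(step, lr_at_step(step))
    print(split_counts(30))
```

Under these assumptions the warm-up lasts 800 of the 40K steps, and 30 trajectories divide into 21 for training, 3 for validation, and 6 for testing. In a PyTorch training loop, a function like `lr_at_step` would typically be divided by the peak rate and wrapped in `torch.optim.lr_scheduler.LambdaLR`, with gradients clipped via `torch.nn.utils.clip_grad_norm_`.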