Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum
Authors: Shuang Ao, Tianyi Zhou, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency. |
| Researcher Affiliation | Collaboration | University of Technology Sydney (1); University of Washington, Seattle (2); University of Maryland, College Park (3); CSIRO's Data61, Australia (4) |
| Pseudocode | Yes | Algorithm 1 Top-Down Construction of Sub-Task Tree, Algorithm 2 Bottom-Up Traversal of Sub-Task Tree, Algorithm 3 CO-PILOT |
| Open Source Code | Yes | Our code is available at https://github.com/Shuang-AO/CO-PILOT. |
| Open Datasets | No | The paper describes generating training samples within simulated environments (Maze, Mujoco Ant-v1, Bipedal Walker) rather than using a pre-existing publicly available dataset with specific access information. |
| Dataset Splits | No | The paper specifies training and test splits, such as "300 pairs of (s0, g) for training and 100 pairs for test", but does not explicitly mention a separate validation set or a three-way split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like SAC, PPO, and Adam optimizer, but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | In CO-PILOT, we initialize the dataset Dτ with 50,000 tuples of (g, g′, τ_{g,g′}) with τ_{g,g′} being the Euclidean distance. We set a reward of 1 (1000, 200) to each task (s0, g) in Maze (Mujoco, Bipedal Walker)... For planning policy training, we apply PPO with a trust region of ϵ = 0.2 and use Adam optimizer [30] with a learning rate of 0.005. For RL training with SAC, we use its default hyperparameters... We set τmax = 25, τmax = 200 and τmax = 2000 for Maze, Mujoco and Bipedal Walker respectively. |
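
The experiment setup quoted in the last row can be collected into a small configuration sketch. This is only an illustration and is not taken from the authors' repository. Every identifier (`EnvSetup`, `ENV_SETUPS`, `init_distance_dataset`, and the constants) is hypothetical, and the uniform sampling of goals in a unit square is an assumption made so the example runs. Only the numeric values (50,000 tuples, rewards of 1/1000/200, ϵ = 0.2, learning rate 0.005, τmax of 25/200/2000, and 300/100 train/test pairs) come from the table above.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvSetup:
    """Per-environment values quoted in the table (class name is hypothetical)."""
    task_reward: float  # reward assigned to each task (s0, g)
    tau_max: int        # tau_max reported for the environment


# Values as reported in the "Experiment Setup" row.
ENV_SETUPS = {
    "Maze": EnvSetup(task_reward=1, tau_max=25),
    "Mujoco": EnvSetup(task_reward=1000, tau_max=200),
    "Bipedal Walker": EnvSetup(task_reward=200, tau_max=2000),
}

PPO_CLIP_EPSILON = 0.2      # PPO trust region epsilon
ADAM_LEARNING_RATE = 0.005  # Adam learning rate for planning policy training
NUM_INIT_TUPLES = 50_000    # initial size of the dataset D_tau
NUM_TRAIN_PAIRS = 300       # (s0, g) pairs used for training
NUM_TEST_PAIRS = 100        # (s0, g) pairs used for testing


def init_distance_dataset(num_tuples, dim=2, low=0.0, high=1.0, seed=0):
    """Build tuples (g, g', tau) with tau set to the Euclidean distance.

    ASSUMPTION: goals are drawn uniformly from [low, high]^dim. The source
    does not say how the goal pairs are sampled.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_tuples):
        g = tuple(rng.uniform(low, high) for _ in range(dim))
        g_prime = tuple(rng.uniform(low, high) for _ in range(dim))
        dataset.append((g, g_prime, math.dist(g, g_prime)))
    return dataset


if __name__ == "__main__":
    d_tau = init_distance_dataset(NUM_INIT_TUPLES)
    print(len(d_tau), ENV_SETUPS["Maze"])
```

Grouping the per-environment values in one mapping keeps the "respectively" ordering of the source explicit, which is easy to misread in the flattened sentence.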