CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum
Authors: Shuang Ao, Tianyi Zhou, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency. |
| Researcher Affiliation | Collaboration | University of Technology Sydney; University of Washington, Seattle; University of Maryland, College Park; CSIRO's Data61, Australia |
| Pseudocode | Yes | Algorithm 1 Top-Down Construction of Sub-Task Tree, Algorithm 2 Bottom-Up Traversal of Sub-Task Tree, Algorithm 3 CO-PILOT |
| Open Source Code | Yes | Our code is available at https://github.com/Shuang-AO/CO-PILOT. |
| Open Datasets | No | The paper describes generating training samples within simulated environments (Maze, Mujoco Ant-v1, Bipedal Walker) rather than using a pre-existing publicly available dataset with specific access information. |
| Dataset Splits | No | The paper specifies training and test splits, such as "300 pairs of (s_0, g) for training and 100 pairs for test", but does not explicitly mention a separate validation set or a three-way split. (See the sampling sketch after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like SAC, PPO, and Adam optimizer, but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | In CO-PILOT, we initialize the dataset D_τ with 50,000 tuples of (g, g', τ_{g,g'}), with τ_{g,g'} being the Euclidean distance. We set a reward of 1 (1000, 200) to each task (s_0, g) in Maze (Mujoco, Bipedal Walker)... For planning policy training, we apply PPO with a trust region of ε = 0.2 and use the Adam optimizer [30] with a learning rate of 0.005. For RL training with SAC, we use its default hyperparameters... We set τ_max = 25, τ_max = 200 and τ_max = 2000 for Maze, Mujoco and Bipedal Walker respectively. (See the configuration sketch after this table.) |
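
The task splits noted above are generated inside the simulators rather than loaded from a public dataset. The following minimal sketch illustrates one way such start-goal pairs could be sampled and split 300/100; the 2D bounds, the uniform sampler, and the function name `sample_task_pairs` are assumptions made for illustration, not code from the authors' repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task_pairs(n_pairs, low=0.0, high=10.0, dim=2):
    """Sample (s_0, g) start/goal pairs uniformly inside a square region.

    The bounds and dimensionality here are illustrative assumptions; the
    paper generates tasks inside its Maze / Mujoco / Bipedal Walker
    simulators rather than using a pre-existing dataset.
    """
    starts = rng.uniform(low, high, size=(n_pairs, dim))
    goals = rng.uniform(low, high, size=(n_pairs, dim))
    return list(zip(starts, goals))

# The paper reports 300 training pairs and 100 test pairs; it does not
# mention a separate validation split.
train_pairs = sample_task_pairs(300)
test_pairs = sample_task_pairs(100)
```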
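
To make the reported setup easier to scan, here is a minimal configuration sketch assuming only the hyperparameters quoted in the Experiment Setup row; the dictionary keys and the `init_tau_dataset` helper are hypothetical names, not identifiers from the released code at https://github.com/Shuang-AO/CO-PILOT.

```python
import numpy as np

# Hedged summary of the hyperparameters reported in the paper; the key
# names below are assumptions, not fields from the authors' code.
CONFIG = {
    "planning_policy": {"algo": "PPO", "clip_eps": 0.2,
                        "optimizer": "Adam", "lr": 0.005},
    "rl_policy": {"algo": "SAC", "hyperparams": "library defaults"},
    "task_reward": {"Maze": 1, "Mujoco": 1000, "BipedalWalker": 200},
    "tau_max": {"Maze": 25, "Mujoco": 200, "BipedalWalker": 2000},
}

def init_tau_dataset(sample_goal, n_tuples=50_000):
    """Initialize D_tau with (g, g', tau_{g,g'}) tuples, where tau_{g,g'}
    is taken to be the Euclidean distance between the two goals, as the
    paper describes. `sample_goal` is a hypothetical goal sampler."""
    data = []
    for _ in range(n_tuples):
        g = np.asarray(sample_goal())
        g_prime = np.asarray(sample_goal())
        data.append((g, g_prime, float(np.linalg.norm(g - g_prime))))
    return data
```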