CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum

Authors: Shuang Ao, Tianyi Zhou, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency.
Researcher Affiliation | Collaboration | University of Technology Sydney; University of Washington, Seattle; University of Maryland, College Park; CSIRO's Data61, Australia
Pseudocode | Yes | Algorithm 1 Top-Down Construction of Sub-Task Tree, Algorithm 2 Bottom-Up Traversal of Sub-Task Tree, Algorithm 3 CO-PILOT (a hedged sketch of this structure follows the table).
Open Source Code | Yes | Our code is available at https://github.com/Shuang-AO/CO-PILOT.
Open Datasets | No | The paper describes generating training samples within simulated environments (Maze, Mujoco Ant-v1, Bipedal Walker) rather than using a pre-existing publicly available dataset with specific access information.
Dataset Splits | No | The paper specifies training and test splits, such as "300 pairs of (s_0, g) for training and 100 pairs for test", but does not explicitly mention a separate validation set or a three-way split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software like SAC, PPO, and the Adam optimizer, but does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | In CO-PILOT, we initialize the dataset D_τ with 50,000 tuples of (g, g′, τ_{g,g′}), with τ_{g,g′} being the Euclidean distance. We set a reward of 1 (1000, 200) to each task (s_0, g) in Maze (Mujoco, Bipedal Walker)... For planning policy training, we apply PPO with a trust region of ϵ = 0.2 and use the Adam optimizer [30] with a learning rate of 0.005. For RL training with SAC, we use its default hyperparameters... We set τ_max = 25, τ_max = 200 and τ_max = 2000 for Maze, Mujoco and Bipedal Walker respectively.
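
The Experiment Setup row quotes concrete hyperparameter values. As a reading aid, here is a minimal Python sketch that collects those quoted values in one place; the dictionary layout and names (PLANNER_PPO, ENV_CONFIG, the tau_max and task_reward fields) are illustrative assumptions, not the authors' actual configuration code.

```python
# Hedged sketch: hyperparameters quoted in the Experiment Setup row,
# organized per environment. Field names are illustrative assumptions.

# Planner-side optimization settings reported in the quoted setup.
PLANNER_PPO = {
    "clip_ratio": 0.2,      # PPO trust-region (clipping) parameter epsilon
    "optimizer": "Adam",    # Adam optimizer, reference [30] in the paper
    "learning_rate": 5e-3,  # reported learning rate 0.005
}

# D_tau is initialized with 50,000 (g, g', tau_{g,g'}) tuples,
# where tau_{g,g'} starts as the Euclidean distance between goals.
D_TAU_INIT_SIZE = 50_000

# Per-environment settings: task reward and maximum budget tau_max,
# as quoted for Maze, Mujoco (Ant-v1) and Bipedal Walker.
ENV_CONFIG = {
    "Maze":          {"task_reward": 1,    "tau_max": 25},
    "Ant-v1":        {"task_reward": 1000, "tau_max": 200},
    "BipedalWalker": {"task_reward": 200,  "tau_max": 2000},
}

# SAC is reported to run with its default hyperparameters,
# so no SAC-specific overrides are listed here.
```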
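
The Pseudocode row lists Algorithm 1 (Top-Down Construction of Sub-Task Tree), Algorithm 2 (Bottom-Up Traversal of Sub-Task Tree) and Algorithm 3 (CO-PILOT). The sketch below is only an interpretation inferred from those titles and the PPO/SAC roles quoted above: it assumes a planner recursively splits a task (s_0, g) into a binary tree of sub-tasks and an RL policy attempts the leaf sub-tasks bottom-up. Function names such as propose_subgoal and attempt_subtask are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List

State = object  # placeholder for a state/goal representation


@dataclass
class SubTaskNode:
    """A node covering the sub-task (start -> goal)."""
    start: State
    goal: State
    children: List["SubTaskNode"] = field(default_factory=list)


def build_subtask_tree(start: State, goal: State,
                       propose_subgoal: Callable[[State, State], State],
                       depth: int) -> SubTaskNode:
    """Top-down construction (cf. the title of Algorithm 1): recursively
    split (start, goal) by letting the planner propose an intermediate
    sub-goal, yielding a binary tree of sub-tasks."""
    node = SubTaskNode(start, goal)
    if depth > 0:
        mid = propose_subgoal(start, goal)  # hypothetical planner call
        node.children = [
            build_subtask_tree(start, mid, propose_subgoal, depth - 1),
            build_subtask_tree(mid, goal, propose_subgoal, depth - 1),
        ]
    return node


def traverse_bottom_up(node: SubTaskNode,
                       attempt_subtask: Callable[[State, State], float]) -> float:
    """Bottom-up traversal (cf. the title of Algorithm 2): execute leaf
    sub-tasks with the RL policy and aggregate their costs up the tree."""
    if not node.children:
        return attempt_subtask(node.start, node.goal)  # hypothetical RL rollout
    return sum(traverse_bottom_up(child, attempt_subtask)
               for child in node.children)
```

Under this reading, Algorithm 3 (CO-PILOT) would alternate between building a tree with the planner, letting the RL policy work through its leaves, and feeding the resulting costs back to both policies; the actual update rules are given in the paper and are not reproduced here.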