CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum
Authors: Shuang Ao, Tianyi Zhou, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency. |
| Researcher Affiliation | Collaboration | University of Technology Sydney; University of Washington, Seattle; University of Maryland, College Park; CSIRO's Data61, Australia |
| Pseudocode | Yes | Algorithm 1 Top-Down Construction of Sub-Task Tree, Algorithm 2 Bottom-Up Traversal of Sub-Task Tree, Algorithm 3 CO-PILOT |
| Open Source Code | Yes | Our code is available at https://github.com/Shuang-AO/CO-PILOT. |
| Open Datasets | No | The paper describes generating training samples within simulated environments (Maze, Mujoco Ant-v1, Bipedal Walker) rather than using a pre-existing publicly available dataset with specific access information. |
| Dataset Splits | No | The paper specifies training and test splits, such as "300 pairs of (s_0, g) for training and 100 pairs for test", but does not explicitly mention a separate validation set or a three-way split. (See the sampling sketch after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like SAC, PPO, and Adam optimizer, but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | In CO-PILOT, we initialize the dataset D_τ with 50,000 tuples of (g, g', τ_{g,g'}), with τ_{g,g'} being the Euclidean distance. We set a reward of 1 (1000, 200) to each task (s_0, g) in Maze (Mujoco, Bipedal Walker)... For planning policy training, we apply PPO with a trust region of ε = 0.2 and use the Adam optimizer [30] with a learning rate of 0.005. For RL training with SAC, we use its default hyperparameters... We set τ_max = 25, τ_max = 200 and τ_max = 2000 for Maze, Mujoco and Bipedal Walker respectively. (See the configuration sketch after this table.) |
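
The task splits noted above are generated inside the simulators rather than loaded from a public dataset. The following minimal sketch illustrates one way such start-goal pairs could be sampled and split 300/100; the 2D bounds, the uniform sampler, and the function name `sample_task_pairs` are assumptions made for illustration, not code from the authors' repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task_pairs(n_pairs, low=0.0, high=10.0, dim=2):
    """Sample (s_0, g) start/goal pairs uniformly inside a square region.

    The bounds and dimensionality here are illustrative assumptions; the
    paper generates tasks inside its Maze / Mujoco / Bipedal Walker
    simulators rather than using a pre-existing dataset.
    """
    starts = rng.uniform(low, high, size=(n_pairs, dim))
    goals = rng.uniform(low, high, size=(n_pairs, dim))
    return list(zip(starts, goals))

# The paper reports 300 training pairs and 100 test pairs; it does not
# mention a separate validation split.
train_pairs = sample_task_pairs(300)
test_pairs = sample_task_pairs(100)
```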
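
To make the reported setup easier to scan, here is a minimal configuration sketch assuming only the hyperparameters quoted in the Experiment Setup row; the dictionary keys and the `init_tau_dataset` helper are hypothetical names, not identifiers from the released code at https://github.com/Shuang-AO/CO-PILOT.

```python
import numpy as np

# Hedged summary of the hyperparameters reported in the paper; the key
# names below are assumptions, not fields from the authors' code.
CONFIG = {
    "planning_policy": {"algo": "PPO", "clip_eps": 0.2,
                        "optimizer": "Adam", "lr": 0.005},
    "rl_policy": {"algo": "SAC", "hyperparams": "library defaults"},
    "task_reward": {"Maze": 1, "Mujoco": 1000, "BipedalWalker": 200},
    "tau_max": {"Maze": 25, "Mujoco": 200, "BipedalWalker": 2000},
}

def init_tau_dataset(sample_goal, n_tuples=50_000):
    """Initialize D_tau with (g, g', tau_{g,g'}) tuples, where tau_{g,g'}
    is taken to be the Euclidean distance between the two goals, as the
    paper describes. `sample_goal` is a hypothetical goal sampler."""
    data = []
    for _ in range(n_tuples):
        g = np.asarray(sample_goal())
        g_prime = np.asarray(sample_goal())
        data.append((g, g_prime, float(np.linalg.norm(g - g_prime))))
    return data
```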