Composing Task-Agnostic Policies with Deep Reinforcement Learning

Authors: Ahmed H. Qureshi, Jacob J. Johnson, Yuzhe Qin, Taylor Henderson, Byron Boots, Michael C. Yip

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method in difficult cases where training policy through standard reinforcement learning (RL) or even hierarchical RL is either not feasible or exhibits high sample complexity. We show that our method not only transfers skills to new problem settings but also solves the challenging environments requiring both task planning and motion control with high data efficiency.
Researcher Affiliation | Academia | Ahmed H. Qureshi, UC San Diego, a1qureshi@ucsd.edu; Jacob J. Johnson, UC San Diego, jjj025@eng.ucsd.edu; Yuzhe Qin, UC San Diego, y1qin@ucsd.edu; Taylor Henderson, UC San Diego, tjwest@ucsd.edu; Byron Boots, University of Washington, bboots@cs.washington.edu; Michael C. Yip, UC San Diego, yip@ucsd.edu
Pseudocode | Yes | Algorithm 1: Composition model training using SAC [...] Algorithm 2: Composition model training using HIRO (an illustrative sketch of such a composition model follows the table).
Open Source Code | Yes | Supplementary material and videos are available at https://sites.google.com/view/compositional-rl
Open Datasets | No | The paper describes simulation environments (e.g., Ant, Halfcheetah, Pusher) where agents learn through interaction. It details how goals are sampled during training and how the agent is evaluated during testing for specific tasks (e.g., Ant Maze: 'During training, the goal is uniformly sampled from [-4, 20] × [-4, 20] space, and the Ant initial location is always fixed at (0, 0). During testing, the agent is evaluated to reach the farthest end of the maze located at (0, 19) within L2 distance of 5.'). However, it does not explicitly state the use of a pre-existing, publicly available *dataset* with a concrete access link, DOI, or formal citation. (The quoted goal protocol is sketched in code below the table.)
Dataset Splits | No | The paper describes training and testing procedures in simulation environments but does not explicitly mention distinct 'validation' splits, percentages, or sample counts, nor does it refer to standard validation set partitions from established benchmarks.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions algorithms and frameworks such as SAC, TRPO, PPO, HIRO, TD3, and MuJoCo but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | The implementation details of all presented methods and environment settings are provided in Appendix C of supplementary material. [...] Table 2 summarizes the hyperparameters used to train policies with SAC (Haarnoja et al., 2018b), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and HIRO (Nachum et al., 2018). [...] Table 3 summarizes the network architectures. (A placeholder configuration template follows below.)
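
The Pseudocode row refers to Algorithms 1 and 2, which train a composition model on top of pre-trained primitives with SAC or HIRO. As a rough illustration only, the sketch below shows one way such a composition network could be wired up, assuming it outputs softmax weights over frozen primitive policies and that the composite action is the weighted sum of their actions; the names CompositionNet, composite_action, and primitives are hypothetical, and the combination rule is an assumption rather than the paper's exact formulation.

```python
# Hypothetical sketch of a composition model over frozen, task-agnostic
# primitive policies. The weighted-mean action combination is an assumption;
# see the paper's Algorithms 1-2 for the actual training procedure.
import torch
import torch.nn as nn


class CompositionNet(nn.Module):
    """Maps a state to softmax weights over K frozen primitive policies."""

    def __init__(self, state_dim: int, num_primitives: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_primitives),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # (batch, K) convex weights over the primitives.
        return torch.softmax(self.net(state), dim=-1)


def composite_action(comp, primitives, state):
    """Combine frozen primitives' actions using the learned state-dependent weights."""
    with torch.no_grad():  # primitives stay frozen; only the composition model is trained
        prim_actions = torch.stack([p(state) for p in primitives], dim=1)  # (B, K, A)
    weights = comp(state).unsqueeze(-1)                                    # (B, K, 1)
    return (weights * prim_actions).sum(dim=1)                             # (B, A)
```

The composition parameters would then be updated with an off-the-shelf SAC or HIRO objective while the primitives remain fixed.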
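
The Ant Maze protocol quoted in the Open Datasets row translates directly into a few lines of code. Only the sampling range, the start and goal positions, and the L2 threshold come from the quoted text; the helper functions themselves are hypothetical.

```python
# Minimal sketch of the quoted Ant Maze goal protocol. The numbers come from
# the paper's description; the helpers are illustrative, not the authors' code.
import numpy as np


def sample_training_goal(rng: np.random.Generator) -> np.ndarray:
    # Training goals are sampled uniformly from the [-4, 20] x [-4, 20] square;
    # the Ant always starts at (0, 0).
    return rng.uniform(low=-4.0, high=20.0, size=2)


def is_success(ant_xy, goal_xy=(0.0, 19.0), tol=5.0) -> bool:
    # At test time the agent must reach (0, 19) within an L2 distance of 5.
    return float(np.linalg.norm(np.asarray(ant_xy) - np.asarray(goal_xy))) <= tol
```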
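
Because the concrete training settings live in the supplementary Tables 2 and 3, a reproducer would have to transcribe them into a configuration like the template below. The values shown are the standard SAC defaults from Haarnoja et al. (2018), not settings confirmed by this paper, and the key names are placeholders.

```python
# Hypothetical configuration template for the settings described by Tables 2-3.
# Values are the usual SAC defaults (Haarnoja et al., 2018), NOT values confirmed
# by this paper; replace them with the numbers from the supplementary material.
SAC_CONFIG = {
    "learning_rate": 3e-4,
    "discount_gamma": 0.99,
    "target_smoothing_tau": 0.005,
    "replay_buffer_size": int(1e6),
    "batch_size": 256,
    "policy_hidden_layers": (256, 256),  # network architecture placeholder (Table 3)
}
```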