Composing Task-Agnostic Policies with Deep Reinforcement Learning

Authors: Ahmed H. Qureshi, Jacob J. Johnson, Yuzhe Qin, Taylor Henderson, Byron Boots, Michael C. Yip

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method in difficult cases where training policy through standard reinforcement learning (RL) or even hierarchical RL is either not feasible or exhibits high sample complexity. We show that our method not only transfers skills to new problem settings but also solves the challenging environments requiring both task planning and motion control with high data efficiency.
Researcher Affiliation | Academia | Ahmed H. Qureshi, UC San Diego, a1qureshi@ucsd.edu; Jacob J. Johnson, UC San Diego, jjj025@eng.ucsd.edu; Yuzhe Qin, UC San Diego, y1qin@ucsd.edu; Taylor Henderson, UC San Diego, tjwest@ucsd.edu; Byron Boots, University of Washington, bboots@cs.washington.edu; Michael C. Yip, UC San Diego, yip@ucsd.edu
Pseudocode | Yes | Algorithm 1: Composition model training using SAC [...] Algorithm 2: Composition model training using HIRO (an illustrative sketch of such a composition model follows the table).
Open Source Code | Yes | Supplementary material and videos are available at https://sites.google.com/view/compositional-rl
Open Datasets | No | The paper describes simulation environments (e.g., Ant, Halfcheetah, Pusher) where agents learn through interaction. It details how goals are sampled during training and how the agent is evaluated during testing for specific tasks (e.g., Ant Maze: 'During training, the goal is uniformly sampled from [-4, 20] × [-4, 20] space, and the Ant initial location is always fixed at (0, 0). During testing, the agent is evaluated to reach the farthest end of the maze located at (0, 19) within L2 distance of 5.'). However, it does not explicitly state the use of a pre-existing, publicly available *dataset* with a concrete access link, DOI, or formal citation. (The quoted goal protocol is sketched in code below the table.)
Dataset Splits | No | The paper describes training and testing procedures in simulation environments but does not explicitly mention distinct 'validation' splits, percentages, or sample counts, nor does it refer to standard validation set partitions from established benchmarks.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions algorithms and frameworks such as SAC, TRPO, PPO, HIRO, TD3, and MuJoCo but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | The implementation details of all presented methods and environment settings are provided in Appendix C of supplementary material. [...] Table 2 summarizes the hyperparameters used to train policies with SAC (Haarnoja et al., 2018b), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and HIRO (Nachum et al., 2018). [...] Table 3 summarizes the network architectures. (A placeholder configuration template follows below.)
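
The Pseudocode row refers to Algorithms 1 and 2, which train a composition model on top of pre-trained primitives with SAC or HIRO. As a rough illustration only, the sketch below shows one way such a composition network could be wired up, assuming it outputs softmax weights over frozen primitive policies and that the composite action is the weighted sum of their actions; the names CompositionNet, composite_action, and primitives are hypothetical, and the combination rule is an assumption rather than the paper's exact formulation.

```python
# Hypothetical sketch of a composition model over frozen, task-agnostic
# primitive policies. The weighted-mean action combination is an assumption;
# see the paper's Algorithms 1-2 for the actual training procedure.
import torch
import torch.nn as nn


class CompositionNet(nn.Module):
    """Maps a state to softmax weights over K frozen primitive policies."""

    def __init__(self, state_dim: int, num_primitives: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_primitives),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # (batch, K) convex weights over the primitives.
        return torch.softmax(self.net(state), dim=-1)


def composite_action(comp, primitives, state):
    """Combine frozen primitives' actions using the learned state-dependent weights."""
    with torch.no_grad():  # primitives stay frozen; only the composition model is trained
        prim_actions = torch.stack([p(state) for p in primitives], dim=1)  # (B, K, A)
    weights = comp(state).unsqueeze(-1)                                    # (B, K, 1)
    return (weights * prim_actions).sum(dim=1)                             # (B, A)
```

The composition parameters would then be updated with an off-the-shelf SAC or HIRO objective while the primitives remain fixed.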
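
The Ant Maze protocol quoted in the Open Datasets row translates directly into a few lines of code. Only the sampling range, the start and goal positions, and the L2 threshold come from the quoted text; the helper functions themselves are hypothetical.

```python
# Minimal sketch of the quoted Ant Maze goal protocol. The numbers come from
# the paper's description; the helpers are illustrative, not the authors' code.
import numpy as np


def sample_training_goal(rng: np.random.Generator) -> np.ndarray:
    # Training goals are sampled uniformly from the [-4, 20] x [-4, 20] square;
    # the Ant always starts at (0, 0).
    return rng.uniform(low=-4.0, high=20.0, size=2)


def is_success(ant_xy, goal_xy=(0.0, 19.0), tol=5.0) -> bool:
    # At test time the agent must reach (0, 19) within an L2 distance of 5.
    return float(np.linalg.norm(np.asarray(ant_xy) - np.asarray(goal_xy))) <= tol
```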
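
Because the concrete training settings live in the supplementary Tables 2 and 3, a reproducer would have to transcribe them into a configuration like the template below. The values shown are the standard SAC defaults from Haarnoja et al. (2018), not settings confirmed by this paper, and the key names are placeholders.

```python
# Hypothetical configuration template for the settings described by Tables 2-3.
# Values are the usual SAC defaults (Haarnoja et al., 2018), NOT values confirmed
# by this paper; replace them with the numbers from the supplementary material.
SAC_CONFIG = {
    "learning_rate": 3e-4,
    "discount_gamma": 0.99,
    "target_smoothing_tau": 0.005,
    "replay_buffer_size": int(1e6),
    "batch_size": 256,
    "policy_hidden_layers": (256, 256),  # network architecture placeholder (Table 3)
}
```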