Composing Task-Agnostic Policies with Deep Reinforcement Learning
Authors: Ahmed H. Qureshi, Jacob J. Johnson, Yuzhe Qin, Taylor Henderson, Byron Boots, Michael C. Yip
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in difficult cases where training policy through standard reinforcement learning (RL) or even hierarchical RL is either not feasible or exhibits high sample complexity. We show that our method not only transfers skills to new problem settings but also solves the challenging environments requiring both task planning and motion control with high data efficiency. |
| Researcher Affiliation | Academia | Ahmed H. Qureshi, UC San Diego, a1qureshi@ucsd.edu; Jacob J. Johnson, UC San Diego, jjj025@eng.ucsd.edu; Yuzhe Qin, UC San Diego, y1qin@ucsd.edu; Taylor Henderson, UC San Diego, tjwest@ucsd.edu; Byron Boots, University of Washington, bboots@cs.washington.edu; Michael C. Yip, UC San Diego, yip@ucsd.edu |
| Pseudocode | Yes | Algorithm 1: Composition model training using SAC [...] Algorithm 2: Composition model training using HIRO (a hedged sketch of an SAC-style composition update is given after the table) |
| Open Source Code | Yes | Supplementary material and videos are available at https://sites.google.com/view/compositional-rl |
| Open Datasets | No | The paper describes simulation environments (e.g., Ant, Halfcheetah, Pusher) in which agents learn through interaction. It details how goals are sampled during training and how the agent is evaluated during testing for specific tasks (e.g., Ant Maze: 'During training, the goal is uniformly sampled from the [-4, 20] × [-4, 20] space, and the Ant initial location is always fixed at (0, 0). During testing, the agent is evaluated to reach the farthest end of the maze located at (0, 19) within L2 distance of 5.'). However, it does not explicitly state the use of a pre-existing, publicly available *dataset* with a concrete access link, DOI, or formal citation. (A small sketch of this goal-sampling and evaluation protocol appears after the table.) |
| Dataset Splits | No | The paper describes training and testing procedures in simulation environments but does not explicitly mention distinct 'validation' splits, percentages, or sample counts, nor does it refer to standard validation set partitions from established benchmarks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions algorithms and frameworks like SAC, TRPO, PPO, HIRO, TD3, and Mujoco but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The implementation details of all presented methods and environment settings are provided in Appendix C of supplementary material. [...] Table 2 summarizes the hyperparameters used to train policies with SAC (Haarnoja et al., 2018b), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and HIRO (Nachum et al., 2018). [...] Table 3 summarizes the network architectures. |
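
The pseudocode row above names "Composition model training using SAC", but the algorithm itself is not reproduced in this report. Below is a minimal, hedged sketch of what an SAC-style composition update could look like, assuming the composition network outputs softmax mixture weights over a set of frozen, pretrained primitive policies and only the mixture is optimized. All class and function names (`CompositionNet`, `composite_action`, `sac_style_actor_loss`) are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: a composition network that mixes actions from frozen primitive
# policies and is trained with an SAC-flavoured actor objective. Shapes and
# names are illustrative assumptions, not the paper's Algorithm 1.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositionNet(nn.Module):
    """Maps the state to mixture weights over K pretrained primitives."""

    def __init__(self, state_dim, num_primitives, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_primitives),
        )

    def forward(self, state):
        return F.softmax(self.net(state), dim=-1)  # (batch, K) mixture weights


def composite_action(weights, primitive_actions):
    """Weighted combination: weights (batch, K) x actions (batch, K, act_dim)."""
    return (weights.unsqueeze(-1) * primitive_actions).sum(dim=1)


def sac_style_actor_loss(critic, comp_net, state, primitive_actions, alpha=0.2):
    """One SAC-flavoured actor update for the composition network.

    The primitives are frozen; only the mixture weights are optimized so that
    the composed action maximizes the soft Q-value. The entropy bonus on the
    categorical weights stands in for SAC's action-entropy term -- an
    assumption of this sketch, not the paper's exact objective.
    """
    weights = comp_net(state)
    action = composite_action(weights, primitive_actions)
    entropy = -(weights * torch.log(weights + 1e-8)).sum(dim=-1)
    q_value = critic(state, action)
    return (-q_value - alpha * entropy).mean()
```

The paper's actual composition rule and training objective should be taken from Algorithm 1, Algorithm 2, and Appendix C of the supplementary material; this sketch only illustrates the general shape of such an update.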
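The Ant Maze protocol quoted in the "Open Datasets" row can also be written down directly. The numeric values below (the [-4, 20] × [-4, 20] training range, the fixed start at (0, 0), the test goal at (0, 19), and the success radius of 5) come from the quoted text; everything else, including the function names, is an illustrative assumption rather than the paper's code.

```python
import numpy as np

# Hedged sketch of the Ant Maze goal protocol described above. Only the
# numeric values are taken from the quoted text; the helper functions are
# hypothetical stand-ins, not an API the paper specifies.
TRAIN_GOAL_LOW, TRAIN_GOAL_HIGH = -4.0, 20.0
START_POSITION = np.array([0.0, 0.0])   # Ant initial location is always fixed
TEST_GOAL = np.array([0.0, 19.0])       # farthest end of the maze
SUCCESS_RADIUS = 5.0                    # L2 distance threshold at test time


def sample_training_goal(rng):
    """Uniformly sample a training goal from the [-4, 20] x [-4, 20] square."""
    return rng.uniform(TRAIN_GOAL_LOW, TRAIN_GOAL_HIGH, size=2)


def is_test_success(ant_xy):
    """Test-time success: within L2 distance 5 of the goal at (0, 19)."""
    return np.linalg.norm(np.asarray(ant_xy) - TEST_GOAL) <= SUCCESS_RADIUS


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print("example training goal:", sample_training_goal(rng))
    print("success at (1, 16)?", is_test_success((1.0, 16.0)))
```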