Probabilistic Subgoal Representations for Hierarchical Reinforcement Learning

Authors: Vivienne Huiling Wang, Tinghuai Wang, Wenyan Yang, Joni-Kristian Kamarainen, Joni Pajarinen

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments, our approach outperforms state-of-the-art baselines not only in standard benchmarks but also in environments with stochastic elements and under diverse reward conditions. Additionally, our model shows promising capabilities in transferring low-level policies across different tasks. 5. Experiments: We evaluate our method in challenging environments with dense and sparse external rewards, which require a combination of locomotion and object manipulation, to demonstrate the effectiveness and transferability of our learned probabilistic subgoal representations. We compare our method against standard RL and prior HRL methods. We also perform ablation studies to understand the importance of various components.
Researcher Affiliation Collaboration (1) Department of Electrical Engineering and Automation, Aalto University, Finland; (2) Huawei Helsinki Research Center, Finland; (3) Computing Sciences, Tampere University, Finland.
Pseudocode Yes A.1. Algorithm: We provide Algorithm 1 to show the training procedure of HLPS. (A generic two-level training-loop sketch is included after this table.)
Open Source Code Yes Code is available at https://github.com/vi2enne/HLPS
Open Datasets Yes We evaluate our approach on long-horizon continuous control tasks based on the MuJoCo simulator (Todorov et al., 2012), which are widely adopted in the HRL community. These tasks include Ant Maze, Ant Push, Ant Fall, Ant Four Rooms, two robotic arm environments, 7-DOF Reacher and 7-DOF Pusher (Chua et al., 2018), as well as four variants of Maze tasks featuring low-resolution image observations. (A minimal environment-loading sketch follows this table.)
Dataset Splits No The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology), as data is generated dynamically within the simulation environments. It mentions that 'Hierarchical policies are evaluated every 25000 timesteps by averaging over 10 randomly seeded trials' (see the evaluation-protocol sketch after this table).
Hardware Specification Yes All of the experiments were processed using a single GPU (Tesla V100) and 8 CPU cores (Intel Xeon Gold 6278C @ 2.60GHz) with 64 GB RAM.
Software Dependencies No The paper mentions using SAC (Haarnoja et al., 2018) for each level in the HRL structure and Adam optimizer. However, it does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup Yes A.4.1. Training and Evaluation Parameters: learning rate of latent GP 1e-5; latent GP update frequency 100; batch GP scheme time window size 3; subgoal dimension 2; learning rate 0.0002 for actor/critic of both levels; interval of high-level actions k = 50; target network smoothing coefficient 0.005; reward scaling 0.1 for both levels; discount factor γ = 0.99 for both levels; learning rate for encoding layer 0.0001. Hierarchical policies are evaluated every 25000 timesteps by averaging over 10 randomly seeded trials.
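
For reference, the reported hyperparameters above can be gathered into a single configuration dictionary. The key names below are illustrative only and are not taken from the released code; the values are those listed in the appendix excerpt.

```python
# Reported training/evaluation hyperparameters collected into one config.
# Key names are illustrative placeholders; values follow the paper's appendix.
HLPS_CONFIG = {
    "latent_gp_lr": 1e-5,              # learning rate of latent GP
    "latent_gp_update_freq": 100,      # latent GP update frequency
    "batch_gp_time_window": 3,         # batch GP scheme time window size
    "subgoal_dim": 2,                  # subgoal dimension
    "actor_critic_lr": 2e-4,           # actor/critic, both levels
    "high_level_action_interval": 50,  # k
    "target_smoothing_coef": 0.005,
    "reward_scaling": 0.1,             # both levels
    "discount_gamma": 0.99,            # both levels
    "encoding_layer_lr": 1e-4,
    "eval_interval_timesteps": 25_000,
    "eval_num_seeds": 10,
}
```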
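As referenced in the Pseudocode row, the sketch below outlines a generic two-level HRL training loop with agents at both levels (SAC in the reported setup), a fixed high-level action interval k = 50, and a periodically refreshed latent subgoal model. All names (`high_agent`, `low_agent`, `subgoal_model`, and their methods) are hypothetical placeholders; this is not a reproduction of the authors' Algorithm 1.

```python
# Hypothetical sketch of a two-level HRL training loop (not the authors' code).
# Structure follows the reported setup: a new subgoal every k = 50 steps and a
# latent subgoal model refreshed every `subgoal_update_freq` steps.

def train(env, high_agent, low_agent, subgoal_model,
          total_steps=1_000_000, k=50, subgoal_update_freq=100):
    obs, _ = env.reset()
    for step in range(total_steps):
        if step % k == 0:
            # High level picks a new subgoal in the learned latent space.
            high_obs, high_reward = obs, 0.0
            subgoal = high_agent.select_action(obs)

        # Low level acts conditioned on the observation and current subgoal.
        action = low_agent.select_action(obs, subgoal)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        high_reward += reward

        # Intrinsic low-level reward, e.g. negative distance to the subgoal
        # in the latent representation (placeholder).
        intrinsic = -subgoal_model.distance(next_obs, subgoal)
        low_agent.store(obs, subgoal, action, intrinsic, next_obs)

        if (step + 1) % k == 0:
            # High-level transition spans the k-step window.
            high_agent.store(high_obs, subgoal, high_reward, next_obs)

        low_agent.update()
        high_agent.update()
        if step % subgoal_update_freq == 0:
            subgoal_model.update()  # refresh the latent subgoal representation

        obs = env.reset()[0] if (terminated or truncated) else next_obs
```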
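As noted in the Open Datasets row, the tasks are MuJoCo-based. A minimal sketch of loading a MuJoCo locomotion environment via Gymnasium is given below; the Ant Maze, Ant Push, Ant Fall, and Ant Four Rooms variants are typically defined in the HRL codebases (e.g., the linked repository) rather than in stock Gymnasium, so the environment ID here is only a stand-in for illustration.

```python
# Minimal sketch of loading a MuJoCo environment through Gymnasium.
# "Ant-v4" is a stand-in; the maze/push/fall variants come from HRL codebases.
import gymnasium as gym

env = gym.make("Ant-v4")  # requires the MuJoCo extra: pip install "gymnasium[mujoco]"
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()          # random policy for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```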
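Finally, the quoted evaluation protocol (evaluation every 25,000 training timesteps, averaged over 10 randomly seeded trials) could be realized along the following lines; `make_env` and `evaluate_policy` are hypothetical helpers, not part of the released code.

```python
# Hypothetical sketch of the quoted evaluation protocol:
# evaluate every 25,000 training timesteps, averaging over 10 random seeds.
EVAL_INTERVAL = 25_000
NUM_SEEDS = 10

def periodic_evaluation(step, policy, make_env, evaluate_policy):
    """Return the mean return over NUM_SEEDS seeded runs, or None if not due."""
    if step % EVAL_INTERVAL != 0:
        return None
    returns = []
    for seed in range(NUM_SEEDS):
        env = make_env(seed=seed)              # freshly seeded environment
        returns.append(evaluate_policy(policy, env))
        env.close()
    return sum(returns) / NUM_SEEDS
```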