PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling
Authors: Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Bedi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves greater than 50% success rates in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress. |
| Researcher Affiliation | Academia | (1) CSE Dept., IIT Kanpur, Kanpur, India; (2) U.S. Army Research Laboratory, Adelphi, MD, USA; (3) University of Texas, Austin, Texas, USA; (4) Department of Computer Science, University of Bath, Bath, UK; (5) CS Dept., University of Central Florida, Orlando, Florida, USA. |
| Pseudocode | Yes | Pseudo-code for PIPER is provided in Algorithm 1. |
| Open Source Code | Yes | The implementation code and data are provided here. |
| Open Datasets | Yes | We evaluate PIPER on five robotic navigation and manipulation tasks: (i) maze navigation, (ii) pick and place (Andrychowicz et al., 2017), (iii) push, (iv) hollow, and (v) franka kitchen (Gupta et al., 2019). |
| Dataset Splits | No | The paper does not explicitly specify dataset splits (e.g., percentages or sample counts) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using 'Soft Actor Critic (SAC)' and 'Adam optimizer' but does not provide specific version numbers for these or any other software libraries or dependencies. |
| Experiment Setup | Yes | Here, we list the additional hyper-parameters used in PIPER: activation: tanh [activation for reward model]; layers: 3 [number of layers in the critic/actor networks]; hidden: 512 [number of neurons in each hidden layer]; Q lr: 0.001 [critic learning rate]; pi lr: 0.001 [actor learning rate]; buffer size: int(1E7) [for experience replay]; tau: 0.8 [polyak averaging coefficient]; clip obs: 200 [clip observation]; n cycles: 1 [per epoch]; n batches: 10 [training batches per cycle]; batch size: 1024 [batch size hyper-parameter]; reward batch size: 50 [reward batch size for PEBBLE and RFLAT]; random eps: 0.2 [percentage of time a random action is taken]; alpha: 0.05 [weightage parameter for SAC]; noise eps: 0.05 [std of Gaussian noise added to not-completely-random actions]; norm eps: 0.01 [epsilon used for observation normalization]; norm clip: 5 [normalized observations are clipped to this value]; adam beta1: 0.9 [beta 1 for Adam optimizer]; adam beta2: 0.999 [beta 2 for Adam optimizer] |
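
For readers attempting to reproduce the setup, the values quoted in the Experiment Setup row can be collected into a minimal Python sketch. This is not the authors' released code: the dictionary keys, the ReLU activations in the critic, the observation/action dimensions, and the direction of the polyak update are assumptions made only for illustration; the numeric values come from the quote above.

```python
import copy
import torch
import torch.nn as nn

# Illustrative sketch only: hyper-parameters reported in the paper's appendix,
# gathered into a config dict. Key names are assumptions, not the authors' code.
piper_config = {
    "activation": "tanh",       # activation for the reward model
    "layers": 3,                # layers in the critic/actor networks
    "hidden": 512,              # neurons per hidden layer
    "Q_lr": 1e-3,               # critic learning rate
    "pi_lr": 1e-3,              # actor learning rate
    "buffer_size": int(1e7),    # experience replay capacity
    "tau": 0.8,                 # polyak averaging coefficient
    "clip_obs": 200,            # observation clipping range
    "n_cycles": 1,              # cycles per epoch
    "n_batches": 10,            # training batches per cycle
    "batch_size": 1024,
    "reward_batch_size": 50,    # reward batches for PEBBLE and RFLAT
    "random_eps": 0.2,          # fraction of fully random actions
    "alpha": 0.05,              # SAC entropy weight
    "noise_eps": 0.05,          # std of Gaussian noise on non-random actions
    "norm_eps": 0.01,           # epsilon for observation normalization
    "norm_clip": 5,             # clip for normalized observations
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
}

# Hypothetical dimensions; the paper does not report them per environment.
obs_dim, act_dim = 32, 4

# A 3-layer critic with 512 hidden units, matching the reported sizes.
# ReLU is an assumption: the quote only specifies tanh for the reward model.
critic = nn.Sequential(
    nn.Linear(obs_dim + act_dim, piper_config["hidden"]), nn.ReLU(),
    nn.Linear(piper_config["hidden"], piper_config["hidden"]), nn.ReLU(),
    nn.Linear(piper_config["hidden"], 1),
)

# Adam optimizer wired with the reported learning rate and beta coefficients.
critic_opt = torch.optim.Adam(
    critic.parameters(),
    lr=piper_config["Q_lr"],
    betas=(piper_config["adam_beta1"], piper_config["adam_beta2"]),
)

# Soft target update using the reported polyak coefficient. Whether tau weights
# the online or the target parameters is a convention the quote does not state;
# this sketch assumes target <- tau * target + (1 - tau) * online.
target_critic = copy.deepcopy(critic)
with torch.no_grad():
    for p, tp in zip(critic.parameters(), target_critic.parameters()):
        tp.mul_(piper_config["tau"]).add_((1 - piper_config["tau"]) * p)
```

Since the paper does not pin library versions (see the Software Dependencies row), this sketch only shows how the reported optimizer and target-update settings could be wired together, not the exact training loop used in the experiments.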