PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

Authors: Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Bedi

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves greater than 50% success rates in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress.
Researcher Affiliation | Academia | (1) CSE dept., IIT Kanpur, Kanpur, India. (2) U.S. Army Research Laboratory, Adelphi, MD, USA. (3) University of Texas, Austin, Texas, USA. (4) Department of Computer Science, University of Bath, Bath, UK. (5) CS dept., University of Central Florida, Orlando, Florida, USA.
Pseudocode | Yes | Pseudo-code for PIPER is provided in Algorithm 1 (a generic sketch of the hindsight-relabeling step appears after this table).
Open Source Code | Yes | The implementation code and data are provided here.
Open Datasets | Yes | We evaluate PIPER on five robotic navigation and manipulation tasks: (i) maze navigation, (ii) pick and place (Andrychowicz et al., 2017), (iii) push, (iv) hollow, and (v) franka kitchen (Gupta et al., 2019).
Dataset Splits | No | The paper does not explicitly specify dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) and the Adam optimizer, but it does not provide version numbers for these or for any other software libraries or dependencies.
Experiment Setup | Yes | The additional hyper-parameters used in PIPER are:
- activation: tanh [activation for reward model]
- layers: 3 [number of layers in the critic/actor networks]
- hidden: 512 [number of neurons in each hidden layer]
- Q lr: 0.001 [critic learning rate]
- pi lr: 0.001 [actor learning rate]
- buffer size: int(1E7) [for experience replay]
- tau: 0.8 [polyak averaging coefficient]
- clip obs: 200 [clip observation]
- n cycles: 1 [per epoch]
- n batches: 10 [training batches per cycle]
- batch size: 1024 [batch size hyper-parameter]
- reward batch size: 50 [reward batch size for PEBBLE and RFLAT]
- random eps: 0.2 [percentage of time a random action is taken]
- alpha: 0.05 [weighting parameter for SAC]
- noise eps: 0.05 [std of gaussian noise added to not-completely-random actions]
- norm eps: 0.01 [epsilon used for observation normalization]
- norm clip: 5 [normalized observations are clipped to this value]
- adam beta1: 0.9 [beta 1 for Adam optimizer]
- adam beta2: 0.999 [beta 2 for Adam optimizer]
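For convenience, the hyper-parameter listing above can be collected into a single configuration object. The sketch below is a minimal, hypothetical Python rendering: the dictionary name and key spellings are our own (the paper reports the values and their meanings, but not this structure), and it is not the authors' released configuration.

```python
# Minimal sketch of the PIPER hyper-parameters above as a plain Python dict.
# Key names are hypothetical; values and comments follow the paper's listing.
PIPER_HYPERPARAMS = {
    "activation": "tanh",       # activation for the reward model
    "layers": 3,                # layers in the critic/actor networks
    "hidden": 512,              # neurons per hidden layer
    "q_lr": 0.001,              # critic learning rate
    "pi_lr": 0.001,             # actor learning rate
    "buffer_size": int(1e7),    # experience replay capacity
    "tau": 0.8,                 # polyak averaging coefficient
    "clip_obs": 200,            # observation clipping threshold
    "n_cycles": 1,              # cycles per epoch
    "n_batches": 10,            # training batches per cycle
    "batch_size": 1024,
    "reward_batch_size": 50,    # reward batch size for PEBBLE and RFLAT
    "random_eps": 0.2,          # fraction of actions taken at random
    "alpha": 0.05,              # SAC weighting parameter
    "noise_eps": 0.05,          # std of Gaussian noise on non-random actions
    "norm_eps": 0.01,           # epsilon for observation normalization
    "norm_clip": 5,             # clip range for normalized observations
    "adam_beta1": 0.9,          # Adam beta_1
    "adam_beta2": 0.999,        # Adam beta_2
}
```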
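Algorithm 1 itself is not reproduced in this summary. As a rough illustration of the hindsight relabeling named in the paper's title, the sketch below shows generic HER-style goal relabeling with the "future" strategy (Andrychowicz et al., 2017). The function name, transition layout, and reward_fn argument are hypothetical; this is not PIPER's Algorithm 1, which additionally uses primitive-informed, preference-based rewards.

```python
import random

def hindsight_relabel(episode, reward_fn, k=4):
    """Generic HER-style relabeling with the 'future' goal-sampling strategy.

    episode: list of (state, action, next_state, achieved_goal, desired_goal)
             tuples, where achieved_goal is the goal reached in next_state.
    reward_fn(achieved_goal, goal): sparse reward, e.g. 0 if reached else -1.
    Returns (state, action, next_state, goal, reward) tuples: the original
    transitions plus k relabeled copies of each.
    """
    relabeled = []
    T = len(episode)
    for t, (s, a, s_next, ag, g) in enumerate(episode):
        # Keep the original transition with its environment goal.
        relabeled.append((s, a, s_next, g, reward_fn(ag, g)))
        # Add k copies whose goal is an achieved goal from a later timestep.
        for _ in range(k):
            future_t = random.randint(t, T - 1)
            new_g = episode[future_t][3]  # achieved goal at the future step
            relabeled.append((s, a, s_next, new_g, reward_fn(ag, new_g)))
    return relabeled
```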