Hindsight Trust Region Policy Optimization

Authors: Hanbo Zhang, Site Bai, Xuguang Lan, David Hsu, Nanning Zheng

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
Researcher Affiliation | Academia | 1Xi'an Jiaotong University, 2National University of Singapore; {zhanghanbo163, best99317}@stu.xjtu.edu.cn, xglan@xjtu.edu.cn, dyhsu@comp.nus.edu.sg, nnzheng@xjtu.edu.cn
Pseudocode | Yes | The complete algorithm of HGF and HTRPO is presented in Appendix E.
Open Source Code | No | The paper references third-party baselines (OpenAI Baselines) but does not provide a specific link to, or explicit statement about, open-source code for the proposed HTRPO method.
Open Datasets | Yes | Firstly, we test HTRPO in simple benchmarks established in previous work [Andrychowicz et al., 2017], including 4-to-100-Bit Flipping tasks. Secondly, we verify HTRPO's performance in Atari games like Ms. Pac-Man [Bellemare et al., 2013] with complex raw image input... Finally, we test HTRPO in simulated robot control tasks like Reach, Push, Slide and Pick And Place in the Fetch [Plappert et al., 2018] robot environment. (A minimal sketch of the bit-flipping task appears after this table.)
Dataset Splits | No | The paper mentions using benchmark environments but does not explicitly describe validation data splits or their usage.
Hardware Specification | Yes | All experiments are conducted on a platform with an NVIDIA GeForce GTX 1080 Ti.
Software Dependencies | No | The paper mentions using DQN and DDPG based on OpenAI Baselines but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Detailed settings of hyperparameters are listed in Appendix F.2.
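The 4-to-100-Bit Flipping tasks cited under Open Datasets are the sparse-reward benchmark introduced by Andrychowicz et al. (2017): the state is a length-n binary vector, each action flips one bit, and a non-penalizing reward is given only when the state exactly matches a goal bit string. Below is a minimal Python sketch of such an environment for illustration only; the class and method names are assumptions, and this is not the authors' or the HER paper's reference implementation.

```python
import numpy as np


class BitFlippingEnv:
    """Illustrative sparse-reward bit-flipping task (assumed formulation,
    following Andrychowicz et al., 2017; not the authors' code).

    The state is a binary vector of n bits; each action flips one bit.
    The reward is 0 when the state equals the goal and -1 otherwise,
    so for large n the reward signal is almost never seen by chance.
    """

    def __init__(self, n_bits=4, max_steps=None, seed=0):
        self.n_bits = n_bits
        self.max_steps = max_steps if max_steps is not None else n_bits
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Random initial bit string and random goal bit string.
        self.state = self.rng.integers(0, 2, size=self.n_bits)
        self.goal = self.rng.integers(0, 2, size=self.n_bits)
        self.steps = 0
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        # action is an integer in [0, n_bits) selecting which bit to flip.
        self.state[action] ^= 1
        self.steps += 1
        success = bool(np.array_equal(self.state, self.goal))
        reward = 0.0 if success else -1.0  # sparse reward
        done = success or self.steps >= self.max_steps
        return self.state.copy(), reward, done, {"is_success": success}


if __name__ == "__main__":
    env = BitFlippingEnv(n_bits=8, seed=1)
    state, goal = env.reset()
    done = False
    while not done:
        action = int(env.rng.integers(env.n_bits))  # random policy
        state, reward, done, info = env.step(action)
    print("success:", info["is_success"])
```

With small n (e.g., 4 bits), random exploration reaches the goal often enough for a standard on-policy method such as TRPO to learn, but as n grows toward 100 the chance of hitting the goal at random vanishes, which is the regime where hindsight-based methods such as HTRPO are evaluated in the paper.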