HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation

Authors: Boyan Li, Hongyao Tang, Yan Zheng, Jianye Hao, Pengyi Li, Zhen Wang, Zhaopeng Meng, Li Wang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate HyAR in a variety of environments with discrete-continuous action spaces. The results demonstrate the superiority of HyAR when compared with previous baselines, especially for high-dimensional action spaces.
Researcher Affiliation | Academia | (1) College of Intelligence and Computing, Tianjin University; (2) School of Artificial Intelligence, Optics and Electronics (iOPEN) and School of Cybersecurity, Northwestern Polytechnical University
Pseudocode | Yes | Algorithm 1 describes the pseudo-code of HyAR-TD3, containing two major stages: (1) a warm-up stage and (2) a training stage. (A control-flow sketch of these two stages is given after this table.)
Open Source Code | Yes | For reproducibility, codes are provided in the supplementary material.
Open Datasets | Yes | Benchmarks: Fig. 4 visualizes the evaluation benchmarks, including Platform and Goal from (Masson et al., 2016), Catch Point from (Fan et al., 2019), and a newly designed Hard Move, built specifically for evaluation in larger hybrid action spaces. We also build a complex version of Goal, called Hard Goal. All benchmarks have hybrid actions and require the agent to select reasonable actions to complete the task. See the complete description of the benchmarks in Appendix B.1. (See the hybrid action space sketch after this table.)
Dataset Splits | No | The paper describes a "warm-up stage" for pre-training representation models and a "training stage" for policy learning. It does not mention explicit validation set splits (e.g., 80/10/10) or their usage for hyperparameter tuning in the conventional sense.
Hardware Specification | Yes | All experiments were run on a single NVIDIA GeForce GTX 2080Ti GPU.
Software Dependencies | Yes | Our codes are implemented with Python 3.7.9 and Torch 1.7.1.
Experiment Setup | Yes | Complete details of setups are provided in Appendix B. For all experiments, we give each baseline the same training budget. For our algorithms, we use a random strategy to interact with the environment for 5000 episodes during the warm-up stage. For each experiment, we run 5 trials and report the average results. Table 5 shows the common hyperparameters of the algorithms used in all our experiments.
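
The benchmarks listed above (Platform, Goal, Catch Point, Hard Move) all use discrete-continuous (hybrid) actions: a discrete action type paired with continuous parameters. Below is a minimal sketch, not taken from the authors' code, of how such a hybrid action space is commonly declared with gym.spaces; the action count and parameter dimension are illustrative placeholders, not the actual benchmark specifications.

```python
# Minimal sketch of a discrete-continuous (hybrid) action space.
# NUM_DISCRETE_ACTIONS and PARAM_DIM are illustrative, not the paper's values.
import numpy as np
from gym import spaces

NUM_DISCRETE_ACTIONS = 3   # e.g., a handful of action types, as in Platform
PARAM_DIM = 1              # continuous parameter(s) attached to each action type

hybrid_action_space = spaces.Tuple((
    spaces.Discrete(NUM_DISCRETE_ACTIONS),            # which discrete action type
    spaces.Box(low=-1.0, high=1.0,
               shape=(NUM_DISCRETE_ACTIONS, PARAM_DIM),
               dtype=np.float32),                      # parameters for every type
))

# Sampling yields a (discrete index, parameter array) pair; an environment
# typically uses only the parameter row matching the chosen discrete index.
k, all_params = hybrid_action_space.sample()
chosen_params = all_params[k]
```
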
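The two-stage structure of Algorithm 1 (a warm-up stage followed by a training stage), together with the experiment setup's random interaction strategy for 5000 warm-up episodes, can be summarized by the control-flow sketch below. Every component here (the toy environment, the stand-in policy, and the stand-in latent-to-hybrid decoding) is a placeholder chosen only to keep the sketch self-contained; it does not reproduce the paper's HyAR-TD3 implementation or its representation learning.

```python
# Toy sketch of the two-stage structure described for Algorithm 1 (HyAR-TD3).
# Stage 1 fills a buffer with random hybrid actions (used in the paper to
# pre-train the hybrid action representation); Stage 2 is where policy
# learning happens. All components below are placeholders.
import random

class ToyHybridEnv:
    """Placeholder environment with a hybrid (discrete k, continuous x) action."""
    def reset(self):
        self.t = 0
        return 0.0                                   # trivial constant state

    def sample_action(self):
        return (random.randrange(3), random.uniform(-1.0, 1.0))

    def step(self, action):
        self.t += 1
        k, x = action
        reward = -abs(x - 0.5 * k)                   # arbitrary toy reward
        return 0.0, reward, self.t >= 10             # state, reward, done

def warm_up(env, buffer, episodes):
    """Stage 1: interact with a random strategy and store transitions
    (the paper uses 5000 warm-up episodes)."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = env.sample_action()
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state

def training(env, buffer, steps):
    """Stage 2: policy learning. In HyAR-TD3 the actor outputs a latent action
    that a learned representation decodes into a hybrid action; here both are
    replaced by random stand-ins to keep the sketch runnable."""
    state = env.reset()
    for _ in range(steps):
        latent = random.uniform(-1.0, 1.0)           # stand-in for the TD3 actor output
        action = (random.randrange(3), latent)       # stand-in for decoding latent -> hybrid
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

buffer, env = [], ToyHybridEnv()
warm_up(env, buffer, episodes=50)                    # 5000 episodes in the paper's setup
training(env, buffer, steps=200)
print(f"collected {len(buffer)} transitions across both stages")
```
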