HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation

Authors: Boyan Li, Hongyao Tang, Yan Zheng, Jianye Hao, Pengyi Li, Zhen Wang, Zhaopeng Meng, Li Wang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate HyAR in a variety of environments with discrete-continuous action spaces. The results demonstrate the superiority of HyAR when compared with previous baselines, especially for high-dimensional action spaces.
Researcher Affiliation | Academia | (1) College of Intelligence and Computing, Tianjin University; (2) School of Artificial Intelligence, Optics and Electronics (iOPEN) and School of Cybersecurity, Northwestern Polytechnical University
Pseudocode | Yes | Algorithm 1 describes the pseudo-code of HyAR-TD3, containing two major stages: (1) a warm-up stage and (2) a training stage. (A control-flow sketch of these two stages is given after this table.)
Open Source Code | Yes | For reproducibility, codes are provided in the supplementary material.
Open Datasets | Yes | Benchmarks: Fig. 4 visualizes the evaluation benchmarks, including Platform and Goal from (Masson et al., 2016), Catch Point from (Fan et al., 2019), and a newly designed Hard Move, built specifically for evaluation in larger hybrid action spaces. We also build a complex version of Goal, called Hard Goal. All benchmarks have hybrid actions and require the agent to select reasonable actions to complete the task. See the complete description of the benchmarks in Appendix B.1. (See the hybrid action space sketch after this table.)
Dataset Splits | No | The paper describes a "warm-up stage" for pre-training representation models and a "training stage" for policy learning. It does not mention explicit validation set splits (e.g., 80/10/10) or their usage for hyperparameter tuning in the conventional sense.
Hardware Specification | Yes | All experiments were run on a single NVIDIA GeForce GTX 2080Ti GPU.
Software Dependencies | Yes | Our codes are implemented with Python 3.7.9 and Torch 1.7.1.
Experiment Setup | Yes | Complete details of setups are provided in Appendix B. For all experiments, we give each baseline the same training budget. For our algorithms, we use a random strategy to interact with the environment for 5000 episodes during the warm-up stage. For each experiment, we run 5 trials and report the average results. Table 5 shows the common hyperparameters of the algorithms used in all our experiments.
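
The benchmarks listed above (Platform, Goal, Catch Point, Hard Move) all use discrete-continuous (hybrid) actions: a discrete action type paired with continuous parameters. Below is a minimal sketch, not taken from the authors' code, of how such a hybrid action space is commonly declared with gym.spaces; the action count and parameter dimension are illustrative placeholders, not the actual benchmark specifications.

```python
# Minimal sketch of a discrete-continuous (hybrid) action space.
# NUM_DISCRETE_ACTIONS and PARAM_DIM are illustrative, not the paper's values.
import numpy as np
from gym import spaces

NUM_DISCRETE_ACTIONS = 3   # e.g., a handful of action types, as in Platform
PARAM_DIM = 1              # continuous parameter(s) attached to each action type

hybrid_action_space = spaces.Tuple((
    spaces.Discrete(NUM_DISCRETE_ACTIONS),            # which discrete action type
    spaces.Box(low=-1.0, high=1.0,
               shape=(NUM_DISCRETE_ACTIONS, PARAM_DIM),
               dtype=np.float32),                      # parameters for every type
))

# Sampling yields a (discrete index, parameter array) pair; an environment
# typically uses only the parameter row matching the chosen discrete index.
k, all_params = hybrid_action_space.sample()
chosen_params = all_params[k]
```
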
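The two-stage structure of Algorithm 1 (a warm-up stage followed by a training stage), together with the experiment setup's random interaction strategy for 5000 warm-up episodes, can be summarized by the control-flow sketch below. Every component here (the toy environment, the stand-in policy, and the stand-in latent-to-hybrid decoding) is a placeholder chosen only to keep the sketch self-contained; it does not reproduce the paper's HyAR-TD3 implementation or its representation learning.

```python
# Toy sketch of the two-stage structure described for Algorithm 1 (HyAR-TD3).
# Stage 1 fills a buffer with random hybrid actions (used in the paper to
# pre-train the hybrid action representation); Stage 2 is where policy
# learning happens. All components below are placeholders.
import random

class ToyHybridEnv:
    """Placeholder environment with a hybrid (discrete k, continuous x) action."""
    def reset(self):
        self.t = 0
        return 0.0                                   # trivial constant state

    def sample_action(self):
        return (random.randrange(3), random.uniform(-1.0, 1.0))

    def step(self, action):
        self.t += 1
        k, x = action
        reward = -abs(x - 0.5 * k)                   # arbitrary toy reward
        return 0.0, reward, self.t >= 10             # state, reward, done

def warm_up(env, buffer, episodes):
    """Stage 1: interact with a random strategy and store transitions
    (the paper uses 5000 warm-up episodes)."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = env.sample_action()
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state

def training(env, buffer, steps):
    """Stage 2: policy learning. In HyAR-TD3 the actor outputs a latent action
    that a learned representation decodes into a hybrid action; here both are
    replaced by random stand-ins to keep the sketch runnable."""
    state = env.reset()
    for _ in range(steps):
        latent = random.uniform(-1.0, 1.0)           # stand-in for the TD3 actor output
        action = (random.randrange(3), latent)       # stand-in for decoding latent -> hybrid
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

buffer, env = [], ToyHybridEnv()
warm_up(env, buffer, episodes=50)                    # 5000 episodes in the paper's setup
training(env, buffer, steps=200)
print(f"collected {len(buffer)} transitions across both stages")
```
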