Variational oracle guiding for reinforcement learning

Authors: Dongqi Han, Tadashi Kozuno, Xufang Luo, Zhao-Yun Chen, Kenji Doya, Yuqing Yang, Dongsheng Li

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the effectiveness of VLOG in online and offline RL domains with tasks ranging from video games to a challenging tile-based game, Mahjong. Furthermore, we publish the Mahjong environment and an offline RL dataset as a benchmark to facilitate future research on oracle guiding. [...] We empirically show that VLOG contributes to better performance in a variety of decision-making tasks in both online and offline RL domains. |
| Researcher Affiliation | Collaboration | Dongqi Han (1), Tadashi Kozuno (2), Xufang Luo (3), Zhao-Yun Chen (4), Kenji Doya (1), Yuqing Yang (3), and Dongsheng Li (3). (1) Okinawa Institute of Science and Technology; (2) University of Alberta; (3) Microsoft Research Asia; (4) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | No | The paper does not contain a pseudocode or algorithm block; it describes the neural-network structure and algorithm logic in text. (A hedged sketch of a VLOG-style objective is given below the table.) |
| Open Source Code | Yes | The source code of VLOG can be found in Supplementary Material. [...] The Mahjong environment we used in this paper is available at https://github.com/pymahjong/pymahjong for reproducibility. However, we recommend using the newer version https://github.com/Agony5757/mahjong, which is better supported by the authors and much faster. |
| Open Datasets | Yes | We processed about 23M steps of human expert plays from the online Mahjong game platform Tenhou (https://tenhou.net/mjlog.html) into a dataset for offline RL (data were augmented using the symmetry in Mahjong, see Appendix F). [...] Finally, we publish the dataset of Mahjong for offline RL and the corresponding RL environment so as to facilitate future research on oracle guiding. (A suit-symmetry augmentation sketch is given below the table.) |
| Dataset Splits | No | The paper does not explicitly provide percentages or counts for training, validation, and test splits. It mentions an "offline RL dataset" but not how it was split. (An illustrative split convention is sketched below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using PyTorch (Appendix B.1.1) but does not provide specific version numbers for PyTorch or other key software dependencies. (A version-recording snippet is given below the table.) |
| Experiment Setup | Yes | As DRL is susceptible to the choice of hyper-parameters, introducing any new hyper-parameters might obscure the effect of oracle guiding. Double DQN and dueling architecture are preferable for the base algorithm since they require no additional hyper-parameters, in contrast to other DQN variants (Hessel et al., 2018), such as prioritized experience replay (Schaul et al., 2016), noisy network (Fortunato et al., 2018), categorical DQN (Bellemare et al., 2017), and distributed RL (Kapturowski et al., 2018). Importantly, we used the same hyper-parameter setting for all methods and environments as much as possible (see Appendix B.2). [...] We summarize the hyper-parameters in Table 3. (A dueling double-DQN sketch is given below the table.) |
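Since the paper provides no algorithm block, the following is a minimal PyTorch sketch of what a VLOG-style objective can look like: an oracle posterior over a latent variable (conditioned on privileged information) drives the TD loss during training, while a KL term pulls it toward the executor's prior (conditioned on the regular observation), which is all that is used at test time. Everything here (diagonal-Gaussian latents, the fixed KL weight `beta`, the hypothetical `q_head`) is an illustrative assumption, not the authors' implementation; their architecture and weighting details are described in the paper's text and appendix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an observation to mean and log-variance of a diagonal Gaussian over z."""
    def __init__(self, obs_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, obs):
        h = self.net(obs)
        return self.mu(h), self.logvar(h)

def vlog_style_loss(prior_enc, post_enc, q_head, obs, oracle_obs,
                    action, td_target, beta=0.1):
    """One illustrative VLOG-style training step (not the authors' code)."""
    mu_p, logvar_p = prior_enc(obs)         # executor prior from partial obs
    mu_q, logvar_q = post_enc(oracle_obs)   # oracle posterior from privileged obs

    # Reparameterized sample from the posterior (used only during training)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()

    # TD loss on Q-values computed from the latent
    q = q_head(z).gather(1, action.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q, td_target)

    # KL( q(z|oracle_obs) || p(z|obs) ) for diagonal Gaussians
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1.0).sum(dim=1).mean()

    return td_loss + beta * kl
```

At deployment the executor acts alone: z is sampled (or taken as the mean) from the prior encoder, so no oracle information is needed.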
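On data augmentation: the quoted "symmetry in Mahjong" presumably refers, at least in part, to the interchangeability of the three numbered suits. The sketch below assumes a hypothetical suit-major tile encoding (9 man, 9 pin, 9 sou, then 7 honor tiles along the last axis); the encoding actually used for the published dataset is described in the paper's Appendix F and may differ.

```python
import itertools
import numpy as np

def suit_permutations(obs):
    """Yield copies of `obs` with the three numbered suits permuted.

    Assumes a hypothetical encoding whose last axis holds 34 tile types,
    ordered suit-major (9 man, 9 pin, 9 sou) followed by 7 honor tiles.
    Honor tiles have no suit symmetry and stay fixed. Includes the
    identity permutation, so this yields 6 variants per observation.
    """
    for perm in itertools.permutations(range(3)):
        out = obs.copy()
        for dst, src in enumerate(perm):
            out[..., dst * 9:(dst + 1) * 9] = obs[..., src * 9:(src + 1) * 9]
        yield out
```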
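On dataset splits: since the paper does not state one, the snippet below is only a generic convention, not the authors' protocol. Splitting by episode rather than by transition avoids leaking near-duplicate states across splits; the 90/5/5 ratio and episode count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_episodes = 10_000                      # placeholder count
idx = rng.permutation(n_episodes)        # shuffle episode indices

n_train = int(0.90 * n_episodes)
n_val = int(0.05 * n_episodes)

train_ids = idx[:n_train]
val_ids = idx[n_train:n_train + n_val]
test_ids = idx[n_train + n_val:]
```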
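On software dependencies: absent pinned versions, a reproducer can at least record the stack they ran with. A small snippet for logging versions alongside experiment outputs (the set of packages shown is an example):

```python
import sys
import numpy as np
import torch

# Record the exact software stack so results can later be
# matched to the dependency versions that produced them.
versions = {
    "python": sys.version.split()[0],
    "torch": torch.__version__,
    "numpy": np.__version__,
    "cuda": torch.version.cuda,  # None if CPU-only build
}
print(versions)
```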
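On the experiment setup: the base algorithm is stated to be Double DQN with a dueling architecture. The sketch below shows the two standard components generically: the dueling decomposition Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'), and the Double-DQN target, in which the online network selects the next action and the target network evaluates it. Layer sizes and `gamma` are placeholders, not the paper's Table 3 values.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=1, keepdim=True)

@torch.no_grad()
def double_dqn_target(online, target, reward, next_obs, done, gamma=0.99):
    """Double DQN: online net picks the action, target net evaluates it."""
    next_a = online(next_obs).argmax(dim=1, keepdim=True)
    next_q = target(next_obs).gather(1, next_a).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```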