Variational oracle guiding for reinforcement learning

Authors: Dongqi Han, Tadashi Kozuno, Xufang Luo, Zhao-Yun Chen, Kenji Doya, Yuqing Yang, Dongsheng Li

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the effectiveness of VLOG in online and offline RL domains with tasks ranging from video games to a challenging tile-based game, Mahjong. Furthermore, we publish the Mahjong environment and an offline RL dataset as a benchmark to facilitate future research on oracle guiding. [...] We empirically show that VLOG contributes to better performance in a variety of decision-making tasks in both online and offline RL domains. |
| Researcher Affiliation | Collaboration | Dongqi Han (1), Tadashi Kozuno (2), Xufang Luo (3), Zhao-Yun Chen (4), Kenji Doya (1), Yuqing Yang (3), and Dongsheng Li (3). (1) Okinawa Institute of Science and Technology; (2) University of Alberta; (3) Microsoft Research Asia; (4) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | No | The paper does not contain a pseudocode or algorithm block; it describes the neural-network structure and algorithm logic in text. (A hedged sketch of a VLOG-style objective is given below the table.) |
| Open Source Code | Yes | The source code of VLOG can be found in Supplementary Material. [...] The Mahjong environment we used in this paper is available at https://github.com/pymahjong/pymahjong for reproducibility. However, we recommend using the newer version https://github.com/Agony5757/mahjong, which is better supported by the authors and much faster. |
| Open Datasets | Yes | We processed about 23M steps of human expert plays from the online Mahjong game platform Tenhou (https://tenhou.net/mjlog.html) into a dataset for offline RL (data were augmented using the symmetry in Mahjong, see Appendix F). [...] Finally, we publish the dataset of Mahjong for offline RL and the corresponding RL environment so as to facilitate future research on oracle guiding. (A suit-symmetry augmentation sketch is given below the table.) |
| Dataset Splits | No | The paper does not explicitly provide percentages or counts for training, validation, and test splits. It mentions an "offline RL dataset" but not how it was split. (An illustrative split convention is sketched below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using PyTorch (Appendix B.1.1) but does not provide specific version numbers for PyTorch or other key software dependencies. (A version-recording snippet is given below the table.) |
| Experiment Setup | Yes | As DRL is susceptible to the choice of hyper-parameters, introducing any new hyper-parameters might obscure the effect of oracle guiding. Double DQN and dueling architecture are preferable for the base algorithm since they require no additional hyper-parameters, in contrast to other DQN variants (Hessel et al., 2018), such as prioritized experience replay (Schaul et al., 2016), noisy network (Fortunato et al., 2018), categorical DQN (Bellemare et al., 2017), and distributed RL (Kapturowski et al., 2018). Importantly, we used the same hyper-parameter setting for all methods and environments as much as possible (see Appendix B.2). [...] We summarize the hyper-parameters in Table 3. (A dueling double-DQN sketch is given below the table.) |
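Since the paper provides no algorithm block, the following is a minimal PyTorch sketch of what a VLOG-style objective can look like: an oracle posterior over a latent variable (conditioned on privileged information) drives the TD loss during training, while a KL term pulls it toward the executor's prior (conditioned on the regular observation), which is all that is used at test time. Everything here (diagonal-Gaussian latents, the fixed KL weight `beta`, the hypothetical `q_head`) is an illustrative assumption, not the authors' implementation; their architecture and weighting details are described in the paper's text and appendix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an observation to mean and log-variance of a diagonal Gaussian over z."""
    def __init__(self, obs_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, obs):
        h = self.net(obs)
        return self.mu(h), self.logvar(h)

def vlog_style_loss(prior_enc, post_enc, q_head, obs, oracle_obs,
                    action, td_target, beta=0.1):
    """One illustrative VLOG-style training step (not the authors' code)."""
    mu_p, logvar_p = prior_enc(obs)         # executor prior from partial obs
    mu_q, logvar_q = post_enc(oracle_obs)   # oracle posterior from privileged obs

    # Reparameterized sample from the posterior (used only during training)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()

    # TD loss on Q-values computed from the latent
    q = q_head(z).gather(1, action.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q, td_target)

    # KL( q(z|oracle_obs) || p(z|obs) ) for diagonal Gaussians
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1.0).sum(dim=1).mean()

    return td_loss + beta * kl
```

At deployment the executor acts alone: z is sampled (or taken as the mean) from the prior encoder, so no oracle information is needed.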
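On data augmentation: the quoted "symmetry in Mahjong" presumably refers, at least in part, to the interchangeability of the three numbered suits. The sketch below assumes a hypothetical suit-major tile encoding (9 man, 9 pin, 9 sou, then 7 honor tiles along the last axis); the encoding actually used for the published dataset is described in the paper's Appendix F and may differ.

```python
import itertools
import numpy as np

def suit_permutations(obs):
    """Yield copies of `obs` with the three numbered suits permuted.

    Assumes a hypothetical encoding whose last axis holds 34 tile types,
    ordered suit-major (9 man, 9 pin, 9 sou) followed by 7 honor tiles.
    Honor tiles have no suit symmetry and stay fixed. Includes the
    identity permutation, so this yields 6 variants per observation.
    """
    for perm in itertools.permutations(range(3)):
        out = obs.copy()
        for dst, src in enumerate(perm):
            out[..., dst * 9:(dst + 1) * 9] = obs[..., src * 9:(src + 1) * 9]
        yield out
```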
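On dataset splits: since the paper does not state one, the snippet below is only a generic convention, not the authors' protocol. Splitting by episode rather than by transition avoids leaking near-duplicate states across splits; the 90/5/5 ratio and episode count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_episodes = 10_000                      # placeholder count
idx = rng.permutation(n_episodes)        # shuffle episode indices

n_train = int(0.90 * n_episodes)
n_val = int(0.05 * n_episodes)

train_ids = idx[:n_train]
val_ids = idx[n_train:n_train + n_val]
test_ids = idx[n_train + n_val:]
```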
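On software dependencies: absent pinned versions, a reproducer can at least record the stack they ran with. A small snippet for logging versions alongside experiment outputs (the set of packages shown is an example):

```python
import sys
import numpy as np
import torch

# Record the exact software stack so results can later be
# matched to the dependency versions that produced them.
versions = {
    "python": sys.version.split()[0],
    "torch": torch.__version__,
    "numpy": np.__version__,
    "cuda": torch.version.cuda,  # None if CPU-only build
}
print(versions)
```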
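On the experiment setup: the base algorithm is stated to be Double DQN with a dueling architecture. The sketch below shows the two standard components generically: the dueling decomposition Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'), and the Double-DQN target, in which the online network selects the next action and the target network evaluates it. Layer sizes and `gamma` are placeholders, not the paper's Table 3 values.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=1, keepdim=True)

@torch.no_grad()
def double_dqn_target(online, target, reward, next_obs, done, gamma=0.99):
    """Double DQN: online net picks the action, target net evaluates it."""
    next_a = online(next_obs).argmax(dim=1, keepdim=True)
    next_q = target(next_obs).gather(1, next_a).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```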