Revisiting Intrinsic Reward for Exploration in Procedurally Generated Environments

Authors: Kaixin Wang, Kuangqi Zhou, Bingyi Kang, Jiashi Feng, Shuicheng Yan

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider lifelong and episodic intrinsic rewards used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that only using episodic intrinsic rewards can match or surpass prior state-of-the-art methods. (A hedged sketch of such a lifelong-episodic combination appears after the table.)
Researcher Affiliation | Collaboration | Kaixin Wang, National University of Singapore, kaixin96.wang@gmail.com; Kuangqi Zhou, National University of Singapore, kqzhou525@gmail.com; Bingyi Kang, Sea AI Lab, bingykang@gmail.com; Jiashi Feng, ByteDance, jshfeng@gmail.com; Shuicheng Yan, Sea AI Lab, yansc@sea.com
Pseudocode | No | The paper describes algorithms like ICM, RIDE, RND, and BeBold but does not provide any pseudocode or algorithm blocks for its own methodology. (An illustrative RND-style novelty sketch is given after the table.)
Open Source Code | Yes | To reproduce the results, we also include the source code in the supplementary material.
Open Datasets | Yes | Environments: Following previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Zha et al., 2021; Flet-Berliac et al., 2021) on exploration in procedurally generated gridworld environments, we use the MiniGrid benchmark (Chevalier-Boisvert et al., 2018), which runs fast and hence is suitable for large-scale experiments. (A minimal environment-creation sketch appears after the table.)
Dataset Splits | No | The paper states 'we use training curves for performance comparison' and 'The training curves are averaged over 5 runs', but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | Yes | All of our experiments can be run on a single machine with 8 CPUs and a Titan X GPU.
Software Dependencies | No | The paper mentions software like PPO and Adam, and the use of TorchBeast (a PyTorch platform), but it does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.x' or 'PPO version X.Y').
Experiment Setup | Yes | The searched values of β as well as other hyperparameters are summarized in Appx. A.5. Table 1: Hyperparameters: number of parallel environments 128; number of timesteps per rollout 128; PPO clip range 0.2; discount factor γ 0.99; GAE λ 0.95; number of epochs 4; number of minibatches per epoch 8; entropy bonus coefficient 0.01; value loss coefficient 0.5; advantage normalization yes; gradient clipping (ℓ2 norm) 0.5; learning rate 5 × 10^-4. (These values are collected into a configuration sketch after the table.)
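The Research Type row above describes disentangling lifelong and episodic intrinsic rewards. The snippet below is a minimal sketch of such a combination, assuming the multiplicative form used by methods like RIDE and BeBold; the class and function names are illustrative and are not taken from the paper's code.

```python
# Minimal sketch of combining lifelong and episodic intrinsic rewards.
# Names are illustrative; the lifelong term could be, e.g., an RND error,
# and the episodic term a count-based bonus reset at every episode.
from collections import defaultdict

class EpisodicCountBonus:
    """Episodic bonus 1 / sqrt(N_ep(s)), where N_ep is reset each episode."""
    def __init__(self):
        self.counts = defaultdict(int)

    def reset(self):                 # call at the start of every episode
        self.counts.clear()

    def __call__(self, state_key):
        self.counts[state_key] += 1
        return self.counts[state_key] ** -0.5

def intrinsic_reward(lifelong_bonus, episodic_bonus, use_lifelong=True):
    """Multiplicative combination r_int = r_lifelong * r_episodic.

    Setting use_lifelong=False drops the lifelong factor, giving the
    episodic-only reward that the paper reports as sufficient.
    """
    return (lifelong_bonus if use_lifelong else 1.0) * episodic_bonus
```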
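As noted in the Pseudocode row, the paper discusses methods such as RND without reproducing their pseudocode. For orientation only, here is a hedged PyTorch sketch of an RND-style lifelong novelty signal (random network distillation); the architecture and feature size are placeholders, not the paper's configuration.

```python
# Hedged sketch of an RND-style lifelong novelty bonus: the prediction error
# between a fixed random target network and a trained predictor network
# serves as the intrinsic signal. Network sizes are placeholders.
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))
        self.target = mlp()      # fixed, randomly initialized
        self.predictor = mlp()   # trained to match the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-state prediction error; also usable as the predictor's loss.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
```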
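The MiniGrid benchmark referenced in the Open Datasets row is publicly available. The snippet below is a sketch of instantiating a procedurally generated level, assuming the current `minigrid` package with Gymnasium (the older gym-minigrid API differs slightly); the environment ID is only an example and may not match the paper's task set.

```python
# Sketch: creating a procedurally generated MiniGrid task.
# Assumes `pip install minigrid gymnasium`; the environment ID is an example.
import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the MiniGrid environments)

env = gym.make("MiniGrid-MultiRoom-N6-v0")
obs, info = env.reset(seed=0)          # each reset samples a new random layout
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```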
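Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names below are our own shorthand; only the values are taken from the paper's Table 1.

```python
# Illustrative PPO configuration mirroring the reported hyperparameters.
# Key names are placeholders; values come from the paper's Table 1.
ppo_config = {
    "num_envs": 128,               # number of parallel environments
    "rollout_length": 128,         # timesteps per rollout
    "clip_range": 0.2,             # PPO clip range
    "gamma": 0.99,                 # discount factor
    "gae_lambda": 0.95,            # GAE lambda
    "num_epochs": 4,               # optimization epochs per rollout
    "num_minibatches": 8,          # minibatches per epoch
    "entropy_coef": 0.01,          # entropy bonus coefficient
    "value_loss_coef": 0.5,        # value loss coefficient
    "normalize_advantage": True,   # advantage normalization
    "max_grad_norm": 0.5,          # gradient clipping (L2 norm)
    "learning_rate": 5e-4,         # learning rate
}
```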