Revisiting Intrinsic Reward for Exploration in Procedurally Generated Environments

Authors: Kaixin Wang, Kuangqi Zhou, Bingyi Kang, Jiashi Feng, Shuicheng Yan

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider lifelong and episodic intrinsic rewards used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that only using episodic intrinsic rewards can match or surpass prior state-of-the-art methods. (A hedged sketch of such a lifelong-episodic combination appears after the table.)
Researcher Affiliation | Collaboration | Kaixin Wang, National University of Singapore, kaixin96.wang@gmail.com; Kuangqi Zhou, National University of Singapore, kqzhou525@gmail.com; Bingyi Kang, Sea AI Lab, bingykang@gmail.com; Jiashi Feng, ByteDance, jshfeng@gmail.com; Shuicheng Yan, Sea AI Lab, yansc@sea.com
Pseudocode | No | The paper describes algorithms like ICM, RIDE, RND, and BeBold but does not provide any pseudocode or algorithm blocks for its own methodology. (An illustrative RND-style novelty sketch is given after the table.)
Open Source Code | Yes | To reproduce the results, we also include the source code in the supplementary material.
Open Datasets | Yes | Environments: Following previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Zha et al., 2021; Flet-Berliac et al., 2021) on exploration in procedurally generated gridworld environments, we use the MiniGrid benchmark (Chevalier-Boisvert et al., 2018), which runs fast and hence is suitable for large-scale experiments. (A minimal environment-creation sketch appears after the table.)
Dataset Splits | No | The paper states 'we use training curves for performance comparison' and 'The training curves are averaged over 5 runs', but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing.
Hardware Specification | Yes | All of our experiments can be run on a single machine with 8 CPUs and a Titan X GPU.
Software Dependencies | No | The paper mentions software like PPO and Adam, and the use of TorchBeast (a PyTorch platform), but it does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.x' or 'PPO version X.Y').
Experiment Setup | Yes | The searched values of β as well as other hyperparameters are summarized in Appx. A.5. Table 1: Hyperparameters: number of parallel environments 128; number of timesteps per rollout 128; PPO clip range 0.2; discount factor γ 0.99; GAE λ 0.95; number of epochs 4; number of minibatches per epoch 8; entropy bonus coefficient 0.01; value loss coefficient 0.5; advantage normalization yes; gradient clipping (ℓ2 norm) 0.5; learning rate 5 × 10^-4. (These values are collected into a configuration sketch after the table.)
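The Research Type row above describes disentangling lifelong and episodic intrinsic rewards. The snippet below is a minimal sketch of such a combination, assuming the multiplicative form used by methods like RIDE and BeBold; the class and function names are illustrative and are not taken from the paper's code.

```python
# Minimal sketch of combining lifelong and episodic intrinsic rewards.
# Names are illustrative; the lifelong term could be, e.g., an RND error,
# and the episodic term a count-based bonus reset at every episode.
from collections import defaultdict

class EpisodicCountBonus:
    """Episodic bonus 1 / sqrt(N_ep(s)), where N_ep is reset each episode."""
    def __init__(self):
        self.counts = defaultdict(int)

    def reset(self):                 # call at the start of every episode
        self.counts.clear()

    def __call__(self, state_key):
        self.counts[state_key] += 1
        return self.counts[state_key] ** -0.5

def intrinsic_reward(lifelong_bonus, episodic_bonus, use_lifelong=True):
    """Multiplicative combination r_int = r_lifelong * r_episodic.

    Setting use_lifelong=False drops the lifelong factor, giving the
    episodic-only reward that the paper reports as sufficient.
    """
    return (lifelong_bonus if use_lifelong else 1.0) * episodic_bonus
```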
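As noted in the Pseudocode row, the paper discusses methods such as RND without reproducing their pseudocode. For orientation only, here is a hedged PyTorch sketch of an RND-style lifelong novelty signal (random network distillation); the architecture and feature size are placeholders, not the paper's configuration.

```python
# Hedged sketch of an RND-style lifelong novelty bonus: the prediction error
# between a fixed random target network and a trained predictor network
# serves as the intrinsic signal. Network sizes are placeholders.
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))
        self.target = mlp()      # fixed, randomly initialized
        self.predictor = mlp()   # trained to match the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-state prediction error; also usable as the predictor's loss.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
```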
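The MiniGrid benchmark referenced in the Open Datasets row is publicly available. The snippet below is a sketch of instantiating a procedurally generated level, assuming the current `minigrid` package with Gymnasium (the older gym-minigrid API differs slightly); the environment ID is only an example and may not match the paper's task set.

```python
# Sketch: creating a procedurally generated MiniGrid task.
# Assumes `pip install minigrid gymnasium`; the environment ID is an example.
import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the MiniGrid environments)

env = gym.make("MiniGrid-MultiRoom-N6-v0")
obs, info = env.reset(seed=0)          # each reset samples a new random layout
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```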
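Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names below are our own shorthand; only the values are taken from the paper's Table 1.

```python
# Illustrative PPO configuration mirroring the reported hyperparameters.
# Key names are placeholders; values come from the paper's Table 1.
ppo_config = {
    "num_envs": 128,               # number of parallel environments
    "rollout_length": 128,         # timesteps per rollout
    "clip_range": 0.2,             # PPO clip range
    "gamma": 0.99,                 # discount factor
    "gae_lambda": 0.95,            # GAE lambda
    "num_epochs": 4,               # optimization epochs per rollout
    "num_minibatches": 8,          # minibatches per epoch
    "entropy_coef": 0.01,          # entropy bonus coefficient
    "value_loss_coef": 0.5,        # value loss coefficient
    "normalize_advantage": True,   # advantage normalization
    "max_grad_norm": 0.5,          # gradient clipping (L2 norm)
    "learning_rate": 5e-4,         # learning rate
}
```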