Revisiting Intrinsic Reward for Exploration in Procedurally Generated Environments
Authors: Kaixin Wang, Kuangqi Zhou, Bingyi Kang, Jiashi Feng, Shuicheng Yan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider lifelong and episodic intrinsic rewards used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that only using episodic intrinsic rewards can match or surpass prior state-of-the-art methods. (See the episodic-bonus sketch below the table.) |
| Researcher Affiliation | Collaboration | Kaixin Wang (National University of Singapore, kaixin96.wang@gmail.com); Kuangqi Zhou (National University of Singapore, kqzhou525@gmail.com); Bingyi Kang (Sea AI Lab, bingykang@gmail.com); Jiashi Feng (ByteDance, jshfeng@gmail.com); Shuicheng Yan (Sea AI Lab, yansc@sea.com) |
| Pseudocode | No | The paper describes algorithms like ICM, RIDE, RND, and BeBold but does not provide any pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | To reproduce the results, we also include the source code in the supplementary material. |
| Open Datasets | Yes | Environments: Following previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Zha et al., 2021; Flet-Berliac et al., 2021) on exploration in procedurally generated gridworld environments, we use the MiniGrid benchmark (Chevalier-Boisvert et al., 2018), which runs fast and hence is suitable for large-scale experiments. (See the environment-setup sketch below the table.) |
| Dataset Splits | No | The paper states 'we use training curves for performance comparison' and that 'the training curves are averaged over 5 runs', but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing; note that for RL on procedurally generated environments, conventional train/validation/test splits do not directly apply. |
| Hardware Specification | Yes | All of our experiments can be run on a single machine with 8 CPUs and a Titan X GPU. |
| Software Dependencies | No | The paper mentions software like PPO and Adam, and the use of TorchBeast (a PyTorch platform), but it does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.x'). |
| Experiment Setup | Yes | The searched values of β as well as other hyperparameters are summarized in Appx. A.5. Table 1 (Hyperparameters): number of parallel environments: 128; number of timesteps per rollout: 128; PPO clip range: 0.2; discount factor γ: 0.99; GAE λ: 0.95; number of epochs: 4; number of minibatches per epoch: 8; entropy bonus coefficient: 0.01; value loss coefficient: 0.5; advantage normalization: yes; gradient clipping (ℓ2 norm): 0.5; learning rate: 5 × 10⁻⁴. (See the configuration sketch below the table.) |
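
To make the lifelong/episodic disentanglement concrete, below is a minimal sketch of an episodic count-based intrinsic bonus, the kind of reward the paper's ablations find sufficient on MiniGrid. The class name, the observation-hashing scheme, and the `beta` scale are illustrative assumptions, not the authors' exact implementation.

```python
from collections import defaultdict

import numpy as np


class EpisodicCountBonus:
    """Minimal episodic count-based intrinsic reward (illustrative sketch).

    Visit counts are reset at every episode boundary, so the bonus reflects
    novelty only within the current episode -- the 'episodic' component of
    the lifelong-episodic combinations ablated in the paper.
    """

    def __init__(self, beta=0.005):
        self.beta = beta  # intrinsic reward scale (searched per task in the paper)
        self.counts = defaultdict(int)

    def reset(self):
        """Call at the start of each episode to clear episodic counts."""
        self.counts.clear()

    def __call__(self, obs):
        """Return the intrinsic bonus for visiting `obs` in this episode."""
        key = obs.tobytes() if isinstance(obs, np.ndarray) else obs
        self.counts[key] += 1
        # 1/sqrt(N_ep) bonus, as used by the episodic term in RIDE-style methods
        return self.beta / np.sqrt(self.counts[key])
```

The agent would then optimize the sum of the extrinsic reward and this bonus; a lifelong-episodic combination would additionally modulate it with a never-reset novelty term (e.g., an RND prediction error).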
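For reference, MiniGrid environments are instantiated through the standard Gym registry. The package and API below follow the current `minigrid` distribution built on Gymnasium; the exact version the authors used is not specified in the paper, so treat this as an assumption.

```python
import gymnasium as gym
import minigrid  # registers the MiniGrid-* environment IDs  # noqa: F401

# MultiRoom is one of the hard-exploration MiniGrid tasks used in this line of work.
env = gym.make("MiniGrid-MultiRoom-N6-v0")
obs, info = env.reset(seed=0)  # each reset samples a new procedurally generated layout
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated
```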
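The hyperparameters from Table 1 map directly onto a PPO configuration. The dataclass below is one hypothetical way of organizing them: the field names are ours, while the values are those reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class PPOConfig:
    """PPO hyperparameters as reported in Table 1 of the paper."""

    num_envs: int = 128             # number of parallel environments
    rollout_length: int = 128       # timesteps per rollout
    clip_range: float = 0.2         # PPO clip range
    gamma: float = 0.99             # discount factor
    gae_lambda: float = 0.95        # GAE lambda
    num_epochs: int = 4             # optimization epochs per rollout
    num_minibatches: int = 8        # minibatches per epoch
    entropy_coef: float = 0.01      # entropy bonus coefficient
    value_loss_coef: float = 0.5    # value loss coefficient
    normalize_advantage: bool = True
    max_grad_norm: float = 0.5      # gradient clipping (l2 norm)
    learning_rate: float = 5e-4     # Adam learning rate
```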