Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Revisiting Intrinsic Reward for Exploration in Procedurally Generated Environments
Authors: Kaixin Wang, Kuangqi Zhou, Bingyi Kang, Jiashi Feng, Shuicheng Yan
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider lifelong and episodic intrinsic rewards used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that only using episodic intrinsic rewards can match or surpass prior state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Kaixin Wang (National University of Singapore), Kuangqi Zhou (National University of Singapore), Bingyi Kang (Sea AI Lab), Jiashi Feng (ByteDance), Shuicheng Yan (Sea AI Lab) |
| Pseudocode | No | The paper describes algorithms like ICM, RIDE, RND, and BeBold but does not provide any pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | To reproduce the results, we also include the source code in the supplementary material. |
| Open Datasets | Yes | Environments: Following previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Zha et al., 2021; Flet-Berliac et al., 2021) on exploration in procedurally generated gridworld environments, we use the MiniGrid benchmark (Chevalier-Boisvert et al., 2018), which runs fast and hence is suitable for large-scale experiments. |
| Dataset Splits | No | The paper states 'we use training curves for performance comparison' and 'The training curves are averaged over 5 runs', but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing. |
| Hardware Specification | Yes | All of our experiments can be run on a single machine with 8 CPUs and a Titan X GPU. |
| Software Dependencies | No | The paper mentions software like PPO and Adam, and the use of TorchBeast (a PyTorch platform), but it does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.x'). |
| Experiment Setup | Yes | The searched values of β as well as other hyperparameters are summarized in Appx. A.5. Table 1 (Hyperparameters): number of parallel environments = 128; number of timesteps per rollout = 128; PPO clip range = 0.2; discount factor γ = 0.99; GAE λ = 0.95; number of epochs = 4; number of minibatches per epoch = 8; entropy bonus coefficient = 0.01; value loss coefficient = 0.5; advantage normalization = yes; gradient clipping (ℓ2 norm) = 0.5; learning rate = 5 × 10⁻⁴. |
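For convenience, the PPO hyperparameters quoted from Table 1 can be gathered into a plain configuration sketch. This is an illustrative assumption, not the paper's actual code; all key names are hypothetical, while the values are taken from the quoted table.

```python
# Hypothetical config dict mirroring the hyperparameters quoted from Table 1.
# Key names are illustrative; values come from the reported experiment setup.
PPO_HYPERPARAMS = {
    "num_parallel_envs": 128,
    "rollout_length": 128,        # number of timesteps per rollout
    "clip_range": 0.2,            # PPO clipping epsilon
    "gamma": 0.99,                # discount factor
    "gae_lambda": 0.95,
    "num_epochs": 4,
    "minibatches_per_epoch": 8,
    "entropy_coef": 0.01,
    "value_loss_coef": 0.5,
    "normalize_advantage": True,
    "max_grad_norm": 0.5,         # gradient clipping (l2 norm)
    "learning_rate": 5e-4,
}

# Derived quantity implied by these settings: per-update batch and minibatch sizes.
batch_size = PPO_HYPERPARAMS["num_parallel_envs"] * PPO_HYPERPARAMS["rollout_length"]
minibatch_size = batch_size // PPO_HYPERPARAMS["minibatches_per_epoch"]
print(batch_size, minibatch_size)  # 16384 2048
```

Such a dict can be passed directly to a PPO training loop or serialized for experiment tracking; the derived minibatch size (128 × 128 / 8 = 2048) follows from the quoted values.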