DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards
Authors: Shanchuan Wan, Yujin Tang, Yingtao Tian, Tomoyuki Kaneko
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both standard and advanced exploration tasks in MiniGrid show that DEIR quickly learns a better policy than the baselines. Our evaluations on ProcGen demonstrate both the generalization capability and the general applicability of our intrinsic reward. |
| Researcher Affiliation | Collaboration | Shanchuan Wan¹, Yujin Tang², Yingtao Tian², and Tomoyuki Kaneko¹ (¹The University of Tokyo; ²Google Research, Brain Team). swan@game.c.u-tokyo.ac.jp, {yujintang, alantian}@google.com, kaneko@graco.c.u-tokyo.ac.jp |
| Pseudocode | No | The paper describes its proposed method and architecture in text and diagrams (Figure 2), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our source code is available at https://github.com/swan-utokyo/deir. |
| Open Datasets | Yes | (1) MiniGrid [Chevalier-Boisvert et al., 2018], which consists of 20 grid-world exploration games... (2) ProcGen [Cobbe et al., 2019; Cobbe et al., 2020], which consists of 16 games with 64×64×3 RGB image inputs... |
| Dataset Splits | No | The paper refers to standard benchmark environments (MiniGrid, ProcGen), which carry their own implicit training/testing protocols, but it does not explicitly state validation dataset splits (e.g., percentages or cross-validation schemes) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, memory specifications, or cloud instance types used for running the experiments. |
| Software Dependencies | No | Our implementations are based on Stable Baselines 3 [Raffin et al., 2021] and the official code of existing methods (if available). The paper names Stable Baselines 3 but does not specify its version or the other software dependencies (with versions) required for reproducibility (a hedged setup sketch follows this table). |
| Experiment Setup | No | We performed hyperparameter searches for every method involved in our experiments to ensure they have the best performance possible. We also performed sensitivity analyses on two key hyperparameters of our method, namely, the maximum episode length and the maximum observation queue size. While hyperparameter tuning is mentioned, the paper does not explicitly state the chosen hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level training settings in the main text. |
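
For context, the rows above identify MiniGrid environments and Stable Baselines 3 as the core experimental stack. The snippet below is a minimal, hypothetical sketch of that stack only, not the authors' configuration: it assumes the current `minigrid` and `gymnasium` packages with Stable-Baselines3 ≥ 2.0 (the paper pins no versions), and the environment ID, wrapper, and PPO settings are illustrative choices. DEIR's discriminative intrinsic reward itself is provided in the repository linked in the "Open Source Code" row.

```python
# Hypothetical reproduction sketch. Package versions, the environment ID, the
# wrapper, and the PPO settings are assumptions; the paper does not specify them.
import gymnasium as gym
from minigrid.wrappers import FlatObsWrapper  # flattens MiniGrid observations to a vector
from stable_baselines3 import PPO

# A standard MiniGrid exploration task (illustrative choice).
env = FlatObsWrapper(gym.make("MiniGrid-DoorKey-8x8-v0"))

# Plain PPO baseline; DEIR adds its episodic intrinsic reward on top of such an agent
# (see https://github.com/swan-utokyo/deir for the authors' implementation).
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```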