DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

Authors: Shanchuan Wan, Yujin Tang, Yingtao Tian, Tomoyuki Kaneko

IJCAI 2023

Reproducibility assessment: each entry below lists the reproducibility variable, the result, and the LLM response.
Research Type: Experimental
LLM Response: "Extensive experiments on both standard and advanced exploration tasks in MiniGrid show that DEIR quickly learns a better policy than the baselines. Our evaluations on ProcGen demonstrate both the generalization capability and the general applicability of our intrinsic reward."
Researcher Affiliation: Collaboration
LLM Response: Shanchuan Wan (The University of Tokyo), Yujin Tang (Google Research, Brain Team), Yingtao Tian (Google Research, Brain Team), and Tomoyuki Kaneko (The University of Tokyo); contact: swan@game.c.u-tokyo.ac.jp, {yujintang, alantian}@google.com, kaneko@graco.c.u-tokyo.ac.jp.
Pseudocode: No
LLM Response: The paper describes its proposed method and architecture in text and diagrams (Figure 2), but does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block.
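Since the paper provides no pseudocode, the sketch below illustrates, in generic form, what an episodic intrinsic-reward loop of this family looks like. It is not DEIR's algorithm: the paper's discriminative model and exact reward are omitted, and every name and formula here is illustrative only.

    import numpy as np

    def episodic_novelty_bonus(embedding, episodic_memory, eps=1e-6):
        """Generic episodic bonus: larger when the current observation
        embedding is far from everything seen earlier in the episode.
        (Illustrative only; DEIR instead shapes novelty with a learned
        discriminative model, which is not reproduced here.)"""
        if not episodic_memory:
            return 1.0
        dists = [np.linalg.norm(embedding - m) for m in episodic_memory]
        return float(min(dists) / (np.mean(dists) + eps))

    # Hypothetical usage inside a rollout, with `encoder` and `env` assumed:
    # memory = []
    # obs = env.reset()
    # emb = encoder(obs)
    # reward = extrinsic_reward + beta * episodic_novelty_bonus(emb, memory)
    # memory.append(emb)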
Open Source Code: Yes
LLM Response: "Our source code is available at https://github.com/swan-utokyo/deir."
Open Datasets: Yes
LLM Response: "(1) MiniGrid [Chevalier-Boisvert et al., 2018], which consists of 20 grid-world exploration games... (2) ProcGen [Cobbe et al., 2019; Cobbe et al., 2020], which consists of 16 games with 64×64×3 RGB image inputs..."
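Both benchmarks are standard Gym environments. A minimal sketch of instantiating them, assuming the gym-minigrid and procgen packages are installed; the environment IDs are examples, not necessarily the specific tasks used in the paper:

    import gym
    import gym_minigrid  # importing the package registers MiniGrid-* env IDs

    # A MiniGrid exploration task (example ID).
    minigrid_env = gym.make("MiniGrid-DoorKey-8x8-v0")

    # A ProcGen game; observations are 64x64x3 RGB images.
    procgen_env = gym.make("procgen:procgen-coinrun-v0", distribution_mode="easy")

    print(minigrid_env.observation_space)
    print(procgen_env.observation_space)  # Box(0, 255, (64, 64, 3), uint8)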
Dataset Splits: No
LLM Response: The paper relies on standard benchmark environments (MiniGrid, ProcGen) that carry their own implicit training/testing protocols, but it does not explicitly state the dataset splits (e.g., percentages or cross-validation procedures) needed to reproduce the data partitioning.
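Although the paper does not spell out splits, ProcGen's usual generalization protocol is configured through level ranges. A sketch under that assumption; the level counts are the benchmark's common defaults, not values taken from the paper:

    import gym

    # Train on a fixed, finite set of procedurally generated levels...
    train_env = gym.make(
        "procgen:procgen-coinrun-v0",
        num_levels=200, start_level=0, distribution_mode="easy",
    )

    # ...and evaluate on the full level distribution (num_levels=0 means
    # unlimited), so test levels are almost surely unseen during training.
    test_env = gym.make(
        "procgen:procgen-coinrun-v0",
        num_levels=0, start_level=0, distribution_mode="easy",
    )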
Hardware Specification: No
LLM Response: The paper does not provide specific hardware details such as exact GPU/CPU models, memory specifications, or cloud instance types used for running the experiments.
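As a reproducibility aid (not something the paper provides), hardware details can be captured programmatically at run time. A minimal sketch assuming a PyTorch-based setup:

    import os
    import platform
    import torch

    print("OS:", platform.platform())
    print("CPU:", platform.processor(), "| logical cores:", os.cpu_count())
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))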
Software Dependencies: No
LLM Response: "Our implementations are based on Stable Baselines 3 [Raffin et al., 2021] and the official code of existing methods (if available)." The paper names software tools such as Stable Baselines 3 but does not specify their version numbers or the other software dependencies (with versions) required for reproducibility.
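The missing dependency versions could be recorded with a small snippet. A sketch assuming the stack implied by the paper (Stable Baselines 3, Gym, PyTorch, NumPy):

    import gym
    import numpy as np
    import stable_baselines3 as sb3
    import torch

    for name, module in [("stable-baselines3", sb3), ("gym", gym),
                         ("torch", torch), ("numpy", np)]:
        print(f"{name}=={module.__version__}")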
Experiment Setup: No
LLM Response: "We performed hyperparameter searches for every method involved in our experiments to ensure they have the best performance possible." and "We also performed sensitivity analyses on two key hyperparameters of our method, namely, the maximum episode length and the maximum observation queue size." Although hyperparameter tuning is mentioned, the paper does not state the specific values of these hyperparameters (e.g., learning rate, batch size, number of epochs) or other system-level training settings in the main text.
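Because the main text omits concrete hyperparameter values, anyone re-running the experiments has to fix them explicitly. A hedged sketch of a Stable Baselines 3 PPO configuration, where the environment choice and every value are illustrative placeholders rather than settings reported in the paper:

    import gym
    from stable_baselines3 import PPO

    # Environment and hyperparameters below are placeholders, not the paper's.
    env = gym.make("procgen:procgen-coinrun-v0",
                   num_levels=200, distribution_mode="easy")

    model = PPO(
        policy="CnnPolicy",
        env=env,
        learning_rate=3e-4,   # placeholder value
        n_steps=2048,         # rollout length per environment
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        ent_coef=0.01,
        verbose=1,
    )
    model.learn(total_timesteps=1_000_000)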