Contingency-Aware Exploration in Reinforcement Learning

Authors: Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, Honglak Lee

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper investigates whether learning contingency-awareness and controllable aspects of an environment can lead to better exploration in reinforcement learning. To investigate this question, we consider an instantiation of this hypothesis evaluated on the Arcade Learning Environment (ALE). In this study, we develop an attentive dynamics model (ADM) that discovers controllable elements of the observations... We demonstrate that combining an actor-critic algorithm with count-based exploration using our representation achieves impressive results on a set of notoriously challenging Atari games due to sparse rewards.
Researcher Affiliation | Collaboration | Jongwook Choi (1), Yijie Guo (1), Marcin Moczulski (2), Junhyuk Oh (1), Neal Wu (2), Mohammad Norouzi (2), Honglak Lee (2,1); 1: University of Michigan, 2: Google Brain. Emails: {jwook,guoyijie}@umich.edu, moczulski@google.com, {junhyuk,nealwu,mnorouzi,honglak}@google.com
Pseudocode | Yes | Algorithm 1: A2C+CoEX (a hedged sketch of the count-based bonus this algorithm combines with A2C is given after the table)
Open Source Code | No | The paper provides a link, "https://coex-rl.github.io/", for "Examples of the learned policy and the contingent regions" and "A demo video of the learnt policy and localization". This link appears to host demonstrations and results, not the source code of the method itself.
Open Datasets | Yes | We evaluate the proposed exploration strategy on several difficult exploration Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al., 2013). We focus on 8 Atari games: FREEWAY, FROSTBITE, HERO, PRIVATE EYE, MONTEZUMA'S REVENGE, QBERT, SEAQUEST, and VENTURE.
Dataset Splits | No | The paper describes training budgets (e.g., "100M steps of training", "500M environment steps") and reports performance metrics, but it does not specify explicit validation dataset splits as percentages or sample counts.
Hardware Specification | No | The paper mentions training with several frameworks (A2C, PPO) and using parallel actors, but does not give specific details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using OpenAI Baselines (Dhariwal et al., 2017) and the RMSProp and Adam optimizers, but it does not provide version numbers for these software components or other key libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | Table 4: Network architecture and hyperparameters; Table 5: The list of hyperparameters used for A2C+CoEX in each game... For the A2C (Mnih et al., 2016) algorithm, we use 16 parallel actors to collect the agent's experience, with 5-step rollout, which yields a minibatch of size 80 for on-policy transitions. We use the last 4 observation frames stacked as input, each of which is resized to 84×84 and converted to grayscale... (see the preprocessing sketch after the table)
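
As referenced in the Pseudocode row, A2C+CoEX augments standard A2C updates with a count-based bonus computed on the contingent-region representation predicted by the ADM. The following is a minimal sketch of that bonus, assuming a discretized (x, y) agent location plus an optional context id; the grid size, the bonus scale `beta`, and the class/function names are illustrative assumptions, not values or code from the paper.

from collections import defaultdict
import math

class ContingentRegionBonus:
    """Count-based exploration bonus over a contingent-region representation
    psi(s) = (grid cell of the controllable agent, optional context id).
    Sketch only: `beta` and the grid size are assumed, not taken from the paper."""

    def __init__(self, beta=0.1, grid=(9, 9), frame_hw=(84, 84)):
        self.counts = defaultdict(int)
        self.beta = beta
        self.grid = grid
        self.frame_hw = frame_hw

    def psi(self, agent_xy, context_id=0):
        # Discretize the predicted agent location into a coarse grid cell.
        gx = min(int(agent_xy[0] / self.frame_hw[1] * self.grid[1]), self.grid[1] - 1)
        gy = min(int(agent_xy[1] / self.frame_hw[0] * self.grid[0]), self.grid[0] - 1)
        return (gx, gy, context_id)

    def bonus(self, agent_xy, context_id=0):
        key = self.psi(agent_xy, context_id)
        self.counts[key] += 1
        # r+ = beta / sqrt(N(psi(s))), added to the clipped environment reward.
        return self.beta / math.sqrt(self.counts[key])

# Usage: the reward fed to the actor-critic update combines the clipped
# extrinsic reward with the exploration bonus.
bonus_fn = ContingentRegionBonus()
r_total = max(-1.0, min(1.0, 1.0)) + bonus_fn.bonus(agent_xy=(40, 60), context_id=3)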
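
The Experiment Setup row specifies 84×84 grayscale frames with the last 4 observations stacked, and 16 parallel actors with 5-step rollouts. Below is a minimal preprocessing sketch under those assumptions; it uses OpenCV for resizing and is not the authors' code.

import numpy as np
import cv2  # any image library would do for grayscale conversion and resizing

def preprocess(frame):
    """Resize a raw RGB Atari frame to 84x84 grayscale, as described in the setup."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

class FrameStack:
    """Keep the last 4 preprocessed observations stacked as the policy input."""
    def __init__(self, k=4):
        self.k = k
        self.frames = []

    def reset(self, frame):
        self.frames = [preprocess(frame)] * self.k
        return np.stack(self.frames, axis=-1)  # shape (84, 84, 4)

    def step(self, frame):
        self.frames = self.frames[1:] + [preprocess(frame)]
        return np.stack(self.frames, axis=-1)

# 16 parallel actors x 5-step rollouts = on-policy minibatches of 80 transitions.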