Regret Minimization for Partially Observable Deep Reinforcement Learning

Authors: Peter Jin, Kurt Keutzer, Sergey Levine

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We demonstrate that this new algorithm can substantially outperform strong baseline methods on several partially observed reinforcement learning tasks: learning first-person 3D navigation in Doom and Minecraft, and acting in the presence of partially observed objects in Doom and Pong." |
| Researcher Affiliation | Academia | Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Correspondence to: Peter Jin <phj@eecs.berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1: Advantage-based regret minimization (ARM). (A hedged sketch of the underlying regret-matching update follows the table.) |
| Open Source Code | No | No explicit statement or link providing access to open-source code for the described methodology was found. |
| Open Datasets | Yes | Doom and Minecraft (Kempka et al., 2016; Johnson et al., 2016), and Atari Pong via the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes curriculum learning schedules and simulator-step budgets for training and evaluation in dynamic game environments, rather than train/validation/test splits (percentages or counts) of a static dataset. |
| Hardware Specification | No | The paper does not report hardware details such as GPU/CPU models, processor types, or memory amounts used to run the experiments. |
| Software Dependencies | No | The paper mentions environments such as ViZDoom and the Arcade Learning Environment and the Adam optimizer, but does not list software dependencies with version numbers (e.g., Python, TensorFlow, or PyTorch) needed for replication. |
| Experiment Setup | Yes | "Our hyperparameters are listed in Section A1 of the Supplementary Material."; "despite both methods using biased n-step returns with the same value of n (n = 5)"; "By default, all of our networks receive as input a frame history of length 4."; an "aggressive fast schedule of only 62500 simulator steps between levels," while "TRPO required a slow schedule of 93750 simulator steps between levels." |
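The Pseudocode row points to Algorithm 1 (ARM), which in the paper fits value estimates with deep networks. As a minimal illustration of the regret-matching idea the name refers to, the sketch below turns per-action cumulative clipped advantage estimates into a policy proportional to their positive parts. The tabular setting, function name, and example numbers are assumptions for illustration, not a reproduction of the paper's Algorithm 1.

```python
import numpy as np

def regret_matching_policy(clipped_advantages):
    """Regret-matching step: the next policy is proportional to the
    positive part of the (cumulative, clipped) advantage estimates.

    `clipped_advantages` is a 1-D array with one entry per action;
    in ARM these would come from learned value networks, but here
    they are treated as given numbers (an illustrative assumption).
    """
    positive = np.maximum(clipped_advantages, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # If no action has a positive advantage, fall back to a uniform policy.
    return np.full_like(positive, 1.0 / len(positive))

# Example: action 2 has the largest positive clipped advantage,
# so it receives the largest probability mass.
print(regret_matching_policy(np.array([-0.3, 0.1, 0.4, 0.0])))
```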
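The Experiment Setup row quotes several concrete settings. Purely as a convenience, the sketch below collects them into one configuration mapping; the key names are assumptions, and the authoritative hyperparameter list is Section A1 of the paper's Supplementary Material.

```python
# Settings quoted in the Experiment Setup row above; key names are
# illustrative assumptions, not identifiers from the paper.
EXPERIMENT_SETUP = {
    "n_step_return": 5,                    # biased n-step returns, same n for both methods
    "frame_history_length": 4,             # default frame history fed to the networks
    "curriculum_steps_fast": 62_500,       # simulator steps between levels (fast schedule)
    "curriculum_steps_slow_trpo": 93_750,  # slower schedule required by TRPO
    "optimizer": "Adam",                   # optimizer named in the paper
}
```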