Regret Minimization for Partially Observable Deep Reinforcement Learning
Authors: Peter Jin, Kurt Keutzer, Sergey Levine
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that this new algorithm can substantially outperform strong baseline methods on several partially observed reinforcement learning tasks: learning first-person 3D navigation in Doom and Minecraft, and acting in the presence of partially observed objects in Doom and Pong. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Correspondence to: Peter Jin <phj@eecs.berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 Advantage-based regret minimization (ARM). |
| Open Source Code | No | No explicit statement or link providing access to the open-source code for the described methodology was found. |
| Open Datasets | Yes | Doom and Minecraft (Kempka et al., 2016; Johnson et al., 2016), Atari Pong via the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes curriculum learning schedules and simulator steps for training and evaluation within dynamic environments (games) rather than providing specific train/validation/test dataset splits (percentages or counts) of a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions environments such as ViZDoom and the Arcade Learning Environment, and the Adam optimizer, but does not provide specific software dependencies with version numbers (e.g., Python, TensorFlow, or PyTorch versions) needed for replication. |
| Experiment Setup | Yes | Our hyperparameters are listed in Section A1 of the Supplementary Material. ... despite both methods using biased n-step returns with the same value of n (n = 5). ... By default, all of our networks receive as input a frame history of length 4. ... aggressive fast schedule of only 62500 simulator steps between levels. TRPO required a slow schedule of 93750 simulator steps between levels. |
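
The Pseudocode row above points to Algorithm 1, advantage-based regret minimization (ARM). For orientation only, here is a minimal sketch of the regret-matching step that ARM-style methods use to turn cumulative clipped advantage estimates into action probabilities; the function name, array interface, and epsilon threshold are illustrative assumptions and not the paper's implementation, which estimates these quantities with deep networks.

```python
import numpy as np

def regret_matching_policy(clipped_advantages, eps=1e-8):
    """Map per-action cumulative clipped advantage estimates for one state to
    action probabilities via regret matching (as in CFR+): each action is
    chosen in proportion to the positive part of its cumulative clipped
    advantage, with a uniform fallback when no action is positive.

    `clipped_advantages` is assumed to be a 1-D float array of A+_t(s, a)
    values for a single state s; the shape and naming are assumptions made
    for this sketch.
    """
    positive = np.maximum(clipped_advantages, 0.0)
    total = positive.sum()
    if total <= eps:
        # No action has positive cumulative advantage: act uniformly at random.
        return np.full_like(clipped_advantages, 1.0 / len(clipped_advantages))
    return positive / total

# Illustrative usage with made-up advantage estimates for a 3-action state.
print(regret_matching_policy(np.array([0.2, -0.5, 0.3])))  # ~[0.4, 0.0, 0.6]
```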