Regret Minimization for Partially Observable Deep Reinforcement Learning
Authors: Peter Jin, Kurt Keutzer, Sergey Levine
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that this new algorithm can substantially outperform strong baseline methods on several partially observed reinforcement learning tasks: learning first-person 3D navigation in Doom and Minecraft, and acting in the presence of partially observed objects in Doom and Pong. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Correspondence to: Peter Jin <phj@eecs.berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 Advantage-based regret minimization (ARM). |
| Open Source Code | No | No explicit statement or link providing access to the open-source code for the described methodology was found. |
| Open Datasets | Yes | Doom and Minecraft (Kempka et al., 2016; Johnson et al., 2016), Atari Pong via the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes curriculum learning schedules and simulator steps for training and evaluation within dynamic environments (games) rather than providing specific train/validation/test dataset splits (percentages or counts) of a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions environments such as ViZDoom and the Arcade Learning Environment, and the Adam optimizer, but does not provide specific software dependencies with version numbers (e.g., Python, TensorFlow, or PyTorch versions) needed for replication. |
| Experiment Setup | Yes | Our hyperparameters are listed in Section A1 of the Supplementary Material. ... despite both methods using biased n-step returns with the same value of n (n = 5). ... By default, all of our networks receive as input a frame history of length 4. ... aggressive fast schedule of only 62500 simulator steps between levels. TRPO required a slow schedule of 93750 simulator steps between levels. |
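
The Pseudocode row above points to Algorithm 1, advantage-based regret minimization (ARM). For orientation only, here is a minimal sketch of the regret-matching step that ARM-style methods use to turn cumulative clipped advantage estimates into action probabilities; the function name, array interface, and epsilon threshold are illustrative assumptions and not the paper's implementation, which estimates these quantities with deep networks.

```python
import numpy as np

def regret_matching_policy(clipped_advantages, eps=1e-8):
    """Map per-action cumulative clipped advantage estimates for one state to
    action probabilities via regret matching (as in CFR+): each action is
    chosen in proportion to the positive part of its cumulative clipped
    advantage, with a uniform fallback when no action is positive.

    `clipped_advantages` is assumed to be a 1-D float array of A+_t(s, a)
    values for a single state s; the shape and naming are assumptions made
    for this sketch.
    """
    positive = np.maximum(clipped_advantages, 0.0)
    total = positive.sum()
    if total <= eps:
        # No action has positive cumulative advantage: act uniformly at random.
        return np.full_like(clipped_advantages, 1.0 / len(clipped_advantages))
    return positive / total

# Illustrative usage with made-up advantage estimates for a 3-action state.
print(regret_matching_policy(np.array([0.2, -0.5, 0.3])))  # ~[0.4, 0.0, 0.6]
```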