Compress and Control
Authors: Joel Veness, Marc Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins
AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also study the behavior of this technique when applied to various Atari 2600 video games, where the use of suboptimal modeling techniques is unavoidable. We consider three fundamentally different models, all too limited to perfectly model the dynamics of the system. Remarkably, we find that our technique provides sufficiently accurate value estimates for effective on-policy control. |
| Researcher Affiliation | Collaboration | Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins. Google DeepMind, Australian National University. {veness,bellemare,alschua,gdesjardins}@google.com; marcus.hutter@anu.edu.au |
| Pseudocode | Yes | Algorithm 1 CNC POLICY EVALUATION |
| Open Source Code | No | The paper does not contain any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Our first experiment involves a simplified version of the game of Blackjack (Sutton and Barto 1998, Section 5.1). We evaluated CNC using ALE, the Arcade Learning Environment (Bellemare et al. 2013), a reinforcement learning interface to the Atari 2600 video game platform. |
| Dataset Splits | No | The paper describes hyperparameter optimization but does not specify explicit training/validation/test dataset splits; such splits are not typical in reinforcement learning, where agents interact directly with the environment. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or memory) used to run the experiments, only mentioning the use of the Stella Atari 2600 emulator for the environment. |
| Software Dependencies | No | The paper mentions several software components, estimators, and algorithms, such as ALE, the SAD estimator, logistic regression, ADAGRAD, Lempel-Ziv, and SkipCTS, but it does not give concrete version numbers for any of them, which would be needed for a reproducible setup. |
| Experiment Setup | Yes | The exploration rate ϵ was initialized to 1.0, then decayed linearly to 0.02 over the course of 200,000 time steps. The horizon was set to m = 80 steps, corresponding to roughly 5 seconds of play. The agents were evaluated over 10 trials, each lasting 2 million steps. The hyperparameters (including learning rate, choice of context, etc.) were optimized via the random sampling technique of Bergstra and Bengio (2012). |
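The Pseudocode row cites Algorithm 1 (CNC Policy Evaluation) without reproducing it. As a hedged illustration of the compress-and-control idea described in the abstract quoted above (value estimates obtained from imperfect density/compression models), the Python sketch below discretizes the finite-horizon return into bins, fits one density model per bin over visited states, and recovers a state's value by Bayes' rule. The class and function names (`CountDensityModel`, `cnc_policy_evaluation`, `bin_values`, `assign_bin`) are illustrative assumptions, not the paper's notation, and the toy count-based model stands in for the SAD, logistic regression, and SkipCTS models the paper actually uses.

```python
# Hedged sketch in the compress-and-control spirit: discretize the m-step
# return into bins, fit one density model per bin over observed states,
# and recover V(s) by Bayes' rule. Names are illustrative, not the paper's.

from collections import Counter


class CountDensityModel:
    """Toy stand-in for a compressor/density model over discrete states."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, state):
        self.counts[state] += 1
        self.total += 1

    def prob(self, state):
        # Laplace-smoothed probability estimate.
        return (self.counts[state] + 1) / (self.total + len(self.counts) + 1)


def cnc_policy_evaluation(trajectories, bin_values, assign_bin):
    """trajectories: list of (state, m-step return) pairs gathered on-policy.
    bin_values: representative return value for each bin.
    assign_bin: maps an observed return to a bin index."""
    n_bins = len(bin_values)
    models = [CountDensityModel() for _ in range(n_bins)]
    bin_counts = [0] * n_bins

    # Train one density model per return bin.
    for state, ret in trajectories:
        b = assign_bin(ret)
        models[b].update(state)
        bin_counts[b] += 1

    total = sum(bin_counts)
    priors = [c / total for c in bin_counts]

    def value(state):
        # Bayes' rule: P(bin | state) is proportional to rho_bin(state) * P(bin).
        posteriors = [models[b].prob(state) * priors[b] for b in range(n_bins)]
        z = sum(posteriors)
        return sum(v * p / z for v, p in zip(bin_values, posteriors))

    return value
```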
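The Experiment Setup row gives concrete numbers for the ALE runs. Below is a minimal sketch of that schedule, assuming a standard linear ε decay and plain random search over hyperparameters as in Bergstra and Bengio (2012); the search ranges are placeholders, since the table does not report them.

```python
import random

# Values taken from the reported setup (Experiment Setup row above).
EPS_START, EPS_END = 1.0, 0.02
EPS_DECAY_STEPS = 200_000
HORIZON_M = 80             # roughly 5 seconds of Atari play
STEPS_PER_TRIAL = 2_000_000
NUM_TRIALS = 10


def epsilon(step):
    """Linear decay of the exploration rate from 1.0 to 0.02 over 200,000 steps."""
    frac = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def sample_hyperparameters():
    """Random-search sampling in the spirit of Bergstra and Bengio (2012).
    The ranges below are illustrative placeholders, not the paper's."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "context_length": random.choice([2, 4, 8, 16]),
    }
```

Each configuration would then be evaluated over 10 independent trials of 2 million steps each, matching the evaluation protocol quoted in the table.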