Compress and Control

Authors: Joel Veness, Marc Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins

AAAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also study the behavior of this technique when applied to various Atari 2600 video games, where the use of suboptimal modeling techniques is unavoidable. We consider three fundamentally different models, all too limited to perfectly model the dynamics of the system. Remarkably, we find that our technique provides sufficiently accurate value estimates for effective on-policy control.
Researcher Affiliation | Collaboration | Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins; Google DeepMind, Australian National University; {veness,bellemare,alschua,gdesjardins}@google.com, marcus.hutter@anu.edu.au
Pseudocode | Yes | Algorithm 1 CNC POLICY EVALUATION (a hedged sketch of this style of procedure appears below the table).
Open Source Code | No | The paper does not contain any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Our first experiment involves a simplified version of the game of Blackjack (Sutton and Barto 1998, Section 5.1). We evaluated CNC using ALE, the Arcade Learning Environment (Bellemare et al. 2013), a reinforcement learning interface to the Atari 2600 video game platform.
Dataset Splits | No | The paper describes hyperparameter optimization but does not specify explicit training/validation/test splits; such splits are in any case not typical for reinforcement learning settings in which agents interact directly with the environment.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or memory) used to run the experiments, only mentioning the use of the Stella Atari 2600 emulator for the environment.
Software Dependencies | No | The paper mentions several software components, estimators, and algorithms, such as ALE, the SAD estimator, logistic regression, ADAGRAD, LEMPEL-ZIV, and SKIPCTS, but it does not specify version numbers for any of them, so the software setup cannot be reproduced exactly (a generic ADAGRAD logistic-regression sketch appears below the table).
Experiment Setup | Yes | The exploration rate ϵ was initialized to 1.0, then decayed linearly to 0.02 over the course of 200,000 time steps. The horizon was set to m = 80 steps, corresponding to roughly 5 seconds of play. The agents were evaluated over 10 trials, each lasting 2 million steps. The hyperparameters (including learning rate, choice of context, etc.) were optimized via the random sampling technique of Bergstra and Bengio (2012). (The exploration schedule is written out in the last sketch below the table.)
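
The Pseudocode row cites Algorithm 1 (CNC policy evaluation). As a rough illustration of the compress-and-control idea, converting a density or compression model into a value estimate by applying Bayes rule over discretized m-step returns, with one state model per return bin, here is a minimal sketch. The `BernoulliModel` class, the `observe`/`value` interface, and the nearest-bin assignment are illustrative assumptions, not a reproduction of the paper's Algorithm 1.

```python
import math


class BernoulliModel:
    """Toy density model over binary feature vectors: an illustrative stand-in
    for the paper's model classes (logistic regression, LEMPEL-ZIV, SKIPCTS)."""

    def __init__(self, num_features):
        self.ones = [1.0] * num_features   # Laplace-smoothed counts of feature == 1
        self.total = 2.0                   # pseudo-counts (one per outcome)

    def update(self, x):
        for i, xi in enumerate(x):
            self.ones[i] += xi
        self.total += 1.0

    def log_prob(self, x):
        lp = 0.0
        for i, xi in enumerate(x):
            p1 = self.ones[i] / self.total
            lp += math.log(p1 if xi else 1.0 - p1)
        return lp


class CNCValueEstimator:
    """Sketch of value estimation via Bayes rule over discretized returns."""

    def __init__(self, return_bins, num_features):
        self.return_bins = list(return_bins)                  # representative return per bin
        self.models = [BernoulliModel(num_features) for _ in return_bins]
        self.counts = [1.0] * len(return_bins)                # smoothed prior over bins

    def observe(self, state, m_step_return):
        # Assign the observed truncated return to its nearest bin and train
        # that bin's state model on the visited state.
        k = min(range(len(self.return_bins)),
                key=lambda i: abs(self.return_bins[i] - m_step_return))
        self.models[k].update(state)
        self.counts[k] += 1.0

    def value(self, state):
        # Posterior over bins: P(bin | state) is proportional to P(state | bin) * P(bin).
        total = sum(self.counts)
        log_post = [m.log_prob(state) + math.log(c / total)
                    for m, c in zip(self.models, self.counts)]
        mx = max(log_post)                                    # log-sum-exp normalization
        w = [math.exp(lp - mx) for lp in log_post]
        z = sum(w)
        # Value estimate: expected binned return under the posterior.
        return sum(g * wi for g, wi in zip(self.return_bins, w)) / z
```

Any model exposing `update` and `log_prob` could be substituted for `BernoulliModel`, which is the point of the paper's construction: the value estimate inherits whatever accuracy the underlying density model provides.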
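The Software Dependencies row names logistic regression trained with ADAGRAD among the modeling components. The snippet below is a generic sketch of a binary logistic model with per-coordinate ADAGRAD step sizes; it is not the paper's implementation, and the learning rate and smoothing constant are placeholder assumptions.

```python
import math


class AdaGradLogistic:
    """Binary logistic regression with per-coordinate ADAGRAD updates
    (a generic sketch, not the paper's model)."""

    def __init__(self, num_features, learning_rate=0.1, eps=1e-8):
        self.w = [0.0] * num_features
        self.g2 = [0.0] * num_features        # accumulated squared gradients
        self.lr = learning_rate
        self.eps = eps

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One ADAGRAD step on the log loss for a single (features, label) pair."""
        err = self.predict(x) - y             # gradient of the log loss w.r.t. the logit
        for i, xi in enumerate(x):
            g = err * xi
            self.g2[i] += g * g
            self.w[i] -= self.lr * g / (math.sqrt(self.g2[i]) + self.eps)
```

The sketch covers only the optimizer and loss; how the paper wires such a predictor into its density models over Atari observations is not reproduced here.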
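The Experiment Setup row specifies the exploration schedule completely, so it can be written down directly. The constant names below are illustrative only.

```python
# Exploration schedule as described in the Experiment Setup row: epsilon starts
# at 1.0, decays linearly to 0.02 over the first 200,000 steps, then stays at 0.02.
EPS_START = 1.0
EPS_END = 0.02
DECAY_STEPS = 200_000
HORIZON_M = 80              # evaluation horizon, roughly 5 seconds of play
TRIALS = 10
STEPS_PER_TRIAL = 2_000_000


def epsilon(t: int) -> float:
    """Linearly interpolated exploration rate at time step t."""
    frac = min(t, DECAY_STEPS) / DECAY_STEPS
    return EPS_START + frac * (EPS_END - EPS_START)


assert abs(epsilon(0) - 1.0) < 1e-9
assert abs(epsilon(200_000) - 0.02) < 1e-9
assert abs(epsilon(1_000_000) - 0.02) < 1e-9
```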