When Waiting Is Not an Option: Learning Options With a Deliberation Cost

Authors: Jean Harb, Pierre-Luc Bacon, Martin Klissarov, Doina Precup

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability. We used Amidar, a game of the Atari 2600 suite, to analyze the option policies and terminations qualitatively.
Researcher Affiliation | Academia | Reasoning and Learning Lab, McGill University {jharb,pbacon,mklissa,dprecup}@cs.mcgill.ca
Pseudocode | Yes | Algorithm 1: Asynchronous Advantage Option Critic
Open Source Code | Yes | The source code is available at https://github.com/jeanharb/a2oc_delib
Open Datasets | Yes | Arcade Learning Environment (Bellemare et al. 2013)
Dataset Splits | No | The paper describes data preprocessing and training parameters but does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions algorithms such as A3C and DQN and the use of convolutional neural networks, but it does not specify software dependencies with version numbers (e.g., the Python version or deep learning frameworks such as PyTorch or TensorFlow and their versions).
Experiment Setup | Yes | As for the hyperparameters, we use an ϵ-greedy policy over options, with ϵ = 0.1. The preprocessing is the same as in A3C, with RGB pixels scaled to 84×84 grayscale images. The agent repeats actions for 4 consecutive moves and receives stacks of 4 frames as inputs. We used entropy regularization of 0.01, which pushes option policies not to collapse to deterministic policies. A learning rate of 0.0007 was used in all experiments.
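
The Pseudocode row points to Algorithm 1 (Asynchronous Advantage Option Critic). As a reading aid, here is a minimal Python sketch of how a deliberation cost can enter the termination update, assuming the margin formulation in which a cost η is added to the option advantage; the names termination_loss, q_omega, v, and eta are illustrative and are not taken from the authors' code.

```python
# Minimal sketch, assuming the deliberation cost acts as a margin (eta) added
# to the option advantage in the termination objective. This is not the
# authors' implementation; q_omega, v, and eta are illustrative names.


def termination_loss(beta: float, q_omega: float, v: float, eta: float) -> float:
    """Termination objective for a single state-option pair.

    beta    : termination probability of the current option in the next state
    q_omega : option value Q(s', omega)
    v       : value over options V(s')
    eta     : deliberation cost, penalizing frequent option switching
    """
    advantage = q_omega - v
    # Minimizing this pushes beta toward 0 unless the current option is worse
    # than V(s') by more than eta, so options persist longer.
    return beta * (advantage + eta)


if __name__ == "__main__":
    # Toy check: a larger deliberation cost discourages termination more.
    for eta in (0.0, 0.02, 0.1):
        print(f"eta={eta:.2f}  loss={termination_loss(0.5, 1.0, 0.98, eta):+.3f}")
```

With eta = 0 this sketch reduces to a termination term driven by the advantage alone, which is the usual option-critic behavior.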
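
The Experiment Setup row lists concrete preprocessing and hyperparameter values. The sketch below collects them in runnable form; only the numeric values come from the quoted setup, while the helper names (preprocess_frame, FrameStacker) and the use of OpenCV for resizing are assumptions.

```python
# Sketch of the reported Atari preprocessing and hyperparameters. Only the
# numeric values come from the paper; the helpers and the choice of OpenCV
# for resizing are assumptions.
import collections

import cv2
import numpy as np

EPSILON = 0.1          # epsilon-greedy policy over options
ENTROPY_REG = 0.01     # entropy regularization on option policies
LEARNING_RATE = 0.0007
ACTION_REPEAT = 4      # each action is repeated for 4 consecutive frames
FRAME_STACK = 4        # the agent receives stacks of 4 frames as input


def preprocess_frame(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert an RGB Atari frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)


class FrameStacker:
    """Keep the last FRAME_STACK preprocessed frames as the network input."""

    def __init__(self, k: int = FRAME_STACK):
        self.frames = collections.deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        processed = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return np.stack(list(self.frames), axis=0)  # shape (4, 84, 84)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(frame))
        return np.stack(list(self.frames), axis=0)
```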