Towards Interpretable Reinforcement Learning Using Attention Augmented Agents

Authors: Alexander Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, Danilo Jimenez Rezende

Venue: NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use the Arcade Learning Environment [33] to train and test our agent on 57 different Atari games. The model uses a 3-layer convolutional neural network followed by a convolutional LSTM as the vision core. We compare against two models without attentional bottlenecks to benchmark performance, both using the deeper residual network described in [34]. We find that our agent is competitive with these state-of-the-art baselines; see Table 1 for benchmark results and Appendix A.3 for learning curves and performance on individual levels.
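To make the quoted vision-core description concrete, the sketch below implements a 3-layer CNN followed by a convolutional LSTM cell in PyTorch. The channel counts, kernel sizes, strides, and the 84x84 input resolution are placeholder assumptions; the actual sizes are given in Appendix A.1 of the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: all four gates come from a single
    convolution over the concatenated input and hidden state."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class VisionCore(nn.Module):
    """3-layer CNN followed by a ConvLSTM, mirroring the paper's description.
    Channel counts, kernels, and strides below are placeholder guesses."""
    def __init__(self, hidden_channels=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.conv_lstm = ConvLSTMCell(64, hidden_channels)

    def forward(self, frame, state):
        z = self.convs(frame)                     # spatial feature map
        feats, state = self.conv_lstm(z, state)   # spatial features fed to the attention head
        return feats, state


# Example: with these (assumed) strides, an 84x84 RGB frame maps to a 7x7 grid.
core = VisionCore()
state = (torch.zeros(1, 128, 7, 7), torch.zeros(1, 128, 7, 7))
feats, state = core(torch.zeros(1, 3, 84, 84), state)
```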
Researcher Affiliation | Industry | Alex Mott*, Daniel Zoran*, Mike Chrzanowski, Daan Wierstra, Danilo J. Rezende; DeepMind, London, UK; {alexmott,danielzoran,chrzanowski,wierstra,danilor}@google.com
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a link to an open-source implementation of the agent.
Open Datasets | Yes | We use the Arcade Learning Environment [33] to train and test our agent on 57 different Atari games.
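As an illustration of how these environments can be obtained, the snippet below instantiates one Atari game through the gym bindings to the Arcade Learning Environment. The environment id, the lack of preprocessing wrappers, and the classic (pre-0.26) gym reset/step API used here are assumptions, not details taken from the paper.

```python
# Load one of the 57 Atari games via gym's ALE bindings (requires the
# gym[atari] extras; the id below is an illustrative choice).
import gym

env = gym.make("PongNoFrameskip-v4")
obs = env.reset()
for _ in range(10):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()
```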
Dataset Splits | No | The paper mentions training and testing but does not give explicit training/validation/test splits (e.g., percentages or sample counts), nor does it reference precisely identified standard splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper does not list the software frameworks, libraries, or versions required to reproduce the experiments.
Experiment Setup | Yes | The model uses a 3-layer convolutional neural network followed by a convolutional LSTM as the vision core. Another (fully connected) LSTM generates a policy π and a baseline function V as output; it takes as input the query and answer vectors, the previous reward, and a one-hot encoding of the previous action. The query network is a three-layer MLP, which takes as input the hidden state h of the LSTM from the previous time step and produces 4 attention queries. See Appendix A.1 for a full specification of the network sizes. We use the Importance Weighted Actor-Learner Architecture [34] to train our agents, with an actor-critic setup and a VTRACE loss optimized with RMSProp (see learning parameters in Appendix A.1 for more details).
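The quoted setup can be sketched as a recurrent policy core: a three-layer MLP query network driven by the previous LSTM hidden state, and an LSTM that consumes the attention answer, the previous reward, and a one-hot previous action before emitting policy logits and a baseline value. All layer widths and the key dimension below are placeholders (the stated facts are only the three-layer MLP, the 4 queries, and the LSTM inputs); the real values are in Appendix A.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyCore(nn.Module):
    """Sketch of the recurrent core: a three-layer MLP query network fed by
    the previous LSTM hidden state, plus an LSTM over [answer, prev_reward,
    one-hot prev_action] producing policy logits and a baseline V.
    Hidden sizes and key_dim are placeholder assumptions."""
    def __init__(self, answer_dim, num_actions, key_dim=16, num_queries=4,
                 hidden_dim=256):
        super().__init__()
        self.num_queries, self.key_dim = num_queries, key_dim
        self.num_actions = num_actions
        self.query_net = nn.Sequential(            # three-layer MLP
            nn.Linear(hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_queries * key_dim),
        )
        self.lstm = nn.LSTMCell(answer_dim + 1 + num_actions, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        self.baseline_head = nn.Linear(hidden_dim, 1)

    def forward(self, answer, prev_reward, prev_action, state):
        h_prev, c_prev = state
        # 4 attention queries computed from the previous step's hidden state.
        queries = self.query_net(h_prev).view(-1, self.num_queries, self.key_dim)
        x = torch.cat([answer,
                       prev_reward.unsqueeze(-1),
                       F.one_hot(prev_action, self.num_actions).float()], dim=-1)
        h, c = self.lstm(x, (h_prev, c_prev))
        return (self.policy_head(h),                # policy logits
                self.baseline_head(h).squeeze(-1),  # baseline V
                queries, (h, c))
```

The IMPALA actor-learner loop, the VTRACE loss, and the RMSProp hyperparameters referenced in the quoted setup would wrap this forward pass and are omitted from the sketch.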