A Geometric Perspective on Optimal Representations for Reinforcement Learning

Authors: Marc Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, Clare Lyle

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theoretical results with an empirical study in a simple grid world environment, focusing on the use of deep learning techniques to learn representations. Our concrete instantiation (Algorithm 1) uses the representation loss (5). We perform all of our experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017; Figure 2, see also Appendix H.1). We report the quality of the learned policies after training, as a function of d, the size of the representation. Our quality measure is the average return from the designated start state (bottom left). Results are provided in Figure 4 and Figure 13 (appendix).
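For illustration, the following is a hedged, self-contained Python sketch of the evaluation protocol quoted above: roll out a policy in a small four-room grid world and report the average return from the designated start state (bottom-left). The layout, the reward of 1 at the goal, the discount factor, and the uniform-random policy used here are assumptions for the sketch, not the paper's exact configuration.

```python
# Minimal four-room grid world and average-return evaluation (illustrative only).
import numpy as np

LAYOUT = [
    "#############",
    "#     #     #",
    "#     #     #",
    "#           #",
    "#     #     #",
    "#     #     #",
    "## ####     #",
    "#     ### ###",
    "#     #     #",
    "#     #     #",
    "#           #",
    "#     #     #",
    "#############",
]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(pos, action, goal):
    """One environment transition; bumping into a wall leaves the agent in place."""
    r, c = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if LAYOUT[r][c] == "#":
        r, c = pos
    reward = 1.0 if (r, c) == goal else 0.0
    return (r, c), reward, (r, c) == goal

def average_return(policy, start, goal, episodes=100, gamma=0.99, horizon=500, seed=0):
    """Average (discounted) return of `policy` from the designated start state."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        pos, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            pos, reward, done = step(pos, policy(pos, rng), goal)
            g += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))

if __name__ == "__main__":
    uniform_random = lambda pos, rng: rng.integers(4)   # placeholder for a learned policy
    print(average_return(uniform_random, start=(11, 1), goal=(1, 11)))
```

In the paper this quality measure would be applied to the policy learned on top of each d-dimensional representation; the random policy above is only a stand-in.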
Researcher Affiliation | Collaboration | 1 Google Research, 2 DeepMind, 3 Mila, Université de Montréal, 4 University of Alberta, 5 University of Oxford
Pseudocode | Yes | Algorithm 1 (Representation learning using AVFs). Input: k, the desired number of AVFs; d, the desired number of features. Sample δ_1, ..., δ_k ∈ [−1, 1]^n. Compute µ_i = arg max_π δ_i⊤ V^π using a policy gradient method. Find φ̂ = arg min_φ L(φ; {V^{µ_1}, ..., V^{µ_k}}) (Equation 5).
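For concreteness, here is a minimal Python/NumPy sketch of the shape of Algorithm 1 on a small tabular MDP. It is not the authors' implementation: the policy-gradient step is replaced by a greedy local search over deterministic policies, the representation loss of Equation 5 is assumed to be the least-squares error of regressing each AVF onto the features and is minimised in closed form by a truncated SVD rather than by a deep network, and all function names (make_random_mdp, local_search_avf, ...) are hypothetical.

```python
import numpy as np

def make_random_mdp(n_states=20, n_actions=4, gamma=0.99, seed=0):
    """Small random MDP: P[s, a] is a distribution over next states."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return P, r, gamma

def value_of(policy, P, r, gamma):
    """Exact V^pi for a deterministic policy (array of one action per state)."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]
    r_pi = r[np.arange(n), policy]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def local_search_avf(delta, P, r, gamma, sweeps=20, seed=0):
    """Heuristic for mu = argmax_pi delta^T V^pi (stand-in for policy gradient)."""
    n, m = r.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(m, size=n)
    best = delta @ value_of(policy, P, r, gamma)
    for _ in range(sweeps):
        improved = False
        for s in range(n):
            for a in range(m):
                cand = policy.copy()
                cand[s] = a
                score = delta @ value_of(cand, P, r, gamma)
                if score > best + 1e-10:
                    policy, best, improved = cand, score, True
        if not improved:
            break
    return policy

def learn_representation(k=50, d=8, seed=0):
    P, r, gamma = make_random_mdp(seed=seed)
    n = P.shape[0]
    rng = np.random.default_rng(seed)
    deltas = rng.uniform(-1.0, 1.0, size=(k, n))           # interest functions
    avfs = np.stack([value_of(local_search_avf(dlt, P, r, gamma), P, r, gamma)
                     for dlt in deltas], axis=1)           # (n, k) matrix of V^{mu_i}
    # Closed-form minimiser of the least-squares representation loss:
    # the top-d left singular vectors of the AVF matrix (Eckart-Young).
    U, _, _ = np.linalg.svd(avfs, full_matrices=False)
    return U[:, :d]                                        # (n, d) state features

if __name__ == "__main__":
    print(learn_representation().shape)   # (20, 8)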
Open Source Code | No | The paper does not provide an explicit statement of releasing code for the described methodology, nor a direct link to a source code repository for the implementation.
Open Datasets | Yes | We perform all of our experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017; Figure 2, see also Appendix H.1).
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) of the kind typical for supervised learning. As the research is in reinforcement learning, it centres on agent interaction with an environment rather than on pre-partitioned datasets.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software such as TensorFlow and Dopamine, and the RMSProp optimizer, but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | We consider a two-part network where we pretrain φ end-to-end to predict a set of value functions. We adapt the parameters of the deep network using RMSProp (Tieleman and Hinton, 2012). We learn a d = 16-dimensional representation, not including the bias unit. We sample k = 1000 interest functions and use Algorithm 1 to generate k AVFs. We compare the value-based and AVF-based representations from the previous section (VALUE and AVF), and also proto-value functions (PVF; details in Appendix H.3).
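As a concrete illustration of this setup, here is a hedged TensorFlow/Keras sketch of the pretraining stage: a two-part network whose first part φ maps a state to d = 16 features and whose second part is a linear head predicting k = 1000 targets (the AVFs from Algorithm 1), trained end-to-end with RMSProp. The hidden-layer size, one-hot state encoding, learning rate, and number of epochs are assumptions rather than the paper's exact configuration; the paper's additional bias unit on the representation is omitted, and the random targets below merely stand in for the actual AVF values.

```python
import numpy as np
import tensorflow as tf

n_states, d, k = 104, 16, 1000   # ~104 open cells in a typical four-room grid (illustrative)

# Two-part network: phi (representation) followed by a linear prediction head.
inputs = tf.keras.Input(shape=(n_states,))
hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
features = tf.keras.layers.Dense(d, activation="relu", name="phi")(hidden)
outputs = tf.keras.layers.Dense(k)(features)            # linear head over the k AVF targets

model = tf.keras.Model(inputs, outputs)                 # trained end-to-end
phi = tf.keras.Model(inputs, features)                  # the learned representation

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3), loss="mse")

states = np.eye(n_states, dtype=np.float32)                                # one-hot states
avf_targets = np.random.uniform(-1, 1, (n_states, k)).astype(np.float32)   # placeholder AVFs
model.fit(states, avf_targets, epochs=50, batch_size=32, verbose=0)

representation = phi(states).numpy()                    # (n_states, d) feature matrix
```

In the paper's pipeline, the placeholder targets would be replaced by the AVFs produced by Algorithm 1, and the resulting φ would then be frozen and used to learn the final policy.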