A Geometric Perspective on Optimal Representations for Reinforcement Learning
Authors: Marc Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, Clare Lyle
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theoretical results with an empirical study in a simple grid world environment, focusing on the use of deep learning techniques to learn representations. Our concrete instantiation (Algorithm 1) uses the representation loss (5). We perform all of our experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017, Figure 2, see also Appendix H.1). We report the quality of the learned policies after training, as a function of d, the size of the representation. Our quality measure is the average return from the designated start state (bottom left). Results are provided in Figure 4 and Figure 13 (appendix). |
| Researcher Affiliation | Collaboration | 1 Google Research, 2 DeepMind, 3 Mila, Université de Montréal, 4 University of Alberta, 5 University of Oxford |
| Pseudocode | Yes | Algorithm 1 (Representation learning using AVFs). Input: k, the desired number of AVFs; d, the desired number of features. Sample δ_1, ..., δ_k ∈ [−1, 1]^n; compute µ_i = argmax_π δ_i⊤ V^π using a policy gradient method; find φ = argmin_φ L(φ; {V^{µ_1}, ..., V^{µ_k}}) (Equation 5). A hedged tabular sketch of this procedure is given after the table. |
| Open Source Code | No | The paper does not provide an explicit statement of releasing code for the described methodology or a direct link to a source code repository for their implementation. |
| Open Datasets | Yes | We perform all of our experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017, Figure 2, see also Appendix H.1). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) as is typical for supervised learning datasets. As the research is in reinforcement learning, it focuses on agent interaction with an environment rather than pre-partitioned datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as TensorFlow and Dopamine and the RMSProp optimizer, but it does not provide version numbers for these components. |
| Experiment Setup | Yes | We consider a two-part network where we pretrain φ end-to-end to predict a set of value functions. We adapt the parameters of the deep network using RMSProp (Tieleman and Hinton, 2012). We learn a d = 16 dimensional representation, not including the bias unit. We sample k = 1000 interest functions and use Algorithm 1 to generate k AVFs. We compare the value-based and AVF-based representations from the previous section (VALUE and AVF), and also proto-value functions (PVF; details in Appendix H.3). A sketch of this pretraining setup is given after the table. |
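The pseudocode row above (Algorithm 1) admits a compact tabular instantiation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes exact access to the MDP's transition tensor `P` (shape `(A, S, S)`) and reward table `r` (shape `(S, A)`), approximates each AVF µ_i = argmax_π δ_i⊤ V^π with softmax policy-gradient ascent, and uses the fact that, for a state-indexed representation, the two-part least-squares loss over a rank-d Φ is minimized by a truncated SVD of the stacked AVFs. Helper names such as `policy_value` and `avf_policy_gradient` are hypothetical.

```python
# Tabular sketch of Algorithm 1 (representation learning via adversarial value
# functions, AVFs). All names and hyperparameters here are illustrative.
import numpy as np

def policy_value(P, r, gamma, pi):
    """Exact V^pi = (I - gamma * P_pi)^{-1} r_pi for a tabular MDP."""
    P_pi = np.einsum('sa,ast->st', pi, P)      # (S, S) transitions under pi
    r_pi = (pi * r).sum(axis=1)                # (S,) expected reward under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def avf_policy_gradient(P, r, gamma, delta, steps=300, lr=0.1):
    """Approximate mu = argmax_pi delta^T V^pi with a softmax policy gradient."""
    A, S, _ = P.shape
    logits = np.zeros((S, A))
    for _ in range(steps):
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        V = policy_value(P, r, gamma, pi)
        Q = r + gamma * np.einsum('ast,t->sa', P, V)            # Q^pi(s, a)
        P_pi = np.einsum('sa,ast->st', pi, P)
        # Signed discounted occupancy induced by the interest function delta:
        # d solves (I - gamma * P_pi)^T d = delta.
        d = np.linalg.solve((np.eye(S) - gamma * P_pi).T, delta)
        adv = Q - (pi * Q).sum(axis=1, keepdims=True)           # advantage A^pi(s, a)
        logits += lr * d[:, None] * pi * adv                    # gradient of delta^T V^pi
    return pi

def avf_representation(P, r, gamma, k=1000, d=16, seed=0):
    """Sample k interest functions, compute their AVFs, extract d features."""
    rng = np.random.default_rng(seed)
    S = r.shape[0]
    deltas = rng.uniform(-1.0, 1.0, size=(k, S))                # delta_i in [-1, 1]^S
    avfs = np.stack([policy_value(P, r, gamma,
                                  avf_policy_gradient(P, r, gamma, dlt))
                     for dlt in deltas], axis=1)                # (S, k) matrix of V^{mu_i}
    # For a tabular (state-indexed) Phi, minimizing the two-part least-squares
    # loss over rank-d features is a best rank-d approximation of the AVF
    # matrix, i.e. its truncated SVD.
    U, _, _ = np.linalg.svd(avfs, full_matrices=False)
    return U[:, :d]                                             # phi(s) rows, d columns
```

For the four-room domain one would construct `P` and `r` from the grid layout and call `avf_representation(P, r, gamma, k=1000, d=16)`; the paper instead trains the deep two-part network sketched next.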
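The "Experiment Setup" row describes a two-part network whose d = 16 penultimate layer φ is pretrained end-to-end with RMSProp to regress the k = 1000 AVF targets. The TensorFlow/Keras sketch below is a hypothetical rendering of that setup: the one-hot state encoding, hidden width, learning rate, epoch count, and the random stand-in targets are assumptions, and the stand-in targets would be replaced by the AVFs produced by Algorithm 1.

```python
# Hypothetical pretraining sketch: two-part network (nonlinear torso -> d = 16
# representation phi -> linear heads) fit to k = 1000 AVF targets with RMSProp.
import numpy as np
import tensorflow as tf

num_states, d, k = 104, 16, 1000                    # 104 free cells is an assumption
states = np.eye(num_states, dtype=np.float32)       # assumed one-hot state encoding
avf_targets = np.random.randn(num_states, k).astype(np.float32)  # stand-in for V^{mu_i}

inputs = tf.keras.Input(shape=(num_states,))
h = tf.keras.layers.Dense(64, activation='relu')(inputs)          # assumed torso width
phi = tf.keras.layers.Dense(d, activation='relu', name='phi')(h)  # the representation
outputs = tf.keras.layers.Dense(k)(phi)             # linear heads (bias unit included)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3), loss='mse')
model.fit(states, avf_targets, epochs=200, batch_size=32, verbose=0)

# After pretraining, read the learned features off the penultimate layer.
feature_extractor = tf.keras.Model(inputs, model.get_layer('phi').output)
features = feature_extractor(states)                # (num_states, 16) representation
```

A separate `feature_extractor` model is kept so that the d = 16 features can be reused downstream once the prediction heads are discarded.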