Universal Value Function Approximators
Authors: Tom Schaul, Daniel Horgan, Karol Gregor, David Silver
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We ran several experiments to investigate the generalisation capabilities of UVFAs. In each case, the scenario is one of supervised learning, where the ground truth values V_g(s) or Q_g(s, a) are only given for some training set of pairs (s, g). We trained a UVFA on that data and evaluated its generalisation capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy quality of a value function approximator Q(s, a, g; θ), defined as the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy over these values with temperature τ. A minimal sketch of both measures is given below the table. |
| Researcher Affiliation | Industry | Tom Schaul (schaul@google.com), Dan Horgan (horgan@google.com), Karol Gregor (karolg@google.com), David Silver (davidsilver@google.com), Google DeepMind, 5 New Street Square, EC4A 3TW London |
| Pseudocode | Yes | The paper provides pseudocode as Algorithm 1, "UVFA learning from Horde targets". |
| Open Source Code | No | The paper mentions existing tools such as the Torch7 framework and the Adam optimiser, but does not provide any explicit statement or link for open-source code of its own methodology. |
| Open Datasets | Yes | On the Atari game of Ms Pacman, we then demonstrate that UVFAs can scale to larger visual input spaces and different types of goals... We used a hand-crafted goal space G: for each pellet on the screen, we defined eating it as an individual goal g ∈ ℝ², which is represented by the pellet's (x, y) coordinate on-screen. Following Algorithm 1, a Horde with 150 demons was trained. Each demon processed the visual input directly from the screen (see Appendix C for further experimental details). |
| Dataset Splits | Yes | We trained a UVFA on that data and evaluated its generalisation capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy quality of a value function approximator Q(s, a, g; θ), defined as the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy over these values with temperature τ. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper references Torch7 and the Adam optimiser as relevant tools but does not specify the version numbers of any software dependencies used in its implementation or experiments. |
| Experiment Setup | Yes | Concretely, we represent states in the Lava World by a grid of pixels for the currently observed room. Each pixel is a binary vector indicating the presence of the agent, lava, empty space, or door respectively. We represent goals as a desired state grid, also represented in terms of pixels, i.e. G ⊆ S. The data matrix M is constructed from all states, but only the goals in the training set (half of all possible states, randomly selected); a separate three-layer MLP is used for φ and ψ, and training follows our proposed two-stage approach (lines 17 to 24 in Algorithm 1; see also Section 3.1 and Appendix B), with a small rank of n = 7 that provides sufficient training performance (i.e., 90% policy quality, see Figure 5). Figure 7 summarizes the results, showing that it is possible to interpolate the value function to a useful level of quality on the test set of goals (the remaining half of G). A minimal sketch of this two-stage setup follows the table. |
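
As a rough illustration of the two measures quoted in the Research Type and Dataset Splits rows, the sketch below computes held-out MSE and a Monte-Carlo estimate of policy quality under a soft-max policy with temperature τ. This is not the authors' code: the tabular `q_table` indexing, the `env_rollout` helper, and the default `gamma` are assumptions made for the example.

```python
import numpy as np

def mse_on_heldout(q_approx, q_true):
    """Prediction error: MSE between approximate and ground-truth values
    on a held-out set of unseen (s, g) pairs."""
    q_approx = np.asarray(q_approx, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    return float(np.mean((q_approx - q_true) ** 2))

def softmax_policy(q_row, temperature=1.0):
    """Soft-max distribution over actions for one state, given Q(s, ., g)."""
    z = np.asarray(q_row, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def policy_quality(q_table, env_rollout, goals, start_states,
                   temperature=1.0, gamma=0.99):
    """Discounted return for each goal g, averaged over start states, when
    acting with the soft-max policy over Q(s, a, g).
    `env_rollout(start, goal, policy_fn, gamma)` is an assumed helper that
    runs one episode in the true environment and returns its discounted return;
    `q_table` is assumed to be indexed as [state, action, goal]."""
    returns = []
    for g in goals:
        for s0 in start_states:
            policy_fn = lambda s: softmax_policy(q_table[s, :, g], temperature)
            returns.append(env_rollout(s0, g, policy_fn, gamma))
    return float(np.mean(returns))
```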
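
Similarly, the two-stage Lava World setup quoted in the Experiment Setup row (factorise the value matrix M at rank n = 7, then learn the two halves of the embedding from raw features) could look roughly like the sketch below. It is a hedged approximation, not the paper's implementation: the paper trains three-layer MLPs for φ and ψ and factorises an M that need not be fully observed, whereas this sketch uses a plain truncated SVD of a complete matrix and ridge regression, with illustrative names such as `state_features` and `goal_features`.

```python
import numpy as np

def two_stage_uvfa(M, state_features, goal_features, rank=7, ridge=1e-3):
    """M: (num_states x num_train_goals) matrix of target values V_g(s).
    state_features: (num_states x d_s) raw state descriptors.
    goal_features:  (num_train_goals x d_g) raw goal descriptors.
    Returns a function value(s_feat, g_feat) ~ V_g(s)."""
    # Stage 1: low-rank factorisation M ~ Phi @ Psi.T
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    Phi = U[:, :rank] * S[:rank]      # target state embeddings, one row per s
    Psi = Vt[:rank, :].T              # target goal embeddings, one row per g

    # Stage 2: regress the embeddings from raw features
    # (three-layer MLPs in the paper; ridge regression keeps the sketch short).
    def ridge_fit(X, Y):
        A = X.T @ X + ridge * np.eye(X.shape[1])
        return np.linalg.solve(A, X.T @ Y)

    W_state = ridge_fit(state_features, Phi)   # maps s-features -> phi(s)
    W_goal = ridge_fit(goal_features, Psi)     # maps g-features -> psi(g)

    def value(s_feat, g_feat):
        """Predicted V_g(s) = phi(s) . psi(g), usable for unseen goals."""
        return float((s_feat @ W_state) @ (g_feat @ W_goal))

    return value
```

At evaluation time only the raw features of a new goal are needed, which is what allows interpolation to the held-out half of G described in the Experiment Setup row.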