Universal Value Function Approximators

Authors: Tom Schaul, Daniel Horgan, Karol Gregor, David Silver

ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We ran several experiments to investigate the generalisation capabilities of UVFAs. In each case, the scenario is one of supervised learning, where the ground truth values V_g(s) or Q_g(s, a) are only given for some training set of pairs (s, g). We trained a UVFA on that data, and evaluated its generalisation capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy quality of a value function approximator Q(s, a, g; θ) to be the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy of these values with temperature τ. [Both measures are sketched in code after this table.]
Researcher Affiliation | Industry | Tom Schaul (SCHAUL@GOOGLE.COM), Dan Horgan (HORGAN@GOOGLE.COM), Karol Gregor (KAROLG@GOOGLE.COM), David Silver (DAVIDSILVER@GOOGLE.COM), Google DeepMind, 5 New Street Square, EC4A 3TW London
Pseudocode | Yes | Algorithm 1: UVFA learning from Horde targets
Open Source Code | No | The paper mentions existing tools such as Torch7 and the Adam optimiser, but does not provide any explicit statement or link to open-source code for its own method.
Open Datasets | Yes | On the Atari game of Ms Pacman, we then demonstrate that UVFAs can scale to larger visual input spaces and different types of goals... We used a hand-crafted goal space G: for each pellet on the screen, we defined eating it as an individual goal g ∈ R², which is represented by the pellet's (x, y) coordinate on-screen. Following Algorithm 1, a Horde with 150 demons was trained. Each demon processed the visual input directly from the screen (see Appendix C for further experimental details). [The goal construction is sketched in code after this table.]
Dataset Splits | Yes | We trained a UVFA on that data, and evaluated its generalisation capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy quality of a value function approximator Q(s, a, g; θ) to be the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy of these values with temperature τ.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper references Torch7 and Adam as relevant tools but does not specify version numbers for any software dependencies used in its implementation or experiments.
Experiment Setup | Yes | Concretely, we represent states in the Lava World by a grid of pixels for the currently observed room. Each pixel is a binary vector indicating the presence of the agent, lava, empty space, or door respectively. We represent goals as a desired state grid, also represented in terms of pixels, i.e. G ⊂ S. The data matrix M is constructed from all states, but only the goals in the training set (half of all possible states, randomly selected); a separate three-layer MLP is used for φ and ψ, and training follows our proposed two-stage approach (lines 17 to 24 in Algorithm 1 below; see also Section 3.1 and Appendix B), with a small rank of n = 7 that provides sufficient training performance (i.e., 90% policy quality, see Figure 5). Figure 7 summarizes the results, showing that it is possible to interpolate the value function to a useful level of quality on the test set of goals (the remaining half of G). [A sketch of this two-stage procedure follows the table.]
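
The evaluation protocol quoted in the Research Type and Dataset Splits rows combines a held-out prediction error with a policy-quality measure. Below is a minimal Python sketch of both measures; it is not the authors' code, and the environment interface env.step_for_goal, the action list, and the default temperature, discount, and horizon are illustrative assumptions.

import numpy as np

def heldout_mse(q_approx, q_true, heldout_pairs, actions):
    """Prediction error (MSE) over a held-out set of unseen (s, g) pairs."""
    errors = []
    for s, g in heldout_pairs:
        for a in actions:
            errors.append((q_approx(s, a, g) - q_true(s, a, g)) ** 2)
    return float(np.mean(errors))

def softmax_action(q_approx, s, g, actions, tau):
    """Sample an action from the soft-max (Boltzmann) policy with temperature tau."""
    q = np.array([q_approx(s, a, g) for a in actions])
    p = np.exp((q - q.max()) / tau)
    p /= p.sum()
    return actions[np.random.choice(len(actions), p=p)]

def policy_quality(q_approx, env, goals, start_states, actions,
                   tau=0.1, gamma=0.95, horizon=200):
    """Expected discounted reward w.r.t. each goal's pseudo-reward,
    averaged over all start states, following the soft-max policy."""
    returns = []
    for g in goals:
        for s0 in start_states:
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):
                a = softmax_action(q_approx, s, g, actions, tau)
                s, r, done = env.step_for_goal(s, a, g)  # assumed interface
                ret += discount * r
                discount *= gamma
                if done:
                    break
            returns.append(ret)
    return float(np.mean(returns))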
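
The Pseudocode and Experiment Setup rows refer to the paper's two-stage training scheme: factorise a data matrix M of Horde value targets (rows indexed by states, columns by training goals) into rank-n embeddings, then regress two separate MLPs φ(s) and ψ(g) onto those embeddings. The sketch below is only an approximation of Algorithm 1: a plain truncated SVD stands in for the paper's matrix-factorisation step, and the layer widths, optimiser settings, and epoch count are assumptions.

import numpy as np
import torch
import torch.nn as nn

def factorise_targets(M, rank=7):
    """Stage one: rank-n factorisation of the target matrix, M ~= S @ G.T."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    S = U[:, :rank] * np.sqrt(sigma[:rank])      # one n-dim embedding per state
    G = Vt[:rank, :].T * np.sqrt(sigma[:rank])   # one n-dim embedding per training goal
    return S, G

def three_layer_mlp(in_dim, out_dim, hidden=128):
    """Small three-layer MLP; the layer sizes are placeholders, not the paper's."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

def regress_embeddings(net, inputs, targets, epochs=200, lr=1e-3):
    """Stage two: regress an MLP onto the factorised embeddings with an MSE loss."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    x = torch.as_tensor(inputs, dtype=torch.float32)
    y = torch.as_tensor(targets, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return net

# Illustrative usage: M holds Horde value targets for every state and each goal
# in the training half of the goal space; rank n = 7 follows the Experiment
# Setup row. The UVFA prediction for any (s, g) is the dot product phi(s).psi(g).
# S_emb, G_emb = factorise_targets(M, rank=7)
# phi = regress_embeddings(three_layer_mlp(state_dim, 7), state_features, S_emb)
# psi = regress_embeddings(three_layer_mlp(goal_dim, 7), goal_features, G_emb)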
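
The Open Datasets row describes the hand-crafted Ms Pacman goal space: one goal per pellet, identified by its (x, y) screen coordinate, with one Horde demon per goal. The snippet below is a minimal illustration of that construction, assuming hypothetical pellet_positions and pellets_eaten inputs; it is not taken from the paper.

from typing import List, Tuple

Goal = Tuple[int, int]  # a pellet's (x, y) coordinate on-screen

def make_goal_space(pellet_positions: List[Goal]) -> List[Goal]:
    """Each pellet on the screen defines one goal g in R^2."""
    return list(pellet_positions)

def pseudo_reward(pellets_eaten: List[Goal], goal: Goal) -> float:
    """Goal-specific pseudo-reward: 1 once the goal's pellet has been eaten."""
    return 1.0 if goal in pellets_eaten else 0.0

# A Horde with one demon per goal (the paper trains 150 such demons) learns
# one value function for each element of make_goal_space(...).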