Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Goal-Space Planning with Subgoal Models
Authors: Chunlok Lo, Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, Gabor Mihucz, Farzane Aminmansour, Martha White
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate properties of GSP empirically, in a simplified setting where we assume subgoals are given to the agent and the subgoal models, which are actually UVFAs (Schaul et al., 2015), are learned offline. Our goal is to understand the utility of this planning formalism, without simultaneously solving subgoal discovery and efficient off-policy UVFA learning. We show that 1) it propagates value and learns an optimal policy faster than its base learner, 2) it can perform well with somewhat suboptimal subgoal selection, but can harm performance if subgoals are very poorly selected, 3) it is quite robust to inaccuracy in its models, and 4) alternative potential-based rewards and alternative ways to incorporate subgoal values are not as effective as the particular approach used in GSP. We conclude with a discussion of the large literature of related work and of the benefits of GSP over other background planning approaches, as well as limitations of this work. |
| Researcher Affiliation | Academia | Chunlok Lo EMAIL Kevin Roice EMAIL Parham Mohammad Panahi EMAIL Scott M. Jordan EMAIL Adam White EMAIL Gabor Mihucz EMAIL Farzane Aminmansour EMAIL Martha White EMAIL Canada CIFAR AI Chair Alberta Machine Intelligence Institute (Amii) Department of Computing Science, University of Alberta Edmonton, Alberta, Canada |
| Pseudocode | Yes | Algorithm 1: Goal Space Planning with DDQN as a base learner; Algorithm 2: Planning() |
| Open Source Code | No | The paper does not contain any explicit statements about the release of their own source code or links to a repository for the methodology described. |
| Open Datasets | Yes | The Pin Ball configuration that we used is based on the easy configuration found at https://github.com/DecisionMakingAI/BenchmarkEnvironments.jl, which was released under the MIT license. We have modified the environment to support additional features such as changing terminations, visualizing subgoals, and various bug fixes. |
| Dataset Splits | No | For Hypothesis 1, we collect a single episode of experience from Sarsa(0)+GSP to use as the fixed dataset for all learners. The results are similar to those on Four Rooms, shown in Figure 6. The Sarsa(0) algorithm only updates the value of the tiles activated by the state preceding the goal. Sarsa(λ) has a decaying trail of updates to the tiles activated preceding the goal, and the GSP learners update values at all states in the initiation set of a subgoal. For Hypothesis 2, we measure performance (steps to goal) in both Grid Ball and Pin Ball domains, shown in Figure 7. As before, GSP significantly improves the rate of learning. We ran GSP on Four Rooms (without the lava pools) with each subgoal configuration defined in the previous section. We measured how much time the agent spends in the bottom left room and the top right room. |
| Hardware Specification | No | The paper mentions "Digital Research Alliance of Canada for the computation resources" but does not provide specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using Sarsa(λ), Sarsa(0), Double Deep Q-Network (DDQN), and the Adam optimizer, but does not provide specific version numbers for any software libraries or frameworks. It specifies Adam optimizer hyperparameters (η = 0.001, b1 = 0.9, b2 = 0.999, ϵ = 10−8) but not the software version. |
| Experiment Setup | Yes | We used a discount factor of γ = 0.99 and Sarsa(λ = 0.9) or Sarsa(0) for the experiments in this section. We used an exploration rate of ϵ = 0.02 in Four Rooms and ϵ = 0.1 in Grid Ball. ϵ was decayed by 0.05% each timestep. All learners used γc = 0.99 and λ = 0.9. For the DDQN base learner, we use α = 0.004, γc = 0.99, ϵ = 0.1, a buffer that holds up to 10,000 transitions, a batch size of 32, and a target refresh rate of every 100 steps. The Q-Network weights used Kaiming initialisation (He et al., 2015). We swept its learning rate α over [5 × 10−4, 1 × 10−3, 2 × 10−3, 4 × 10−3, 5 × 10−3] and target refresh rate τ over [1, 50, 100, 200, 1000], as shown in Figure 26. We use the Adam optimizer with η = 0.001 and the other parameters set to the default (b1 = 0.9, b2 = 0.999, ϵ = 10−8), mini-batches of 1024 transitions, and 100 epochs. |
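For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. This is an illustrative sketch only: the dictionary name and key structure are our own, not from the paper; the values are those quoted above.

```python
# Hypothetical consolidation of the hyperparameters quoted from the paper.
# Key names are our own; values are taken verbatim from the Experiment Setup row.
gsp_config = {
    "gamma": 0.99,                    # discount factor
    "gamma_c": 0.99,                  # discount used by all learners
    "lambda": 0.9,                    # eligibility-trace parameter for Sarsa(λ)
    "epsilon_four_rooms": 0.02,       # exploration rate in Four Rooms
    "epsilon_grid_ball": 0.1,         # exploration rate in Grid Ball
    "epsilon_decay_per_step": 0.0005, # ε decayed by 0.05% each timestep
    "ddqn": {
        "alpha": 0.004,               # learning rate
        "epsilon": 0.1,
        "buffer_size": 10_000,
        "batch_size": 32,
        "target_refresh_steps": 100,
    },
    "adam": {"eta": 1e-3, "b1": 0.9, "b2": 0.999, "eps": 1e-8},
    "sweep": {
        "alpha": [5e-4, 1e-3, 2e-3, 4e-3, 5e-3],
        "target_refresh": [1, 50, 100, 200, 1000],
    },
    "offline_model_training": {"minibatch_size": 1024, "epochs": 100},
}

# The reported sweep over α and target refresh rate implies a full grid of:
n_sweep_runs = (len(gsp_config["sweep"]["alpha"])
                * len(gsp_config["sweep"]["target_refresh"]))
print(n_sweep_runs)  # 25 (α, τ) combinations
```

A flat structure like this also makes it easy to spot what the report flags as missing: the configuration records no dataset splits, hardware, or library versions, consistent with the "No" rows above.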