Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Goal-Space Planning with Subgoal Models
Authors: Chunlok Lo, Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, Gabor Mihucz, Farzane Aminmansour, Martha White
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate properties of GSP empirically, in a simplified setting where we assume subgoals are given to the agent and the subgoal models, which are actually UVFAs (Schaul et al., 2015), are learned offline. Our goal is to understand the utility of this planning formalism, without simultaneously solving subgoal discovery and efficient off-policy UVFA learning. We show that 1) it propagates value and learns an optimal policy faster than its base learner, 2) it can perform well with somewhat suboptimal subgoal selection, but can harm performance if subgoals are very poorly selected, 3) it is quite robust to inaccuracy in its models, and 4) alternative potential-based rewards and alternative ways to incorporate subgoal values are not as effective as the particular approach used in GSP. We conclude with a discussion of the large literature of related work and of the benefits of GSP over other background planning approaches, as well as limitations of this work. |
| Researcher Affiliation | Academia | Chunlok Lo EMAIL Kevin Roice EMAIL Parham Mohammad Panahi EMAIL Scott M. Jordan EMAIL Adam White EMAIL Gabor Mihucz EMAIL Farzane Aminmansour EMAIL Martha White EMAIL Canada CIFAR AI Chair Alberta Machine Intelligence Institute (Amii) Department of Computing Science, University of Alberta Edmonton, Alberta, Canada |
| Pseudocode | Yes | Algorithm 1: Goal Space Planning with DDQN as a base learner; Algorithm 2: Planning() |
| Open Source Code | No | The paper does not contain any explicit statements about the release of their own source code or links to a repository for the methodology described. |
| Open Datasets | Yes | The Pin Ball configuration that we used is based on the easy configuration found at https://github.com/DecisionMakingAI/BenchmarkEnvironments.jl, which was released under the MIT license. We have modified the environment to support additional features such as changing terminations, visualizing subgoals, and various bug fixes. |
| Dataset Splits | No | For Hypothesis 1, we collect a single episode of experience from Sarsa(0)+GSP to use as the fixed dataset for all learners. The results are similar to those on Four Rooms, shown in Figure 6. The Sarsa(0) algorithm only updates the value of the tiles activated by the state preceding the goal. Sarsa(λ) has a decaying trail of updates to the tiles activated preceding the goal, and the GSP learners update values at all states in the initiation set of a subgoal. For Hypothesis 2, we measure performance (steps to goal) in both Grid Ball and Pin Ball domains, shown in Figure 7. As before, GSP significantly improves the rate of learning. We ran GSP on Four Rooms (without the lava pools) with each subgoal configuration defined in the previous section. We measured how much time the agent spends in the bottom left room and the top right room. |
| Hardware Specification | No | The paper mentions "Digital Research Alliance of Canada for the computation resources" but does not provide specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using Sarsa(λ), Sarsa(0), Double Deep Q-Network (DDQN), and the Adam optimizer, but does not provide specific version numbers for any software libraries or frameworks. It specifies Adam optimizer hyperparameters (η = 0.001, b1 = 0.9, b2 = 0.999, ϵ = 10−8) but not the software version. |
| Experiment Setup | Yes | We used a discount factor of γ = 0.99 and Sarsa(λ = 0.9) or Sarsa(0) for the experiments in this section. We used an exploration rate of ϵ = 0.02 in Four Rooms and ϵ = 0.1 in Grid Ball. ϵ was decayed by 0.05% each timestep. All learners used γc = 0.99 and λ = 0.9. For the DDQN base learner, we use α = 0.004, γc = 0.99, ϵ = 0.1, a buffer that holds up to 10,000 transitions, a batch size of 32, and a target refresh rate of every 100 steps. The Q-Network weights used Kaiming initialisation (He et al., 2015). We swept its learning rate α over [5 × 10−4, 1 × 10−3, 2 × 10−3, 4 × 10−3, 5 × 10−3] and target refresh rate τ over [1, 50, 100, 200, 1000], as shown in Figure 26. We use the Adam optimizer with η = 0.001 and the other parameters set to the default (b1 = 0.9, b2 = 0.999, ϵ = 10−8), mini-batches of 1024 transitions, and 100 epochs. |
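For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. This is an illustrative sketch only: the dictionary name and key structure are our own, not from the paper; the values are those quoted above.

```python
# Hypothetical consolidation of the hyperparameters quoted from the paper.
# Key names are our own; values are taken verbatim from the Experiment Setup row.
gsp_config = {
    "gamma": 0.99,                    # discount factor
    "gamma_c": 0.99,                  # discount used by all learners
    "lambda": 0.9,                    # eligibility-trace parameter for Sarsa(λ)
    "epsilon_four_rooms": 0.02,       # exploration rate in Four Rooms
    "epsilon_grid_ball": 0.1,         # exploration rate in Grid Ball
    "epsilon_decay_per_step": 0.0005, # ε decayed by 0.05% each timestep
    "ddqn": {
        "alpha": 0.004,               # learning rate
        "epsilon": 0.1,
        "buffer_size": 10_000,
        "batch_size": 32,
        "target_refresh_steps": 100,
    },
    "adam": {"eta": 1e-3, "b1": 0.9, "b2": 0.999, "eps": 1e-8},
    "sweep": {
        "alpha": [5e-4, 1e-3, 2e-3, 4e-3, 5e-3],
        "target_refresh": [1, 50, 100, 200, 1000],
    },
    "offline_model_training": {"minibatch_size": 1024, "epochs": 100},
}

# The reported sweep over α and target refresh rate implies a full grid of:
n_sweep_runs = (len(gsp_config["sweep"]["alpha"])
                * len(gsp_config["sweep"]["target_refresh"]))
print(n_sweep_runs)  # 25 (α, τ) combinations
```

A flat structure like this also makes it easy to spot what the report flags as missing: the configuration records no dataset splits, hardware, or library versions, consistent with the "No" rows above.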