Combining Q-Learning and Search with Amortized Value Estimates
Authors: Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, Peter W. Battaglia
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated SAVE in four distinct settings that vary in their branching factor, sparsity of rewards, and episode length. First, we demonstrate through a new Tightrope environment that SAVE performs well in settings where count-based policy approaches struggle, as discussed in Section 2.2. Next, we show that SAVE scales to the challenging Construction domain (Bapst et al., 2019) and that it alleviates the problem with off-policy actions discussed in Section 2.1. We also perform several ablations to tease apart the details of SAVE. Finally, we demonstrate that SAVE dramatically improves over Q-learning in a new and even more difficult construction task called Marble Run, as well as in more standard environments like Atari (Bellemare et al., 2013). |
| Researcher Affiliation | Industry | Jessica B. Hamrick (DeepMind) jhamrick@google.com; Victor Bapst (DeepMind) vbapst@google.com; Alvaro Sanchez-Gonzalez (DeepMind) alvarosg@google.com; Tobias Pfaff (DeepMind) tpfaff@google.com; Theophane Weber (DeepMind) theophane@google.com; Lars Buesing (DeepMind) lbuesing@google.com; Peter W. Battaglia (DeepMind) peterbattaglia@google.com |
| Pseudocode | Yes | Algorithm A.1 Pseudocode for the SAVE algorithm. (A hedged sketch of the combined SAVE objective appears after this table.) |
| Open Source Code | No | The paper mentions using TensorFlow and Sonnet but does not state that the code for the described methodology (SAVE) is open-source or provide a link. |
| Open Datasets | Yes | We evaluated SAVE on a set of 14 Atari games in the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | Yes | Under the adaptive curriculum, we only allowed an agent to progress to the next level of difficulty after it was able to solve at least 50% of the scenes at the current level of difficulty. (This progression rule is sketched after the table.) |
| Hardware Specification | Yes | In all experiments except Tabular Tightrope (see Section B.2) and Atari (see Appendix E), we use a distributed training setup with 1 GPU learner and 64 CPU actors. |
| Software Dependencies | No | The paper states that models were implemented using 'TensorFlow (Abadi et al., 2016) and Sonnet (Reynolds et al., 2017), and gradient descent was performed using the Adam optimizer (Kingma & Ba, 2014)', but does not specify version numbers for these software components. |
| Experiment Setup | Yes | For all experiments, we used a batch size of 16, a learning rate of 0.0002, a replay size of 4000 transitions (with a minimum history of 100 transitions), a replay ratio of 4, and updated the target network every 100 learning steps. ... For both the SAVE and PUCT agents we used loss coefficients of β_Q = 0.5 and β_A = 0.5. (These values are collected in the configuration sketch after this table.) |
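
The Pseudocode row above cites Algorithm A.1, in which SAVE combines Q-learning with MCTS by adding an amortization loss that distills the search's value estimates back into the Q-network. Below is a minimal sketch of that combined objective, assuming a one-step TD target and a softmax cross-entropy amortization term weighted by β_Q and β_A; all function and variable names are our own illustrations, not the authors' code.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def q_learning_loss(q, action, reward, q_next_target, gamma=0.99):
    """One-step TD error for the action actually taken."""
    td_target = reward + gamma * np.max(q_next_target)
    return 0.5 * (td_target - q[action]) ** 2

def amortization_loss(q, q_mcts):
    """Cross-entropy between the search-derived action distribution
    (softmax of the MCTS Q-estimates) and the network's softmax(Q)."""
    p_search = softmax(q_mcts)
    log_p_net = np.log(softmax(q) + 1e-12)
    return -np.sum(p_search * log_p_net)

def save_loss(q, action, reward, q_next_target, q_mcts,
              beta_q=0.5, beta_a=0.5):
    # beta_Q = beta_A = 0.5, as reported in the Experiment Setup row.
    return (beta_q * q_learning_loss(q, action, reward, q_next_target)
            + beta_a * amortization_loss(q, q_mcts))
```

In the paper, the search's per-action Q-estimates are stored alongside each transition during acting so that the amortization term can be computed at learning time; here `q_mcts` simply stands in for that stored array.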
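The Dataset Splits row describes the adaptive curriculum used in the Construction tasks. A minimal sketch of that progression rule, with an illustrative function name and the 50% solve-rate threshold quoted above:

```python
def should_advance(num_solved, num_scenes, threshold=0.5):
    """Advance to the next difficulty level only if the agent solved
    at least `threshold` of the scenes at the current level."""
    return num_scenes > 0 and num_solved / num_scenes >= threshold

# Example: 27 of 50 scenes solved -> advance; 20 of 50 -> stay.
assert should_advance(27, 50) and not should_advance(20, 50)
```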
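For convenience, the training hyperparameters quoted in the Experiment Setup row are collected below as a single Python dictionary; the key names are our own, and only the values come from the paper.

```python
SAVE_HYPERPARAMS = {
    "batch_size": 16,
    "learning_rate": 2e-4,        # 0.0002
    "replay_capacity": 4000,      # transitions
    "min_replay_history": 100,    # transitions before learning starts
    "replay_ratio": 4,
    "target_update_period": 100,  # learning steps between target-network updates
    "beta_q": 0.5,                # Q-learning loss coefficient
    "beta_a": 0.5,                # amortization loss coefficient
}
```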