Combining Q-Learning and Search with Amortized Value Estimates
Authors: Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, Peter W. Battaglia
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated SAVE in four distinct settings that vary in their branching factor, sparsity of rewards, and episode length. First, we demonstrate through a new Tightrope environment that SAVE performs well in settings where count-based policy approaches struggle, as discussed in Section 2.2. Next, we show that SAVE scales to the challenging Construction domain (Bapst et al., 2019) and that it alleviates the problem with off-policy actions discussed in Section 2.1. We also perform several ablations to tease apart the details of SAVE. Finally, we demonstrate that SAVE dramatically improves over Q-learning in a new and even more difficult construction task called Marble Run, as well as in more standard environments like Atari (Bellemare et al., 2013). |
| Researcher Affiliation | Industry | Jessica B. Hamrick (DeepMind) jhamrick@google.com; Victor Bapst (DeepMind) vbapst@google.com; Alvaro Sanchez-Gonzalez (DeepMind) alvarosg@google.com; Tobias Pfaff (DeepMind) tpfaff@google.com; Theophane Weber (DeepMind) theophane@google.com; Lars Buesing (DeepMind) lbuesing@google.com; Peter W. Battaglia (DeepMind) peterbattaglia@google.com |
| Pseudocode | Yes | Algorithm A.1 Pseudocode for the SAVE algorithm. (A hedged sketch of the combined SAVE objective appears after this table.) |
| Open Source Code | No | The paper mentions using TensorFlow and Sonnet but does not state that the code for the described methodology (SAVE) is open-source or provide a link. |
| Open Datasets | Yes | We evaluated SAVE on a set of 14 Atari games in the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | Yes | Under the adaptive curriculum, we only allowed an agent to progress to the next level of difficulty after it was able to solve at least 50% of the scenes at the current level of difficulty. (This progression rule is sketched after the table.) |
| Hardware Specification | Yes | In all experiments except Tabular Tightrope (see Section B.2) and Atari (see Appendix E), we use a distributed training setup with 1 GPU learner and 64 CPU actors. |
| Software Dependencies | No | The paper states that models were implemented using 'TensorFlow (Abadi et al., 2016) and Sonnet (Reynolds et al., 2017), and gradient descent was performed using the Adam optimizer (Kingma & Ba, 2014)', but does not specify version numbers for these software components. |
| Experiment Setup | Yes | For all experiments, we used a batch size of 16, a learning rate of 0.0002, a replay size of 4000 transitions (with a minimum history of 100 transitions), a replay ratio of 4, and updated the target network every 100 learning steps. ... For both the SAVE and PUCT agents we used loss coefficients of β_Q = 0.5 and β_A = 0.5. (These values are collected in the configuration sketch after this table.) |
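
The Pseudocode row above cites Algorithm A.1, in which SAVE combines Q-learning with MCTS by adding an amortization loss that distills the search's value estimates back into the Q-network. Below is a minimal sketch of that combined objective, assuming a one-step TD target and a softmax cross-entropy amortization term weighted by β_Q and β_A; all function and variable names are our own illustrations, not the authors' code.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def q_learning_loss(q, action, reward, q_next_target, gamma=0.99):
    """One-step TD error for the action actually taken."""
    td_target = reward + gamma * np.max(q_next_target)
    return 0.5 * (td_target - q[action]) ** 2

def amortization_loss(q, q_mcts):
    """Cross-entropy between the search-derived action distribution
    (softmax of the MCTS Q-estimates) and the network's softmax(Q)."""
    p_search = softmax(q_mcts)
    log_p_net = np.log(softmax(q) + 1e-12)
    return -np.sum(p_search * log_p_net)

def save_loss(q, action, reward, q_next_target, q_mcts,
              beta_q=0.5, beta_a=0.5):
    # beta_Q = beta_A = 0.5, as reported in the Experiment Setup row.
    return (beta_q * q_learning_loss(q, action, reward, q_next_target)
            + beta_a * amortization_loss(q, q_mcts))
```

In the paper, the search's per-action Q-estimates are stored alongside each transition during acting so that the amortization term can be computed at learning time; here `q_mcts` simply stands in for that stored array.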
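The Dataset Splits row describes the adaptive curriculum used in the Construction tasks. A minimal sketch of that progression rule, with an illustrative function name and the 50% solve-rate threshold quoted above:

```python
def should_advance(num_solved, num_scenes, threshold=0.5):
    """Advance to the next difficulty level only if the agent solved
    at least `threshold` of the scenes at the current level."""
    return num_scenes > 0 and num_solved / num_scenes >= threshold

# Example: 27 of 50 scenes solved -> advance; 20 of 50 -> stay.
assert should_advance(27, 50) and not should_advance(20, 50)
```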
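For convenience, the training hyperparameters quoted in the Experiment Setup row are collected below as a single Python dictionary; the key names are our own, and only the values come from the paper.

```python
SAVE_HYPERPARAMS = {
    "batch_size": 16,
    "learning_rate": 2e-4,        # 0.0002
    "replay_capacity": 4000,      # transitions
    "min_replay_history": 100,    # transitions before learning starts
    "replay_ratio": 4,
    "target_update_period": 100,  # learning steps between target-network updates
    "beta_q": 0.5,                # Q-learning loss coefficient
    "beta_a": 0.5,                # amortization loss coefficient
}
```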