Single-Agent Policy Tree Search With Guarantees
Authors: Laurent Orseau, Levi H. S. Lelis, Tor Lattimore, Théophane Weber
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate these tree search algorithms on 1,000 computer-generated levels of Sokoban, where the policy used to guide the search comes from a neural network trained using A3C. Our results show that the policy tree search algorithms we introduce are competitive with a state-of-the-art domain-independent planner that uses heuristic search. |
| Researcher Affiliation | Collaboration | Laurent Orseau, DeepMind, London, UK, lorseau@google.com; Levi H. S. Lelis, Universidade Federal de Viçosa, Brazil, levi.lelis@ufv.br; Tor Lattimore, DeepMind, London, UK, lattimore@google.com; Théophane Weber, DeepMind, London, UK, theophane@google.com |
| Pseudocode | Yes | Algorithm 1: Levin tree search. ... Algorithm 2: Sampling and execution of a single trajectory. ... Algorithm 3: Sampling of nsims trajectories of fixed depth dmax ∈ ℕ+. ... Algorithm 4: Sampling of nsims trajectories of depths that follow A6519, with an optional coefficient dmin ∈ ℕ+. (Minimal sketches of the Levin TS expansion order and of the A6519 depth schedule appear after this table.) |
| Open Source Code | No | The paper provides a link to the computer-generated levels (data) used in the experiments: "The levels are available at https://github.com/deepmind/boxoban-levels/unfiltered/test." However, it does not provide the source code for the proposed Levin TS or Luby TS algorithms. |
| Open Datasets | Yes | We test our algorithms on 1,000 computer-generated levels of Sokoban [Racanière et al., 2017] of 10x10 grid cells and 4 boxes. The levels are available at https://github.com/deepmind/boxoban-levels/unfiltered/test. |
| Dataset Splits | No | The paper evaluates on "1,000 computer-generated levels of Sokoban" and states that the A3C policy was "learned only from the 65% easiest levels". However, it does not give explicit split information (percentages, sample counts, or citations to predefined splits) for training, validation, and testing on these 1,000 levels, nor does it explain how the 65% easiest levels were identified, which would be needed to reproduce the policy training. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions the use of "A3C" for training the policy and comparison with "LAMA planner" and "Fast Downward", but it does not specify software names with version numbers (e.g., Python version, specific library versions like PyTorch or TensorFlow, or exact versions of the planners). |
| Experiment Setup | Yes | We test the following algorithms and parameters: Luby TS(256,1), Luby TS(256,32), Luby TS(512,32), multi TS(1,200), multi TS(100,200), multi TS(200,200), Levin TS. ... In addition to the policy trained with A3C, we tested Levin TS, Luby TS, and multi TS with a variant of the policy in which we add 1% of noise to the probabilities output by the neural network. That is, these variants use the policy π̃(a|n) = (1 − ε)π(a|n) + ε/4, where π is the network's policy and ε = 0.01, to guide their search. (A sketch of this noise mixing appears after the table.) |
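
For reference, here is a minimal best-first sketch of the Levin TS idea summarized in the Pseudocode row: nodes are expanded in increasing order of d(n)/π(n), the node depth divided by the product of the policy probabilities along the path from the root. This is not the authors' implementation; the helpers `actions_fn`, `step_fn`, `is_goal_fn`, `policy_fn`, and the `max_expansions` budget are placeholders for illustration, and duplicate-state detection is omitted.

```python
import heapq
import itertools

def levin_tree_search(root, actions_fn, step_fn, is_goal_fn, policy_fn,
                      max_expansions=100_000):
    """Best-first tree search ordered by the Levin cost d(n) / pi(n),
    where d(n) is the depth of node n and pi(n) is the product of the
    policy probabilities of the actions leading to n (sketch only)."""
    tie = itertools.count()  # tie-breaker so heapq never compares states
    # Frontier entries: (cost, tie, depth, path_probability, state, plan)
    frontier = [(0.0, next(tie), 0, 1.0, root, [])]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, depth, prob, state, plan = heapq.heappop(frontier)
        if is_goal_fn(state):
            return plan  # sequence of actions reaching a goal state
        expansions += 1
        action_probs = policy_fn(state)  # maps each action to a probability
        for action in actions_fn(state):
            child_prob = prob * action_probs[action]
            if child_prob <= 0.0:
                continue  # the policy gives this branch zero probability
            cost = (depth + 1) / child_prob  # Levin cost d(n) / pi(n)
            heapq.heappush(frontier,
                           (cost, next(tie), depth + 1, child_prob,
                            step_fn(state, action), plan + [action]))
    return None  # search budget exhausted
```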
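
Algorithm 4 samples trajectory depths that follow OEIS A6519 (the largest power of two dividing k), optionally scaled by a coefficient dmin. Assuming the k-th sampled trajectory uses depth dmin · A6519(k), which is one plausible reading of the description, the depth schedule looks like this:

```python
def a6519(k):
    """k-th term of OEIS A6519: the largest power of two dividing k (k >= 1)."""
    return k & -k

def luby_ts_depths(nsims, dmin=1):
    """Depth schedule for nsims sampled trajectories, scaled by dmin (sketch)."""
    return [dmin * a6519(k) for k in range(1, nsims + 1)]

# Example: luby_ts_depths(8) -> [1, 2, 1, 4, 1, 2, 1, 8]
```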
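
The noisy-policy variant quoted in the Experiment Setup row mixes the network's output with a uniform distribution over the four Sokoban actions. A minimal sketch, assuming the policy is given as a probability vector over actions (the function name is illustrative, not from the paper):

```python
import numpy as np

def add_policy_noise(probs, epsilon=0.01):
    """pi'(a|n) = (1 - eps) * pi(a|n) + eps / 4 for the 4 Sokoban actions."""
    probs = np.asarray(probs, dtype=float)
    num_actions = probs.shape[-1]  # 4 in the Sokoban setting described above
    return (1.0 - epsilon) * probs + epsilon / num_actions

# Example: add_policy_noise([0.7, 0.1, 0.1, 0.1]) -> [0.6955, 0.1015, 0.1015, 0.1015]
```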