Policy-Guided Heuristic Search with Guarantees

Authors: Laurent Orseau, Levi H. S. Lelis (pp. 12382-12390)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically on the sliding-tile puzzle, Sokoban, and a puzzle from the commercial game The Witness that PHS enables the rapid learning of both a policy and a heuristic function and compares favorably with A*, Weighted A*, Greedy Best-First Search, Levin TS, and PUCT in terms of number of problems solved and search time in all three domains tested.
Researcher Affiliation | Collaboration | 1) DeepMind, UK; 2) Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada
Pseudocode | Yes | Algorithm 1: The Policy-guided Heuristic Search algorithm (PHS), based on the Best-First Search algorithm (BFS).
Open Source Code | Yes | Datasets and code are at https://github.com/levilelis/h-levin.
Open Datasets | Yes | Sokoban (10×10): we use the first 50,000 training problems and the provided 1,000 test problems from Boxoban (Guez et al. 2018). The Witness (4×4): a puzzle extracted from the video game of the same name (Abel et al. 2020) that consists of finding a path on a 2D grid that separates cells of different colors. Sliding Tile Puzzle (5×5): a traditional benchmark in the heuristic search literature where heuristic functions can be very effective (Korf 1985; Felner, Korf, and Hanan 2004). The training set is generated with random walks from the 5×5 solution state, with walks of lengths between 50 and 1,000 steps; this is a difficult training set as there are no very easy problems. Test problems are generated randomly and unsolvable problems are filtered out by a parity check; note that this is like scrambling infinitely often, which makes the test problems often harder than the training ones. Datasets and code are at https://github.com/levilelis/h-levin. (A sketch of this instance-generation procedure is given after the table.)
Dataset Splits | No | The paper describes a Bootstrap process where problems solved during search are used to update model parameters, and explicitly states that "The test problems are not used during training." However, it does not mention a separate, explicit validation split for hyperparameter tuning or model selection.
Hardware Specification | No | The paper mentions "no GPU" but does not provide specific details such as CPU model, memory, or other computer specifications for the experimental setup.
Software Dependencies | No | The paper mentions the use of a "neural network" and implies standard machine learning libraries, but does not provide specific version numbers for any software or libraries (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | For PUCT, we normalize the Q-values (Schrittwieser et al. 2020) and use a virtual loss (Chaslot, Winands, and van den Herik 2008) of unit increment, with the following selection rule for a child n' of a node n: h̄(n') = (h(n') + virtual_loss(n') - h_min) / (h_max - h_min) and PUCT(n'; n) = -h̄(n') + c π(n'|n) sqrt(Σ_{n'' ∈ C(n)} N(n'')), where N(n) is the number of times the node n has been visited, C(n) is the set of children of n, and c is a constant set to 1 in the experiments; keep in mind that h corresponds to losses, hence the negative sign. The virtual loss allows for evaluating the nodes in batch: we first collect 32 nodes for evaluation using the PUCT rule and only then evaluate all nodes in batch with the neural network. Similarly, for the BFS-based algorithms, 32 nodes are evaluated in batch with the neural network before insertion into the priority queue (Agostinelli et al. 2019). We use the mean squared error (MSE) loss for learning the heuristic functions: for a found solution node n*, the loss for a node n ∈ anc(n*) is [(d(n*) - d(n)) - h(n)]^2, where h(n) is the output of the network. The cross-entropy loss is used to learn the policy for PUCT. (A sketch of the selection rule and the heuristic-loss targets is given after the table.)
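The sliding-tile instance generation quoted under Open Datasets (training problems from random walks of 50 to 1,000 steps away from the goal; test problems from uniformly random permutations filtered by a parity check) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the state encoding, the goal layout, and the standard odd-width inversion-parity solvability test are assumptions made here.

```python
import random

WIDTH = 5                       # 5x5 sliding-tile puzzle; 0 denotes the blank
GOAL = tuple(range(WIDTH * WIDTH))

def neighbors(state):
    """States reachable by sliding one adjacent tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, WIDTH)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < WIDTH and 0 <= c < WIDTH:
            s = list(state)
            s[blank], s[r * WIDTH + c] = s[r * WIDTH + c], s[blank]
            yield tuple(s)

def random_walk_instance(min_len=50, max_len=1000):
    """Training instance: random walk of 50-1000 moves from the solved state."""
    state = GOAL
    for _ in range(random.randint(min_len, max_len)):
        state = random.choice(list(neighbors(state)))
    return state

def is_solvable(state):
    """Parity check: for odd board widths, a permutation is solvable iff the
    number of inversions among the non-blank tiles is even."""
    tiles = [t for t in state if t != 0]
    inversions = sum(1 for i in range(len(tiles))
                     for j in range(i + 1, len(tiles))
                     if tiles[i] > tiles[j])
    return inversions % 2 == 0

def random_test_instance():
    """Test instance: uniformly random permutation, kept only if solvable."""
    while True:
        state = list(GOAL)
        random.shuffle(state)
        if is_solvable(tuple(state)):
            return tuple(state)
```

Sampling uniformly from all solvable permutations is what "scrambling infinitely often" amounts to, which is why these test instances tend to be harder than the random-walk training instances.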
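The PUCT selection rule and the heuristic-loss targets quoted under Experiment Setup translate roughly into the sketch below. It follows the formulas as quoted; the node interface (h, virtual_loss, prior, visits, children, depth, parent) is illustrative, and the bookkeeping in the actual h-levin code may differ (for example, how h_min and h_max are tracked or how the virtual loss is incremented and cleared).

```python
import math

def normalized_loss(h, virtual_loss, h_min, h_max):
    # Normalize the predicted loss plus virtual loss into [0, 1], as in the
    # quoted rule: (h + virtual_loss - h_min) / (h_max - h_min).
    if h_max == h_min:
        return 0.0
    return (h + virtual_loss - h_min) / (h_max - h_min)

def puct_score(child, parent, c=1.0, h_min=0.0, h_max=1.0):
    # Score for `child` of `parent` following the quoted rule:
    #   -h_bar(child) + c * pi(child | parent) * sqrt(sum of sibling visit counts).
    # The first term is negated because h corresponds to losses to be minimized.
    h_bar = normalized_loss(child.h, child.virtual_loss, h_min, h_max)
    sibling_visits = sum(n.visits for n in parent.children)
    return -h_bar + c * child.prior * math.sqrt(sibling_visits)

def heuristic_targets(solution_node):
    # MSE regression targets: for every ancestor n of a found solution node n*,
    # the target is d(n*) - d(n), i.e. the remaining distance to the solution
    # along the found path; training minimizes [(d(n*) - d(n)) - h(n)]^2.
    targets = []
    node, solution_depth = solution_node, solution_node.depth
    while node is not None:
        targets.append((node, solution_depth - node.depth))
        node = node.parent
    return targets
```

Batched evaluation as described above would collect 32 leaves selected by puct_score (incrementing their virtual losses along the way) before running the neural network on all of them at once.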