Policy-Guided Heuristic Search with Guarantees

Authors: Laurent Orseau, Levi H. S. Lelis (pp. 12382-12390)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically on the sliding-tile puzzle, Sokoban, and a puzzle from the commercial game The Witness that PHS enables the rapid learning of both a policy and a heuristic function and compares favorably with A*, Weighted A*, Greedy Best-First Search, Levin TS, and PUCT in terms of number of problems solved and search time in all three domains tested.
Researcher Affiliation | Collaboration | 1) DeepMind, UK; 2) Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada
Pseudocode | Yes | Algorithm 1: The Policy-guided Heuristic Search algorithm (PHS), based on the Best-First Search algorithm (BFS).
Open Source Code | Yes | Datasets and code are at https://github.com/levilelis/h-levin.
Open Datasets | Yes | Sokoban (10×10): we use the first 50,000 training problems and the provided 1,000 test problems from Boxoban (Guez et al. 2018). The Witness (4×4): a puzzle extracted from the video game of the same name (Abel et al. 2020) that consists of finding a path on a 2D grid that separates cells of different colors. Sliding Tile Puzzle (5×5): a traditional benchmark in the heuristic search literature where heuristic functions can be very effective (Korf 1985; Felner, Korf, and Hanan 2004). The training set is generated with random walks from the 5×5 solution state, with walks of lengths between 50 and 1,000 steps; this is a difficult training set as there are no very easy problems. Test problems are generated randomly and unsolvable problems are filtered out by a parity check; note that this is like scrambling infinitely often, which makes the test problems often harder than the training ones. Datasets and code are at https://github.com/levilelis/h-levin. (A sketch of this instance-generation procedure is given after the table.)
Dataset Splits | No | The paper describes a Bootstrap process where problems solved during search are used to update model parameters, and explicitly states that "The test problems are not used during training." However, it does not mention a separate, explicit validation split for hyperparameter tuning or model selection.
Hardware Specification | No | The paper mentions "no GPU" but does not provide specific details such as CPU model, memory, or other computer specifications for the experimental setup.
Software Dependencies | No | The paper mentions the use of a "neural network" and implies standard machine learning libraries, but does not provide specific version numbers for any software or libraries (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | For PUCT, we normalize the Q-values (Schrittwieser et al. 2020) and use a virtual loss (Chaslot, Winands, and van den Herik 2008) of unit increment, with the following selection rule for a child n' of a node n: h̄(n') = (h(n') + virtual_loss(n') - h_min) / (h_max - h_min) and PUCT(n'; n) = -h̄(n') + c π(n'|n) sqrt(Σ_{n'' ∈ C(n)} N(n'')), where N(n) is the number of times the node n has been visited, C(n) is the set of children of n, and c is a constant set to 1 in the experiments; keep in mind that h corresponds to losses, hence the negative sign. The virtual loss allows for evaluating the nodes in batch: we first collect 32 nodes for evaluation using the PUCT rule and only then evaluate all nodes in batch with the neural network. Similarly, for the BFS-based algorithms, 32 nodes are evaluated in batch with the neural network before insertion into the priority queue (Agostinelli et al. 2019). We use the mean squared error (MSE) loss for learning the heuristic functions: for a found solution node n*, the loss for a node n ∈ anc(n*) is [(d(n*) - d(n)) - h(n)]^2, where h(n) is the output of the network. The cross-entropy loss is used to learn the policy for PUCT. (A sketch of the selection rule and the heuristic-loss targets is given after the table.)
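The sliding-tile instance generation quoted under Open Datasets (training problems from random walks of 50 to 1,000 steps away from the goal; test problems from uniformly random permutations filtered by a parity check) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the state encoding, the goal layout, and the standard odd-width inversion-parity solvability test are assumptions made here.

```python
import random

WIDTH = 5                       # 5x5 sliding-tile puzzle; 0 denotes the blank
GOAL = tuple(range(WIDTH * WIDTH))

def neighbors(state):
    """States reachable by sliding one adjacent tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, WIDTH)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < WIDTH and 0 <= c < WIDTH:
            s = list(state)
            s[blank], s[r * WIDTH + c] = s[r * WIDTH + c], s[blank]
            yield tuple(s)

def random_walk_instance(min_len=50, max_len=1000):
    """Training instance: random walk of 50-1000 moves from the solved state."""
    state = GOAL
    for _ in range(random.randint(min_len, max_len)):
        state = random.choice(list(neighbors(state)))
    return state

def is_solvable(state):
    """Parity check: for odd board widths, a permutation is solvable iff the
    number of inversions among the non-blank tiles is even."""
    tiles = [t for t in state if t != 0]
    inversions = sum(1 for i in range(len(tiles))
                     for j in range(i + 1, len(tiles))
                     if tiles[i] > tiles[j])
    return inversions % 2 == 0

def random_test_instance():
    """Test instance: uniformly random permutation, kept only if solvable."""
    while True:
        state = list(GOAL)
        random.shuffle(state)
        if is_solvable(tuple(state)):
            return tuple(state)
```

Sampling uniformly from all solvable permutations is what "scrambling infinitely often" amounts to, which is why these test instances tend to be harder than the random-walk training instances.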
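The PUCT selection rule and the heuristic-loss targets quoted under Experiment Setup translate roughly into the sketch below. It follows the formulas as quoted; the node interface (h, virtual_loss, prior, visits, children, depth, parent) is illustrative, and the bookkeeping in the actual h-levin code may differ (for example, how h_min and h_max are tracked or how the virtual loss is incremented and cleared).

```python
import math

def normalized_loss(h, virtual_loss, h_min, h_max):
    # Normalize the predicted loss plus virtual loss into [0, 1], as in the
    # quoted rule: (h + virtual_loss - h_min) / (h_max - h_min).
    if h_max == h_min:
        return 0.0
    return (h + virtual_loss - h_min) / (h_max - h_min)

def puct_score(child, parent, c=1.0, h_min=0.0, h_max=1.0):
    # Score for `child` of `parent` following the quoted rule:
    #   -h_bar(child) + c * pi(child | parent) * sqrt(sum of sibling visit counts).
    # The first term is negated because h corresponds to losses to be minimized.
    h_bar = normalized_loss(child.h, child.virtual_loss, h_min, h_max)
    sibling_visits = sum(n.visits for n in parent.children)
    return -h_bar + c * child.prior * math.sqrt(sibling_visits)

def heuristic_targets(solution_node):
    # MSE regression targets: for every ancestor n of a found solution node n*,
    # the target is d(n*) - d(n), i.e. the remaining distance to the solution
    # along the found path; training minimizes [(d(n*) - d(n)) - h(n)]^2.
    targets = []
    node, solution_depth = solution_node, solution_node.depth
    while node is not None:
        targets.append((node, solution_depth - node.depth))
        node = node.parent
    return targets
```

Batched evaluation as described above would collect 32 leaves selected by puct_score (incrementing their virtual losses along the way) before running the neural network on all of them at once.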