Single Player Monte-Carlo Tree Search Based on the Plackett-Luce Model

Authors: Felix Mohr, Viktor Bengs, Eyke Hüllermeier, pp. 12373-12381

AAAI 2021

Reproducibility Variable Result LLM Response
Research Type Experimental Evaluation

Experimental Setup. Empirical evaluations on the addressed problem are hard to conduct, especially if they require the (very costly) development of a simulator to compute the score function φ. While most previous approaches have only been evaluated on a single domain, we use three benchmark problems with different variations to compare the algorithms. The set of baselines has also usually been quite limited, in that the only baseline has been UCT. In contrast, we compare PL-MCTS against seven algorithms to enable more profound insights. The technical setup is as follows. To maximize comparability, all algorithms are implemented in Java, and the MCTS algorithms only deviate in the implementation of the tree policy. Implementations are available at <anonymized> and are seedable such that the results are reproducible. All computations were run on one core of an Intel Xeon Gold "Skylake" 6148, 2.4 GHz with 16 GB memory. In order to guarantee a fair comparison, each algorithm was run until a predefined overall time limit was met, which is the same for all algorithms but varies among the problems as specified below.

Benchmark Problems. We consider three problems to evaluate our approach, which we now describe in detail. To satisfy the needs of UCT, all scores are mapped into the unit interval by exploiting lower and upper bounds, respectively; the score in the maximization problem is subtracted from 1 in order to obtain a minimization problem.

The Timed TSP with Breaks (TTSPB) extends the classical TSP in two ways. First, the cost to travel from one location to another now depends on the time (hour of day) when the edge is used (to model traffic jams), as suggested in [Picard and Queyranne 1978]. Second, truck drivers must take breaks after some time, as considered recently in [van der Tuin, de Weerdt, and Batz 2018]: a driver must make a short break of 15 minutes every 4 hours and a long break (of 8 hours) every 8 hours. We are not aware of any heuristic for this highly state-dependent action cost setup. We consider problems with 20, 30, and 50 locations with timeouts of 1, 4, and 10 minutes, respectively.

Job Scheduling with Variable Release Dates (VRDJS) is the scheduling problem of minimizing the total weighted flowtime when the arrival time of each job can be chosen but must not be later than a common deadline d. This recently studied problem has been shown to be not only NP-hard but also such that no meaningful heuristics can be generated by currently known techniques [Mohr, Mejía, and Yuraszeck 2021]. We evaluate the case of 100 jobs on 2, 5, and 9 machines with job weights and processing times sampled uniformly from {1, .., 100} and d corresponding to 40% of the total processing time, which constitutes difficult instances. The timeout is 5 minutes in all setups.

Same Game is a single-player board game in which the player has to clear a vertical l × l board initially filled randomly with pieces of 5 different colors. Same Game has been the subject of evaluation in single-player MCTS [Schadd et al. 2008]; we refer to this paper for details, in which the authors also discuss why Same Game cannot be addressed efficiently with classical A* or IDA*. Timeouts are 5, 15, and 90 minutes for board sizes l = 10, 15, and 20, respectively.
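The mapping of raw scores into the unit interval, including the flip of the Same Game maximization objective into a minimization one, is a small but important detail for UCT. The following Java sketch illustrates it under the assumption that per-instance lower and upper bounds are known; the class and its names are hypothetical and do not correspond to the authors' implementation.

    /**
     * Maps a raw score into the unit interval using known per-instance bounds,
     * and flips maximization problems (e.g., Same Game) into minimization ones.
     * Hypothetical illustration of the normalization described above.
     */
    public final class ScoreNormalizer {

        private final double lowerBound;
        private final double upperBound;
        private final boolean maximization;

        public ScoreNormalizer(double lowerBound, double upperBound, boolean maximization) {
            if (upperBound <= lowerBound) {
                throw new IllegalArgumentException("upper bound must exceed lower bound");
            }
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
            this.maximization = maximization;
        }

        /** Returns a normalized score in [0, 1] that the search should minimize. */
        public double normalize(double rawScore) {
            double unit = (rawScore - lowerBound) / (upperBound - lowerBound); // map into [0, 1]
            return maximization ? 1.0 - unit : unit; // subtract from 1 to turn maximization into minimization
        }
    }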
Algorithm Setup and Baselines. PL-MCTS can be parametrized in the preference kernel κ and the influence function γ. Using a bootstrapping-based kernel for κ, one can configure the number and size of the bootstrap samples as well as the aggregation statistic. The only reason to set the first two parameters low is resource limitations, and, using generous values of 1000 bootstraps (rankings) based on 100 samples per child each, we will see that we do not observe performance issues. As an aggregation function, we use by default the mean value of each bootstrap sample to derive rankings, as discussed above. In addition, we show the results that would be obtained when choosing the mean standard deviation or the minimum as the aggregation function. Regarding the influence function γ, the resource-aware proposal described above constitutes a reasonable default setup. An estimate of the number of roll-outs T is updated in each iteration based on the empirical roll-out times and the time budget (in terms of wall time), and the thresholds τ1 and τ2 are updated based on this estimate (and hence vary over time for each node).
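As a concrete illustration of the bootstrapping-based preference kernel with the default mean aggregation statistic described above, the following Java sketch resamples the roll-out scores observed under each child into bootstrap samples and derives one child ranking per bootstrap. Class and method names are hypothetical placeholders and do not correspond to the AILibs implementation; the rankings would subsequently be fed into the Plackett-Luce model.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;
    import java.util.function.ToDoubleFunction;

    /** Hypothetical sketch of a bootstrapping-based preference kernel. */
    public final class BootstrapPreferenceKernel {

        private final int numBootstraps;                     // e.g., 1000 rankings
        private final int bootstrapSize;                     // e.g., 100 samples per child
        private final ToDoubleFunction<double[]> aggregate;  // e.g., the mean (default)
        private final Random random = new Random(42);        // seedable for reproducibility

        public BootstrapPreferenceKernel(int numBootstraps, int bootstrapSize,
                                         ToDoubleFunction<double[]> aggregate) {
            this.numBootstraps = numBootstraps;
            this.bootstrapSize = bootstrapSize;
            this.aggregate = aggregate;
        }

        /** observationsPerChild[c] holds the roll-out scores observed under child c. */
        public List<List<Integer>> drawRankings(double[][] observationsPerChild) {
            List<List<Integer>> rankings = new ArrayList<>();
            for (int b = 0; b < numBootstraps; b++) {
                double[] statistic = new double[observationsPerChild.length];
                for (int c = 0; c < observationsPerChild.length; c++) {
                    statistic[c] = aggregate.applyAsDouble(resample(observationsPerChild[c]));
                }
                rankings.add(rank(statistic));               // one ranking per bootstrap
            }
            return rankings;
        }

        private double[] resample(double[] scores) {
            double[] sample = new double[bootstrapSize];
            for (int i = 0; i < bootstrapSize; i++) {
                sample[i] = scores[random.nextInt(scores.length)]; // draw with replacement
            }
            return sample;
        }

        /** Ranks child indices by ascending statistic (lower score = better in minimization). */
        private List<Integer> rank(double[] statistic) {
            List<Integer> order = new ArrayList<>();
            for (int c = 0; c < statistic.length; c++) {
                order.add(c);
            }
            order.sort(Comparator.comparingDouble((Integer c) -> statistic[c]));
            return order;
        }
    }

With the values quoted above, a default mean kernel would be instantiated roughly as new BootstrapPreferenceKernel(1000, 100, scores -> java.util.Arrays.stream(scores).average().orElse(0.0)).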
In the following, we describe the algorithms against which we compare PL-MCTS in a default-parameter evaluation. Comparing MCTS algorithms is a tricky undertaking, because these algorithms have parameters, and the success of an algorithm can vary substantially with different parameter values. The two canonical possibilities for an evaluation are either to fix one reasonable default value for each algorithm or to optimize over the parameters and search for the best parameters of each algorithm in each setup. The problem with the latter is that it is methodologically difficult to decide which optimization is fair and which is not. For example, if one algorithm has five parameters and another algorithm has only one, the optimization process for the second algorithm is much faster. How much time is allowed for this optimization? In fact, this kind of optimization already constitutes a search on a meta-level, and the resources used for it should count towards the overall time budget of each considered algorithm. While this approach is not uninteresting, we are not aware that it has been done in a methodologically clean way to date. Instead, we base our evaluation on a default parametrization per algorithm, in which every algorithm is equipped with a reasonable parametrization that should work well on average. The exact parametrizations can be found in the supplement.

The variants of MCTS we consider are as follows. First, we consider three state-of-the-art MCTS approaches for learning policies for MDPs: UCT [Kocsis, Szepesvári, and Willemson 2006], DNG [Bai, Wu, and Chen 2013], and BRUE [Feldman and Domshlak 2014]. The parametrizations are fixed to reasonable default values; details per algorithm are contained in the supplement. Second, we consider the single-player version of UCT proposed in [Schadd et al. 2008], which applies a slight modification of UCB1 that considers not only the mean but also the standard deviation of the samples; this is denoted as SP-UCT. Third, we consider the TAG algorithm [De Mesmay et al. 2009], an MCTS approach based on extreme bandits and specifically tailored for single-player games.

We also include two algorithms specifically dedicated to single-player games, even though they do not fit directly into the MCTS scheme. The first is Nested Monte-Carlo Search (NMCS) [Cazenave 2009]. NMCS is simply a random search that, instead of starting a single random search at the root, starts one random search under each node at a predefined depth (level). The algorithm then simply memorizes the best solution seen so far, and this process is repeated until the time budget is exhausted. There is no notion of using observations to guide the search; in particular, there is nothing like a tree policy. Nested Rollout Policy Adaptation for Monte Carlo Tree Search (NRPA) [Rosin 2011] is a kind of extension of this algorithm. The idea of initiating different searches at a pre-defined depth is the same, but NRPA adopts a learning search process under those nodes instead of drawing only a single sample. While it also does not make use of explicit tree-policy or default-policy concepts, it does incorporate observations into the decision-making process by modifying the selection probabilities of the children, more precisely by increasing the probability that better children are drawn the next time. It therefore evolves the policy over time. In a way, it can be seen as an algorithm that initiates, for each node at depth l, a learning search procedure that is independent of the searches started under the other nodes. Like NMCS, it returns the best result observed during search. In addition, we add two algorithms to the baseline portfolio that mark the extremes of the exploration-exploitation spectrum: on the exploration side, a simple random search (RANDOM), which does not take the observations made during search into account at all; on the exploitation side, a classical best-first (BF) search that derives its node evaluations from a fixed number of 3 random samples (taking the minimum of these values).

Experimental Results. The experimental results are summarized in Table 1. Due to the different ranges of possible scores in the different instances, we report relative scores: the score of each algorithm is (score - best)/best on instances of the minimization problems and score/best on instances of Same Game, where best is the best score observed for any algorithm on the respective instance. We report the empirical average of these values over 100 different instances together with their empirical standard deviation. Best average scores are shown in bold, while results whose distribution might be identical to the best (according to a Wilcoxon signed-rank test at the 5% significance level, i.e., 95% confidence) are underlined. We do not want to give the impression of cherry-picking among the different kernels for PL-MCTS. Therefore, PL-MCTS is primarily represented by the mean kernel, as the out-of-the-box choice for using PL-MCTS. However, to provide additional insight, the results for the other two kernels are still given in the separate last two rows. They are not considered in the computation of the best algorithm performances (and hence are never marked in bold, even if best). Instead, a dedicated symbol indicates that the respective kernel would have improved upon the best other solution in that scenario.
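For clarity, the relative-score computation just described can be written down directly; the following helper is a hypothetical illustration of the two formulas, not code from the authors' repository.

    /**
     * Relative scores as reported in the evaluation: for the minimization problems
     * the relative gap (score - best) / best, and for Same Game (a maximization
     * problem) the ratio score / best, where best is the best score observed for
     * any algorithm on the instance. Hypothetical helper for illustration only.
     */
    public final class RelativeScore {

        private RelativeScore() {
        }

        public static double minimizationGap(double score, double best) {
            return (score - best) / best; // 0 means the algorithm matched the best result
        }

        public static double sameGameRatio(double score, double best) {
            return score / best; // 1 means the algorithm matched the best result
        }
    }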
The experimental evaluation gives clear evidence that PL-MCTS is a highly competitive algorithm for the addressed problem. On almost all problems, PL-MCTS achieves the best results, in several cases with substantial gaps to the runner-up, e.g., in all TTSPB problems and in the 2- and 5-machine problems of VRDJS. PL-MCTS is on no problem substantially worse than any other tree policy; the only algorithm that manages to outperform PL-MCTS in one case is the BF algorithm on the 15x15 Same Game problem. The mediocre performance of the theoretically grounded UCT and BRUE should not come as a surprise. First, their bounds are for cumulative and simple regret, which, as discussed above, are not particularly relevant for our problem. Second, the regret bounds are fairly loose and hence admit slow convergence.

Another remarkable observation is that SP-UCT and the TAG algorithm, both specifically tailored for this problem class, largely fail to produce competitive results. The TAG algorithm is extremely greedy and similar to the minimum preference kernel. The important difference in this case is that PL-MCTS is not as quick in applying the policy and hence allows for more exploration before committing to the node recommended by the policy. The 15x15 Same Game scenario is interesting as it constitutes a scenario in which following the mean is a poor strategy. In fact, all mean-oriented techniques such as UCT, BRUE, DNG, and PL-MCTS-MEAN are not very stable in reaching the high-quality regions. The fact that TAG also fails on this problem shows that here it is a good idea to follow the minimum, but not too greedily (PL-MCTS-MIN has the advantage of not committing too quickly and of regularizing the observations by bootstrapping). It is also noteworthy that the potential performance bottlenecks discussed above were successfully eliminated by the proposed remedies. In the supplement we provide a detailed overview of the iterations achieved by each algorithm, from which it becomes evident that PL-MCTS is not significantly slower than the competitors and is often even faster than DNG.

Our conclusion from this evaluation is that PL-MCTS is a very promising approach for finding good solutions to optimal path search problems for which no meaningful heuristic is available. While the mean kernel provides a solid fallback, the flexibility of the control strategy via the preference kernels points to interesting future work on learning which kernel to use in a particular domain, or even detecting online which one to use.
Researcher Affiliation Academia 1 Universidad de La Sabana, Campus del Puente del Común, Km. 7, Autopista Norte de Bogotá, Chía, Colombia 2 Paderborn University, Warburgerstraße 100, Paderborn, Germany
Pseudocode Yes Algorithm 1: Monte Carlo Tree Path Search

    create root node n0, L = ∅
    while within computational budget T do          /* iteratively call π(s, ni, T) */
        n ← TREEPOLICY.GETOPENNODE(n0)
        l ← DEFAULTPOLICY(n)
        L = L ∪ {l}
        TREEPOLICY.UPDATE(⟨n0, .., n⟩, φ(l))        /* update state s used for π */
    end
    return ⟨n0, .., l⟩ with l ← argmin_{l ∈ L} φ(l)
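Read as code, the loop of Algorithm 1 could be sketched in Java as follows. The TreePolicy and DefaultPolicy interfaces and the generic types are hypothetical placeholders, not the actual AILibs API, and for brevity the sketch returns only the best leaf rather than the full path from the root.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.ToDoubleFunction;

    /** Minimal sketch of the Monte Carlo Tree Path Search loop from Algorithm 1. */
    public final class MonteCarloTreePathSearch<N, L> {

        public interface TreePolicy<N> {
            /** Descends from the root and returns the path ⟨n0, .., n⟩ to an expandable node. */
            List<N> getOpenPath(N root);
            /** Propagates the observed leaf score φ(l) along the given path. */
            void update(List<N> path, double score);
        }

        public interface DefaultPolicy<N, L> {
            /** Completes the path under the given node by a random roll-out and returns the leaf. */
            L rollout(N node);
        }

        public L search(N root, TreePolicy<N> treePolicy, DefaultPolicy<N, L> defaultPolicy,
                        ToDoubleFunction<L> phi, long budgetMillis) {
            List<L> leaves = new ArrayList<>();                   // L = empty set
            long deadline = System.currentTimeMillis() + budgetMillis;
            while (System.currentTimeMillis() < deadline) {       // within computational budget T
                List<N> path = treePolicy.getOpenPath(root);      // path to open node n
                N n = path.get(path.size() - 1);
                L leaf = defaultPolicy.rollout(n);                 // l <- DEFAULTPOLICY(n)
                leaves.add(leaf);                                  // L = L u {l}
                treePolicy.update(path, phi.applyAsDouble(leaf));  // TREEPOLICY.UPDATE(path, phi(l))
            }
            return leaves.stream()                                 // best leaf seen so far (argmin phi)
                    .min(Comparator.comparingDouble(phi::applyAsDouble))
                    .orElseThrow(() -> new IllegalStateException("budget too small for a single roll-out"));
        }
    }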
Open Source Code Yes The code of PL-MCTS is published in the Java library AILibs (https://github.com/starlibs/AILibs) and can be directly used through a Maven artifact (https://mvnrepository.com/artifact/ai.libs/jaicore-search/0.2.4).
Open Datasets No The paper describes benchmark problems (TTSPB, VRDJS, Same Game) and how instances were generated or configured (e.g., "Job weights and processing times sampled uniformly from {1, .., 100}" for VRDJS). While it cites papers related to these problems, it does not provide explicit access information (links, DOIs, specific repositories) for the *exact datasets or instance sets* used in their experiments, nor does it refer to them as standard publicly available datasets with specific access points for reproduction.
Dataset Splits No The paper describes optimization and search problems, and the evaluation involves running algorithms until a predefined time limit or number of rollouts. It does not mention traditional machine learning dataset splits (e.g., train/validation/test percentages or counts) for reproduction. The problems involve generating or configuring instances for search, rather than using fixed datasets with pre-defined splits.
Hardware Specification Yes All computations were run on one core of an Intel Xeon Gold "Skylake" 6148, 2.4 GHz with 16GB memory.
Software Dependencies No The paper states "all algorithms are implemented in Java", but it does not provide specific version numbers for Java itself or any other key software libraries, frameworks, or solvers that would be required to reproduce the experiments.
Experiment Setup Yes PL-MCTS can be parametrized in the preference kernel κ and the influence function γ. Using a bootstrapping based kernel for κ, one can configure the number, size, and aggregation statistic. The only reason to set the first two parameters low are resource limitations, and, using generous values of 1000 bootstraps (rankings) based on 100 samples per child each, we will see that we do not observe performance issues. As an aggregation function, we use the mean value of each bootstrap sample to derive rankings as discussed above as a default.