Policy-Based Self-Competition for Planning Problems

Authors: Jonathan Pirnay, Quirin Göttl, Jakob Burger, Dominik Gerhard Grimm

Venue: ICLR 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ. |
| Researcher Affiliation | Academia | Jonathan Pirnay (1,2), Quirin Göttl (1), Jakob Burger (1) & Dominik G. Grimm (1,2); (1) Technical University of Munich, Campus Straubing for Biotechnology and Sustainability; (2) Weihenstephan-Triesdorf University of Applied Sciences |
| Pseudocode | Yes | Algorithm 1: GAZ Play-to-Plan (GAZ PTP) Training ... Algorithm 2: GAZ Greedy Scalar Training |
| Open Source Code | Yes | Our code in PyTorch (Paszke et al., 2017) is available on https://github.com/grimmlab/policy-based-self-competition. |
| Open Datasets | Yes | We report the average performance of all GAZ variants on the 10,000 test instances of Kool et al. (2018)... We report results on instances of medium size 15×15, 20×20, and 30×20 of the well-known Taillard benchmark set (Taillard, 1993)... |
| Dataset Splits | Yes | During training, the model is evaluated periodically on a fixed validation set of 100 states to determine the final model. |
| Hardware Specification | No | The authors gratefully acknowledge the Leibniz Supercomputing Centre for providing computing time on its Linux-Cluster. |
| Software Dependencies | No | Our code in PyTorch (Paszke et al., 2017) is available on https://github.com/grimmlab/policy-based-self-competition. The paper mentions PyTorch but does not provide a specific version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We set the constants to c_visit = 50 and c_scale = 1.0... We set the self-play parameter to γ = 0.2. We use Adam (Kingma & Ba, 2014) as an optimizer, with a constant learning rate of 10⁻⁴, sampling batches of size 256 at each training step. Gradients are clipped to unit L2-norm. |
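For context, the optimizer details quoted in the Experiment Setup row, together with the periodic validation mentioned under Dataset Splits, correspond to a standard PyTorch training step. The following is a minimal, runnable sketch under stated assumptions: it only reproduces the quoted settings (Adam, constant learning rate 10⁻⁴, batch size 256, gradient clipping to unit L2-norm, best-model selection on a fixed validation set of 100 states); the network, batch data, evaluation metric, and evaluation interval are dummy placeholders, not the authors' GAZ PTP implementation.

```python
import torch
import torch.nn as nn

# Sketch of the training configuration quoted in the table above.
# The network, data, metric, and interval below are illustrative stand-ins.
LEARNING_RATE = 1e-4   # "constant learning rate of 10^-4"
BATCH_SIZE = 256       # "sampling batches of size 256 at each training step"
MAX_GRAD_NORM = 1.0    # "Gradients are clipped to unit L2-norm."
NUM_STEPS = 1_000      # illustrative
EVAL_INTERVAL = 100    # illustrative; the paper only says "periodically"

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # dummy network
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

validation_states = torch.randn(100, 16)  # "a fixed validation set of 100 states"
best_score = float("-inf")

for step in range(NUM_STEPS):
    # Dummy batch standing in for sampled training data.
    states = torch.randn(BATCH_SIZE, 16)
    targets = torch.randn(BATCH_SIZE, 1)

    loss = nn.functional.mse_loss(model(states), targets)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)  # unit L2-norm clipping
    optimizer.step()

    # Periodic evaluation on the fixed validation set "to determine the final model".
    if (step + 1) % EVAL_INTERVAL == 0:
        with torch.no_grad():
            score = -nn.functional.mse_loss(
                model(validation_states), torch.zeros(100, 1)
            ).item()  # dummy validation metric
        if score > best_score:
            best_score = score
            torch.save(model.state_dict(), "best_model.pt")
```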