Policy-Based Self-Competition for Planning Problems
Authors: Jonathan Pirnay, Quirin Göttl, Jakob Burger, Dominik Gerhard Grimm
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ. |
| Researcher Affiliation | Academia | Jonathan Pirnay (1,2), Quirin Göttl (1), Jakob Burger (1) & Dominik G. Grimm (1,2). 1: Technical University of Munich, Campus Straubing for Biotechnology and Sustainability; 2: Weihenstephan-Triesdorf University of Applied Sciences |
| Pseudocode | Yes | Algorithm 1: GAZ Play-to-Plan (GAZ PTP) Training ... Algorithm 2: GAZ Greedy Scalar Training |
| Open Source Code | Yes | Our code in PyTorch (Paszke et al., 2017) is available on https://github.com/grimmlab/policy-based-self-competition. |
| Open Datasets | Yes | We report the average performance of all GAZ variants on the 10,000 test instances of Kool et al. (2018)... We report results on instances of medium size 15×15, 20×20, and 30×20 of the well-known Taillard benchmark set (Taillard, 1993)... |
| Dataset Splits | Yes | During training, the model is evaluated periodically on a fixed validation set of 100 states to determine the final model. |
| Hardware Specification | No | The authors gratefully acknowledge the Leibniz Supercomputing Centre for providing computing time on its Linux-Cluster. |
| Software Dependencies | No | Our code in PyTorch (Paszke et al., 2017) is available on https://github.com/grimmlab/policy-based-self-competition. The paper mentions PyTorch but does not provide a specific version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We set the constants to c_visit = 50 and c_scale = 1.0... We set the self-play parameter to γ = 0.2. We use Adam (Kingma & Ba, 2014) as an optimizer, with a constant learning rate of 10^-4, sampling batches of size 256 at each training step. Gradients are clipped to unit L2-norm. |
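
Based on the experiment setup quoted above, a minimal PyTorch sketch of the described optimizer configuration (Adam with a constant learning rate of 10^-4, batch size 256, gradients clipped to unit L2-norm) might look as follows. The model and data here are placeholders rather than the authors' GAZ PTP network; the actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder network; the actual GAZ PTP architecture is defined in the authors' repository.
model = nn.Linear(128, 128)

# Adam (Kingma & Ba, 2014) with a constant learning rate of 1e-4, as stated in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on a dummy batch of size 256 (illustrative data only).
inputs = torch.randn(256, 128)
targets = torch.randn(256, 128)
loss = F.mse_loss(model(inputs), targets)

optimizer.zero_grad()
loss.backward()
# Clip gradients to unit L2-norm before the optimizer step.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

The search constants (c_visit = 50, c_scale = 1.0) and the self-play parameter γ = 0.2 enter the tree search and training loop rather than the optimizer, so they are omitted from this sketch.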