Efficient Learning for AlphaZero via Path Consistency
Authors: Dengwei Zhao, Shikui Tu, Lei Xu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments also demonstrate the efficiency of PCZero under offline learning setting. Taking Hex, Othello, and Gomoku as examples, the advantage of PCZero will be investigated in both offline and online learning. |
| Researcher Affiliation | Academia | Dengwei Zhao¹, Shikui Tu¹, Lei Xu¹. ¹Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Shikui Tu <tushikui@sjtu.edu.cn>, Lei Xu <leixu@sjtu.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 v estimation for a terminated sequence; Algorithm 2 MCTS-PCZero self-play; Algorithm 3 Heuristic Path |
| Open Source Code | Yes | The source codes are available at https://github.com/CMACH508/PCZero. |
| Open Datasets | Yes | For Hex, expert dataset is collected by the self-play of MoHex 2.0 (Gao et al., 2018), containing 50K, 101K and 18K games for 8×8, 9×9 and 13×13 Hex respectively. WThor and RenjuNet are adopted as the expert dataset for Othello and Gomoku, containing 126K and 70K games respectively. |
| Dataset Splits | No | The paper states: "Those datasets are divided into training set and test set randomly and the proportion of test set is 20%." It does not explicitly mention a separate validation set split. |
| Hardware Specification | Yes | We use 8 GeForce RTX 2080Ti GPU and Intel(R) Xeon(R) Gold 6130 CPU with 125G RAM to do self-play. A single GTX 1050Ti GPU and Intel i7 8750H CPU with 16 GB RAM are used to test. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were listed. |
| Experiment Setup | Yes | During the self-play, MCTS runs 400 simulations to select moves and 1000 games are played in each iteration. For the first 200 epochs, the learning rate r is 0.01 and temperature parameter τ is 0.8. In the following 200 epochs, r = 0.001 and τ = 0.6. For the rest 500 epochs, r = 0.0001 and τ = 0.2. ... λ = 3.0 and l = k = 5. ... c_puct = 1.5 and the search procedure is the same with AlphaZero (Silver et al., 2018). |
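
The three-phase schedule quoted in the Experiment Setup row can be written down as a small lookup. This is a minimal sketch, not code from the PCZero repository: the names `Phase`, `SCHEDULE`, and `hyperparams_at` are hypothetical, and it assumes epochs are counted from 0 and that the three phases run back to back for 200 + 200 + 500 = 900 epochs in total.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phase:
    """One stage of the schedule (hypothetical container, not from the paper's code)."""
    epochs: int           # number of epochs this stage lasts
    learning_rate: float  # r in the quoted setup
    temperature: float    # tau in the quoted setup


# Values taken from the quoted Experiment Setup row.
SCHEDULE = (
    Phase(epochs=200, learning_rate=0.01, temperature=0.8),
    Phase(epochs=200, learning_rate=0.001, temperature=0.6),
    Phase(epochs=500, learning_rate=0.0001, temperature=0.2),
)

# Other self-play settings quoted in the same row.
MCTS_SIMULATIONS = 400      # simulations per move selection
GAMES_PER_ITERATION = 1000  # self-play games per iteration
C_PUCT = 1.5                # exploration constant in the search


def hyperparams_at(epoch: int) -> Tuple[float, float]:
    """Return (learning_rate, temperature) for a 0-based epoch index (assumed indexing)."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    start = 0
    for phase in SCHEDULE:
        if epoch < start + phase.epochs:
            return phase.learning_rate, phase.temperature
        start += phase.epochs
    raise ValueError("epoch is beyond the end of the schedule")
```

For example, `hyperparams_at(199)` falls in the first phase and `hyperparams_at(200)` in the second. Keeping the phases in one tuple means the epoch boundaries are derived from the phase lengths rather than hard-coded separately.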