Enhancing Chess Reinforcement Learning with Graph Representation

Authors: Tomas Rigaux, Hisashi Kashima

NeurIPS 2024

Reproducibility assessment (variable, result, and LLM response):
Research Type: Experimental. Our experiments, performed on smaller networks than the initial AlphaZero paper, show that this new architecture outperforms previous architectures with a similar number of parameters, being able to increase playing strength an order of magnitude faster. We also show that the model, when trained on a smaller 5x5 variant of chess, is able to be quickly fine-tuned to play on regular 8x8 chess, suggesting that this approach yields promising generalization abilities.
Researcher Affiliation: Academia. Tomas Rigaux, Kyoto University, Kyoto, Japan, tomas@rigaux.com; Hisashi Kashima, Kyoto University, Kyoto, Japan, kashima@i.kyoto-u.ac.jp
Pseudocode: Yes. Algorithm 1: Self-Play Training (a hedged sketch of such a self-play loop follows this list).
Open Source Code: Yes. Our code is available at https://github.com/akulen/AlphaGateau.
Open Datasets: No. The paper describes using self-play to generate data on 8x8 and 5x5 chess variants. It does not explicitly state the use of a pre-existing, publicly available dataset with concrete access information (link, DOI, citation).
Dataset Splits: No. The paper does not explicitly provide training/test/validation dataset splits. Data is generated through self-play, and models are evaluated by playing games against each other to estimate Elo ratings (a sketch of one possible Elo-fitting procedure follows this list).
Hardware Specification: Yes. All our models were trained using multiple Nvidia RTX A5000 GPUs (the learning-speed experiments used 8 and the fine-tuning experiments used 6), and their Elo ratings were estimated using 6 of those GPUs.
Software Dependencies: No. The paper mentions software such as JAX, PGX, Aim, and statsmodels, but does not provide specific version numbers for these dependencies.
Experiment Setup: Yes. All models used in these experiments are trained with the Adam optimizer [12] with a learning rate of 0.001. All feature vectors have an embedding dimension of 128. The loss function is the same as for the original AlphaZero, which is, for $f_\theta(s) = (\hat{\pi}, \hat{v})$, $L(\pi, v, \hat{\pi}, \hat{v}) = -\pi^\top \log(\hat{\pi}) + (v - \hat{v})^2$. For our experiments, an iteration consists of generating 256 games through self-play, then doing one epoch of training, split into 3904 mini-batches of size 256, once the frame window is full after 7 iterations (a JAX sketch of this loss and optimizer setup follows this list).
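
To make the reported self-play procedure concrete, here is a minimal, hedged sketch of an AlphaZero-style self-play training loop in Python. Only the iteration structure is taken from the text above (256 self-play games per iteration, a frame window filled after 7 iterations); the game, network, and training stubs (play_one_game, train_one_epoch) are illustrative placeholders, not the paper's Algorithm 1.

```python
# Hedged sketch of an AlphaZero-style self-play training loop.
# The game, network, and training functions are dummies so the sketch runs end to end.
import random
from collections import deque

GAMES_PER_ITER = 256   # self-play games generated per iteration (from the paper)
WINDOW_ITERS = 7       # iterations kept in the training frame window (from the paper)

def play_one_game(params):
    """Play one self-play game and return (state, target_policy, outcome) frames."""
    frames = []
    for ply in range(10):
        state = (ply,)                  # placeholder position encoding
        target_policy = [0.5, 0.5]      # placeholder MCTS visit distribution
        frames.append([state, target_policy, None])
    outcome = random.choice([-1, 1])    # final game result
    for frame in frames:
        frame[2] = outcome              # propagate the result to every frame
    return frames

def train_one_epoch(params, frames, batch_size=256):
    """Placeholder for one epoch of mini-batch gradient updates over the window."""
    return params                       # a real version returns updated parameters

params = object()                       # stand-in for the network parameters
window = deque(maxlen=WINDOW_ITERS)     # sliding window of recent iterations
for iteration in range(10):
    games = [play_one_game(params) for _ in range(GAMES_PER_ITER)]
    window.append([frame for game in games for frame in game])
    params = train_one_epoch(params, [f for it in window for f in it])
```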
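The evaluation protocol estimates Elo ratings from head-to-head games rather than from a held-out dataset. The sketch below shows one plausible way to fit such ratings by gradient ascent on the Elo/Bradley-Terry likelihood; the paper mentions statsmodels, so its exact fitting procedure likely differs, and fit_elo, expected_score, and the hyperparameters here are assumptions made for illustration.

```python
# Hedged sketch: fit Elo ratings from a list of (winner, loser) game results.
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def fit_elo(results, n_players, lr=10.0, steps=2000):
    """results: list of (winner_index, loser_index) pairs."""
    ratings = [0.0] * n_players
    for _ in range(steps):
        grads = [0.0] * n_players
        for w, l in results:
            p = expected_score(ratings[w], ratings[l])
            grads[w] += 1.0 - p          # push the winner's rating up
            grads[l] -= 1.0 - p          # and the loser's rating down
        ratings = [r + lr * g / len(results) for r, g in zip(ratings, grads)]
        mean = sum(ratings) / n_players  # anchor ratings to mean zero
        ratings = [r - mean for r in ratings]
    return ratings

# Toy usage: player 0 beats player 1 in 7 of 10 games.
games = [(0, 1)] * 7 + [(1, 0)] * 3
print(fit_elo(games, n_players=2))
```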
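Finally, a minimal JAX sketch of the loss and optimizer configuration described in the experiment setup. The network, its parameters, and the batch shapes are placeholders (the paper's graph architecture is not reproduced here); only the loss form and the Adam learning rate come from the text above.

```python
# Hedged sketch of the AlphaZero-style loss and Adam setup described above.
import jax
import jax.numpy as jnp
import optax

def alphazero_loss(policy_logits, value_pred, target_policy, target_value):
    """L(pi, v, pi_hat, v_hat) = -pi^T log(pi_hat) + (v - v_hat)^2, batch-averaged."""
    log_probs = jax.nn.log_softmax(policy_logits, axis=-1)
    policy_loss = -jnp.sum(target_policy * log_probs, axis=-1)
    value_loss = (target_value - value_pred) ** 2
    return jnp.mean(policy_loss + value_loss)

params = {"w": jnp.zeros((3, 3))}            # placeholder network parameters
optimizer = optax.adam(learning_rate=1e-3)   # Adam with learning rate 0.001
opt_state = optimizer.init(params)

# Toy batch: 4 positions with 8 candidate moves each.
logits = jnp.zeros((4, 8))
values = jnp.zeros((4,))
target_pi = jnp.full((4, 8), 1.0 / 8)
target_v = jnp.ones((4,))
print(alphazero_loss(logits, values, target_pi, target_v))  # ~log(8) + 1
```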