Chess as a Testbed for Language Model State Tracking

Authors: Shubham Toshniwal, Sam Wiseman, Karen Livescu, Kevin Gimpel (pp. 11385-11393)

AAAI 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use the Millionbase dataset which is freely available and has close to 2.9 million quality chess games. ... Tables 5 and 6 show results when predicting starting squares and ending squares, respectively. |
| Researcher Affiliation | Academia | ¹Toyota Technological Institute at Chicago, ²Duke University. {shtoshni, klivescu, kgimpel}@ttic.edu, swiseman@cs.duke.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and data available at https://github.com/shtoshni/learning-chess-blindfolded |
| Open Datasets | Yes | We use the Millionbase dataset which is freely available and has close to 2.9 million quality chess games. Download link available at https://rebel13.nl/rebel13/rebel%2013.html |
| Dataset Splits | Yes | From this filtered set we randomly select 200K games for training, 15K games each for dev and test, and another 50K games to create board state probing evaluation sets described in Section 3.2. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. It only mentions the model architecture (GPT2-small) and training framework (PyTorch Lightning). |
| Software Dependencies | Yes | All experiments are carried out using the PyTorch Lightning framework built on top of PyTorch (Falcon et al. 2019; Paszke et al. 2019). We use the transformers library (Wolf et al. 2019) for all models except for the Performer model, for which we use a popular unofficial implementation. Reformer implementation in the transformers library is still a work in progress. The presented results are with the 4.2.2 version. |
| Experiment Setup | Yes | Models are trained for 10 epochs with a batch size of 60. Validation is performed at the end of every epoch and training stops whenever the validation loss starts increasing. For optimization we use Adam (Kingma and Ba 2014) with a learning rate of 5×10⁻⁴ and L2 weight decay of 0.01. The learning rate is warmed up linearly over the first 10% of training followed by a linear decay. |
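The dataset-splits row reports a random partition of the filtered game set into 200K train / 15K dev / 15K test / 50K probing games. A minimal sketch of such a partition is below; the function name `split_games` and the fixed seed are hypothetical conveniences, not from the paper, which does not describe its shuffling procedure.

```python
import random

def split_games(games, seed=42):
    """Randomly partition games into the split sizes reported in the paper
    (200K train, 15K dev, 15K test, 50K board-state probing)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    games = list(games)
    rng.shuffle(games)
    return {
        "train": games[:200_000],
        "dev": games[200_000:215_000],
        "test": games[215_000:230_000],
        "probing": games[230_000:280_000],
    }

# Example with synthetic game IDs; the paper's filtered set is larger,
# so any surplus games would simply be left out of the four splits.
splits = split_games(range(280_000))
print({k: len(v) for k, v in splits.items()})
# {'train': 200000, 'dev': 15000, 'test': 15000, 'probing': 50000}
```

Fixing the seed makes the partition deterministic, which matters here because the probing evaluation sets must stay disjoint from the training games across reruns.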
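The experiment-setup row describes the learning-rate schedule: linear warmup over the first 10% of training steps, then linear decay. A minimal pure-Python sketch of that schedule is below; the function name `lr_schedule` and the step-wise formulation are assumptions for illustration, since the paper's actual implementation runs through PyTorch Lightning and may differ in detail (e.g., decaying to a nonzero floor).

```python
def lr_schedule(step, total_steps, base_lr=5e-4, warmup_frac=0.10):
    """Learning rate at a given step: linear warmup over the first
    `warmup_frac` of training, then linear decay toward zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # ramp from ~0 up to base_lr over the warmup phase
        return base_lr * (step + 1) / warmup_steps
    # decay linearly from base_lr at the end of warmup down to 0
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# With 1000 total steps, warmup covers the first 100:
print(lr_schedule(99, 1000))   # 0.0005 (peak, end of warmup)
print(lr_schedule(550, 1000))  # 0.00025 (halfway through the decay)
```

The schedule is continuous at the warmup/decay boundary (both phases meet at `base_lr`), matching the shape commonly produced by linear-warmup schedulers in transformer training.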