Chess as a Testbed for Language Model State Tracking

Authors: Shubham Toshniwal, Sam Wiseman, Karen Livescu, Kevin Gimpel (pp. 11385-11393)

AAAI 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use the Millionbase dataset which is freely available and has close to 2.9 million quality chess games. ... Tables 5 and 6 show results when predicting starting squares and ending squares, respectively. |
| Researcher Affiliation | Academia | ¹Toyota Technological Institute at Chicago, ²Duke University. {shtoshni, klivescu, kgimpel}@ttic.edu, swiseman@cs.duke.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and data available at https://github.com/shtoshni/learning-chess-blindfolded |
| Open Datasets | Yes | We use the Millionbase dataset which is freely available and has close to 2.9 million quality chess games. Download link available at https://rebel13.nl/rebel13/rebel%2013.html |
| Dataset Splits | Yes | From this filtered set we randomly select 200K games for training, 15K games each for dev and test, and another 50K games to create board state probing evaluation sets described in Section 3.2. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. It only mentions the model architecture (GPT2-small) and training framework (PyTorch Lightning). |
| Software Dependencies | Yes | All experiments are carried out using the PyTorch Lightning framework built on top of PyTorch (Falcon et al. 2019; Paszke et al. 2019). We use the transformers library (Wolf et al. 2019) for all models except for the Performer model, for which we use a popular unofficial implementation. Reformer implementation in the transformers library is still a work in progress. The presented results are with the 4.2.2 version. |
| Experiment Setup | Yes | Models are trained for 10 epochs with a batch size of 60. Validation is performed at the end of every epoch and training stops whenever the validation loss starts increasing. For optimization we use Adam (Kingma and Ba 2014) with a learning rate of 5×10⁻⁴ and L2 weight decay of 0.01. The learning rate is warmed up linearly over the first 10% of training followed by a linear decay. |
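The dataset-splits row reports a random partition of the filtered game set into 200K train / 15K dev / 15K test / 50K probing games. A minimal sketch of such a partition is below; the function name `split_games` and the fixed seed are hypothetical conveniences, not from the paper, which does not describe its shuffling procedure.

```python
import random

def split_games(games, seed=42):
    """Randomly partition games into the split sizes reported in the paper
    (200K train, 15K dev, 15K test, 50K board-state probing)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    games = list(games)
    rng.shuffle(games)
    return {
        "train": games[:200_000],
        "dev": games[200_000:215_000],
        "test": games[215_000:230_000],
        "probing": games[230_000:280_000],
    }

# Example with synthetic game IDs; the paper's filtered set is larger,
# so any surplus games would simply be left out of the four splits.
splits = split_games(range(280_000))
print({k: len(v) for k, v in splits.items()})
# {'train': 200000, 'dev': 15000, 'test': 15000, 'probing': 50000}
```

Fixing the seed makes the partition deterministic, which matters here because the probing evaluation sets must stay disjoint from the training games across reruns.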
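The experiment-setup row describes the learning-rate schedule: linear warmup over the first 10% of training steps, then linear decay. A minimal pure-Python sketch of that schedule is below; the function name `lr_schedule` and the step-wise formulation are assumptions for illustration, since the paper's actual implementation runs through PyTorch Lightning and may differ in detail (e.g., decaying to a nonzero floor).

```python
def lr_schedule(step, total_steps, base_lr=5e-4, warmup_frac=0.10):
    """Learning rate at a given step: linear warmup over the first
    `warmup_frac` of training, then linear decay toward zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # ramp from ~0 up to base_lr over the warmup phase
        return base_lr * (step + 1) / warmup_steps
    # decay linearly from base_lr at the end of warmup down to 0
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# With 1000 total steps, warmup covers the first 100:
print(lr_schedule(99, 1000))   # 0.0005 (peak, end of warmup)
print(lr_schedule(550, 1000))  # 0.00025 (halfway through the decay)
```

The schedule is continuous at the warmup/decay boundary (both phases meet at `base_lr`), matching the shape commonly produced by linear-warmup schedulers in transformer training.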