Chess as a Testbed for Language Model State Tracking
Authors: Shubham Toshniwal, Sam Wiseman, Karen Livescu, Kevin Gimpel
AAAI 2022, pp. 11385-11393 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use the Millionbase dataset which is freely available and has close to 2.9 million quality chess games. ... Tables 5 and 6 show results when predicting starting squares and ending squares, respectively. |
| Researcher Affiliation | Academia | ¹Toyota Technological Institute at Chicago ²Duke University {shtoshni, klivescu, kgimpel}@ttic.edu, swiseman@cs.duke.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and data available at https://github.com/shtoshni/learning-chess-blindfolded |
| Open Datasets | Yes | We use the Millionbase dataset which is freely available and has close to 2.9 million quality chess games. Download link available at https://rebel13.nl/rebel13/rebel%2013.html |
| Dataset Splits | Yes | From this filtered set we randomly select 200K games for training, 15K games each for dev and test, and another 50K games to create board state probing evaluation sets described in Section 3.2. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. It only mentions the model architecture (GPT2-small) and training framework (PyTorch Lightning). |
| Software Dependencies | Yes | All experiments are carried out using the PyTorch Lightning framework built on top of PyTorch (Falcon et al. 2019; Paszke et al. 2019). We use the transformers library (Wolf et al. 2019) for all models except for the Performer model, for which we use a popular unofficial implementation. The Reformer implementation in the transformers library is still a work in progress. The presented results are with the 4.2.2 version. |
| Experiment Setup | Yes | Models are trained for 10 epochs with a batch size of 60. Validation is performed at the end of every epoch, and training stops whenever the validation loss starts increasing. For optimization we use Adam (Kingma and Ba 2014) with a learning rate of 5 × 10⁻⁴ and L2 weight decay of 0.01. The learning rate is warmed up linearly over the first 10% of training, followed by a linear decay. |
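
The Dataset Splits row quotes a 200K/15K/15K/50K random split of the filtered Millionbase games. Below is a minimal sketch of how such a split could be reproduced; the function name, seed, and the `filtered_games` input are assumptions for illustration, and the authors' released code at https://github.com/shtoshni/learning-chess-blindfolded is the authoritative reference.

```python
import random

def split_games(filtered_games, seed=42):
    """Hypothetical split mirroring the counts quoted in the paper."""
    rng = random.Random(seed)
    games = list(filtered_games)
    rng.shuffle(games)                   # random selection from the filtered set
    train = games[:200_000]              # 200K training games
    dev = games[200_000:215_000]         # 15K dev games
    test = games[215_000:230_000]        # 15K test games
    probing = games[230_000:280_000]     # 50K games for board-state probing sets
    return train, dev, test, probing
```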
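
The Experiment Setup row describes Adam with a 5 × 10⁻⁴ learning rate, 0.01 weight decay, and a linear warmup over the first 10% of training followed by linear decay. A minimal sketch of that schedule follows, assuming a GPT2-small model and an illustrative number of steps per epoch; the exact step counts and the chess-specific model configuration are not given in this section.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_linear_schedule_with_warmup

# GPT2-small architecture; the chess-specific vocabulary/config is omitted here.
model = GPT2LMHeadModel(GPT2Config())

# Adam with L2 weight decay, as quoted above.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=0.01)

num_epochs = 10
steps_per_epoch = 1_000                 # illustrative value, not from the paper
total_steps = num_epochs * steps_per_epoch

# Linear warmup over the first 10% of training, then linear decay.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# In the training loop, call optimizer.step() followed by scheduler.step() per batch,
# validating at the end of every epoch and stopping once validation loss increases.
```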