Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Designing Skill-Compatible AI: Methodologies and Frameworks in Chess
Authors: Karim Hamade, Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, Ashton Anderson
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility. |
| Researcher Affiliation | Collaboration | Karim Hamade Reid McIlroy-Young Siddhartha Sen EMAIL EMAIL EMAIL University of Toronto University of Toronto Microsoft Research Jon Kleinberg Ashton Anderson EMAIL EMAIL Cornell University University of Toronto |
| Pseudocode | No | The paper does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Our code is released at github.com/CSSLab/skill-compatibility-chess. We also include several of our trained models. |
| Open Datasets | No | The paper states maia was trained on games from lichess.org, an open-source platform. However, it does not provide a direct link, DOI, specific repository for the *dataset* used, or a formal citation of the dataset itself, only the platform source. |
| Dataset Splits | Yes | To create att, a dataset of 10000 games (80% train, 10% validate, and 10% test) is generated of the following game leela maia leela maia for STT or leela maia leela maia for HB. |
| Hardware Specification | Yes | We made use of four Tesla K80 GPUs for the purpose of experimentation, each with a VRAM of 12 GB. |
| Software Dependencies | Yes | Against stockfish 13 (60k nodes), a strong classical engine that uses alpha-beta search, this version of leela obtains a score of 59 3. |
| Experiment Setup | Yes | To create att, a dataset of 10000 games (80% train, 10% validate, and 10% test) is generated of the following game leela maia leela maia for STT or leela maia leela maia for HB. Then, starting with leela's weights, and using a learning rate of 10⁻⁵, and 10000 iterations, we run back-propagation to update leela's policy and value neural network. |
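
The 80% / 10% / 10% split quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `split_games`, the fixed seed, and the use of integer IDs in place of real game records are all assumptions made for illustration.

```python
import random


def split_games(games, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle games and split them into train/validation/test subsets.

    Hypothetical helper: the fractions match the 80/10/10 split quoted
    above, but the shuffling and seeding choices are assumptions.
    """
    shuffled = list(games)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for a given seed

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Placeholder integer IDs stand in for the 10000 generated games.
train, val, test = split_games(range(10000))
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets stable across runs, which matters when results on the test subset are compared between training runs.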