Evaluation beyond Task Performance: Analyzing Concepts in AlphaZero in Hex
Authors: Charles Lovering, Jessica Forde, George Konidaris, Ellie Pavlick, Michael Littman
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate AlphaZero's internal representations in the game of Hex using two evaluation techniques from natural language processing (NLP): model probing and behavioral tests. In doing so, we introduce new evaluation tools to the RL community, and illustrate how evaluations other than task performance can be used to provide a more complete picture of a model's strengths and weaknesses. The total compute across all experiments was about 24 GPU hours. |
| Researcher Affiliation | Academia | Charles Lovering, Jessica Zosa Forde, George Konidaris, Ellie Pavlick, Michael L. Littman; Department of Computer Science, Brown University; {first}_{last}@brown.edu |
| Pseudocode | No | The paper describes methods and processes in text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and results are publicly available [38]. Furthermore, we release example images of boards created for our probing classifiers and videos of the behavioral tests. The code, results and examples can be found at https://bit.ly/alphatology. Our repository is also available on GitHub at https://github.com/jzf2101/alphatology. |
| Open Datasets | Yes | Our code and results are publicly available [38]. Furthermore, we release example images of boards created for our probing classifiers and videos of the behavioral tests. The code, results and examples can be found at https://bit.ly/alphatology. |
| Dataset Splits | No | The paper describes training and evaluating probing classifiers and mentions 'test performance', implying a train/test split for these classifiers. However, it does not explicitly state the use of a validation split or specific percentages/counts for train/validation/test splits for either the AlphaZero agent training or the probing classifiers. |
| Hardware Specification | Yes | We used NVIDIA GeForce RTX 3090. The total compute across all experiments was about 24 GPU hours. |
| Software Dependencies | No | The paper mentions various models and techniques (Alpha Zero, MCTS, linear classifiers) but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | No | The paper states, 'We report hyperparameters in the Supplementary Material,' indicating that specific experimental setup details, such as hyperparameter values, are not provided within the main text. |
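The probing classifiers referenced throughout the table can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes synthetic feature vectors for AlphaZero's real internal activations, invents a binary "concept" label, and uses scikit-learn's `LogisticRegression` as the linear probe (all of these are assumptions made for the example).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the model's internal board representations:
# random 64-dim activation vectors for 1000 boards, with a binary concept
# label that is linearly decodable from the activations by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
w = rng.normal(size=64)
y = (X @ w > 0).astype(int)

# A held-out test split, as implied by the paper's reported test performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The probe itself: if a simple linear classifier can decode the concept
# from the activations, the concept is (linearly) represented in the model.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"probe test accuracy: {accuracy:.2f}")
```

Because the synthetic labels here are linearly separable by construction, the probe's test accuracy should be near 1.0; on real activations, the gap between probe accuracy and a baseline is what carries the evidence.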