Evaluation beyond Task Performance: Analyzing Concepts in AlphaZero in Hex
Authors: Charles Lovering, Jessica Forde, George Konidaris, Ellie Pavlick, Michael Littman
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate AlphaZero's internal representations in the game of Hex using two evaluation techniques from natural language processing (NLP): model probing and behavioral tests. In doing so, we introduce new evaluation tools to the RL community, and illustrate how evaluations other than task performance can be used to provide a more complete picture of a model's strengths and weaknesses. The total compute across all experiments was about 24 GPU hours. |
| Researcher Affiliation | Academia | Charles Lovering, Jessica Zosa Forde, George Konidaris, Ellie Pavlick, Michael L. Littman; Department of Computer Science, Brown University; {first}_{last}@brown.edu |
| Pseudocode | No | The paper describes methods and processes in text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and results are publicly available [38]. Furthermore, we release example images of boards created for our probing classifiers and videos of the behavioral tests. The code, results and examples can be found at https://bit.ly/alphatology. Our repository is also available on GitHub at https://github.com/jzf2101/alphatology. |
| Open Datasets | Yes | Our code and results are publicly available [38]. Furthermore, we release example images of boards created for our probing classifiers and videos of the behavioral tests. The code, results and examples can be found at https://bit.ly/alphatology. |
| Dataset Splits | No | The paper describes training and evaluating probing classifiers and mentions 'test performance', implying a train/test split for these classifiers. However, it does not explicitly state the use of a validation split or specific percentages/counts for train/validation/test splits for either the AlphaZero agent training or the probing classifiers. |
| Hardware Specification | Yes | We used NVIDIA GeForce RTX 3090. The total compute across all experiments was about 24 GPU hours. |
| Software Dependencies | No | The paper mentions various models and techniques (Alpha Zero, MCTS, linear classifiers) but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | No | The paper states, 'We report hyperparameters in the Supplementary Material,' indicating that specific experimental setup details, such as hyperparameter values, are not provided within the main text. |
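The probing classifiers referenced throughout the table can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes synthetic feature vectors for AlphaZero's real internal activations, invents a binary "concept" label, and uses scikit-learn's `LogisticRegression` as the linear probe (all of these are assumptions made for the example).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the model's internal board representations:
# random 64-dim activation vectors for 1000 boards, with a binary concept
# label that is linearly decodable from the activations by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
w = rng.normal(size=64)
y = (X @ w > 0).astype(int)

# A held-out test split, as implied by the paper's reported test performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The probe itself: if a simple linear classifier can decode the concept
# from the activations, the concept is (linearly) represented in the model.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"probe test accuracy: {accuracy:.2f}")
```

Because the synthetic labels here are linearly separable by construction, the probe's test accuracy should be near 1.0; on real activations, the gap between probe accuracy and a baseline is what carries the evidence.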