Re-evaluating evaluation
Authors: David Balduzzi, Karl Tuyls, Julien Perolat, Thore Graepel
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. |
| Researcher Affiliation | Industry | DeepMind. Email: { dbalduzzi | karltuyls | perolat | thore }@google.com |
| Pseudocode | Yes | Proofs and code are in the appendix. Section E contains Algorithm 1 (mElo Update) and Algorithm 2 (Nash Evaluation Update); illustrative sketches of both appear after the table. |
| Open Source Code | Yes | Proofs and code are in the appendix. |
| Open Datasets | Yes | We compared the predictive capabilities of Elo and the simplest extension mElo2 on eight Go algorithms taken from extended data table 9 in [24]: seven variants of AlphaGo, and Zen. To illustrate the method, we re-evaluate the performance of agents on Atari [2]. Data is taken from results published in [67 70]. |
| Dataset Splits | No | The paper evaluates existing results on Go and Atari and does not describe specific train/validation/test splits used for its own analysis or model training. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments or analysis. |
| Software Dependencies | No | The paper mentions using an "LP-solver" but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | No | The paper describes the meta-game and how Nash equilibria are found, stating "We find a Nash equilibrium using an LP-solver" (a sketch of this step appears after the table). However, it does not provide experimental setup details such as hyperparameters (e.g., learning rates, batch sizes, number of epochs) or configuration settings for the models and algorithms behind the reported results. |
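
The mElo update named in the Pseudocode row extends Elo ratings with low-dimensional cyclic components so that non-transitive interactions (A beats B beats C beats A) can be modelled. The following Python sketch is a gradient-style rendering of that update rule, not the authors' released code; the function names, learning rates, and the choice of NumPy are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def melo_update(r, c, i, j, outcome, eta_r=0.1, eta_c=0.1):
    """One online mElo2k-style update after agent i plays agent j.

    r       : (n,) array of Elo-style ratings.
    c       : (n, 2k) array of cyclic-component vectors.
    outcome : 1.0 if i won, 0.0 if i lost, 0.5 for a draw.
    eta_r, eta_c : learning rates (illustrative values, not from the paper).
    """
    two_k = c.shape[1]
    # Omega is a fixed antisymmetric pairing matrix built from 2x2 rotation blocks.
    omega = np.zeros((two_k, two_k))
    for a in range(0, two_k, 2):
        omega[a, a + 1] = 1.0
        omega[a + 1, a] = -1.0

    # Predicted win probability: transitive (Elo) term plus cyclic term.
    p_hat = sigmoid(r[i] - r[j] + c[i] @ omega @ c[j])
    delta = outcome - p_hat

    # Online logistic-regression gradient steps on ratings and cyclic vectors.
    c_i_old = c[i].copy()
    r[i] += eta_r * delta
    r[j] -= eta_r * delta
    c[i] += eta_c * delta * (omega @ c[j])
    c[j] += eta_c * delta * (omega.T @ c_i_old)
    return r, c
```

With cyclic vectors of width two (k = 1), this corresponds to the mElo2 variant referenced in the Open Datasets row.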
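Likewise, the Nash evaluation step quoted in the Experiment Setup row reduces to a linear program over the antisymmetric meta-game. A minimal sketch, assuming `scipy.optimize.linprog` as the LP solver, is given below; the paper selects the maximum-entropy Nash equilibrium, whereas this sketch returns whichever maximin solution the solver happens to find.

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(payoff):
    """Maximin (Nash) mixture for the row player of a zero-sum matrix game.

    payoff : (n, n) array; in the paper's meta-games, payoff[i, j] is agent i's
    expected advantage over agent j, the matrix is antisymmetric, and the game
    value is 0, so the same mixture serves both players.
    """
    n = payoff.shape[0]
    # Decision variables z = (x_1, ..., x_n, v); maximize v <=> minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every opponent column j:  v - sum_i x_i * payoff[i, j] <= 0.
    a_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The mixture must sum to one; x is non-negative, v is unconstrained.
    a_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]
```

On a rock-paper-scissors-style 3-cycle, for instance, the returned mixture is approximately uniform with value 0, which is the kind of cycle-aware weighting that Nash evaluation is designed to capture.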