Re-evaluating evaluation
Authors: David Balduzzi, Karl Tuyls, Julien Perolat, Thore Graepel
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. |
| Researcher Affiliation | Industry | DeepMind. Email: { dbalduzzi | karltuyls | perolat | thore }@google.com |
| Pseudocode | Yes | Proofs and code are in the appendix. Section E contains Algorithm 1 (mElo Update) and Algorithm 2 (Nash Evaluation Update); illustrative sketches of both appear after the table. |
| Open Source Code | Yes | Proofs and code are in the appendix. |
| Open Datasets | Yes | We compared the predictive capabilities of Elo and the simplest extension mElo2 on eight Go algorithms taken from extended data table 9 in [24]: seven variants of AlphaGo, and Zen. To illustrate the method, we re-evaluate the performance of agents on Atari [2]. Data is taken from results published in [67 70]. |
| Dataset Splits | No | The paper evaluates existing results on Go and Atari and does not describe specific train/validation/test splits used for its own analysis or model training. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments or analysis. |
| Software Dependencies | No | The paper mentions using an "LP-solver" but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | No | The paper describes the meta-game and how Nash equilibria are found, stating "We find a Nash equilibrium using an LP-solver" (a sketch of this step appears after the table). However, it does not provide experimental setup details such as hyperparameters (e.g., learning rates, batch sizes, number of epochs) or configuration settings for the models and algorithms behind the reported results. |
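
The mElo update named in the Pseudocode row extends Elo ratings with low-dimensional cyclic components so that non-transitive interactions (A beats B beats C beats A) can be modelled. The following Python sketch is a gradient-style rendering of that update rule, not the authors' released code; the function names, learning rates, and the choice of NumPy are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def melo_update(r, c, i, j, outcome, eta_r=0.1, eta_c=0.1):
    """One online mElo2k-style update after agent i plays agent j.

    r       : (n,) array of Elo-style ratings.
    c       : (n, 2k) array of cyclic-component vectors.
    outcome : 1.0 if i won, 0.0 if i lost, 0.5 for a draw.
    eta_r, eta_c : learning rates (illustrative values, not from the paper).
    """
    two_k = c.shape[1]
    # Omega is a fixed antisymmetric pairing matrix built from 2x2 rotation blocks.
    omega = np.zeros((two_k, two_k))
    for a in range(0, two_k, 2):
        omega[a, a + 1] = 1.0
        omega[a + 1, a] = -1.0

    # Predicted win probability: transitive (Elo) term plus cyclic term.
    p_hat = sigmoid(r[i] - r[j] + c[i] @ omega @ c[j])
    delta = outcome - p_hat

    # Online logistic-regression gradient steps on ratings and cyclic vectors.
    c_i_old = c[i].copy()
    r[i] += eta_r * delta
    r[j] -= eta_r * delta
    c[i] += eta_c * delta * (omega @ c[j])
    c[j] += eta_c * delta * (omega.T @ c_i_old)
    return r, c
```

With cyclic vectors of width two (k = 1), this corresponds to the mElo2 variant referenced in the Open Datasets row.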
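Likewise, the Nash evaluation step quoted in the Experiment Setup row reduces to a linear program over the antisymmetric meta-game. A minimal sketch, assuming `scipy.optimize.linprog` as the LP solver, is given below; the paper selects the maximum-entropy Nash equilibrium, whereas this sketch returns whichever maximin solution the solver happens to find.

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(payoff):
    """Maximin (Nash) mixture for the row player of a zero-sum matrix game.

    payoff : (n, n) array; in the paper's meta-games, payoff[i, j] is agent i's
    expected advantage over agent j, the matrix is antisymmetric, and the game
    value is 0, so the same mixture serves both players.
    """
    n = payoff.shape[0]
    # Decision variables z = (x_1, ..., x_n, v); maximize v <=> minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every opponent column j:  v - sum_i x_i * payoff[i, j] <= 0.
    a_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The mixture must sum to one; x is non-negative, v is unconstrained.
    a_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]
```

On a rock-paper-scissors-style 3-cycle, for instance, the returned mixture is approximately uniform with value 0, which is the kind of cycle-aware weighting that Nash evaluation is designed to capture.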