A Step Toward Quantifying Independently Reproducible Machine Learning Research
Authors: Edward Raff
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results. |
| Researcher Affiliation | Collaboration | Edward Raff; Booz Allen Hamilton (raff_edward@bah.com); University of Maryland, Baltimore County (raff.edward@umbc.edu) |
| Pseudocode | No | The paper discusses the presence of pseudocode in the papers it analyzes but does not provide pseudocode for its own research methodology. |
| Open Source Code | No | The paper provides a link to its dataset, but does not provide a link or explicit statement about the availability of the source code for its statistical analysis or methodology. |
| Open Datasets | Yes | An anonymized version of the data can be found at https://github.com/EdwardRaff/Quantifying-Independently-Reproducible-ML. |
| Dataset Splits | No | The paper does not provide specific details about dataset splits (e.g., training, validation, test splits) for its own study's data analysis. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (CPU, GPU models, memory) used for its own statistical analysis or study. |
| Software Dependencies | Yes | JASP was used to compute all statistical tests [16]. |
| Experiment Setup | Yes | For each numeric feature (except the number of pages and number of authors), we normalized the value by the number of pages in the paper. For numeric features we used the non-parametric Mann-Whitney U [10] test to determine significance... For all categorical features, we used a Chi-Squared test [12] with continuity correction [13]. In our analysis we will also examine relationships between some of our categorical features and other numeric features for suspected relationships. We will continue to use nonparametric tests for robustness/conservative estimates of significance, relying on the Kruskal-Wallis [14] for ANOVA testing and the Dunn test [15] for post-hoc analysis. |
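
The Experiment Setup row names three steps that can be illustrated in a few lines: per-page normalization, a Mann-Whitney U test for numeric features, and a Chi-Squared test with continuity correction for categorical features. The sketch below is not the paper's analysis, which was run in JASP. It is a minimal standard-library Python illustration under stated assumptions: the function names are hypothetical, the Mann-Whitney p-value uses the tie-corrected normal approximation without a continuity correction, and the Chi-Squared test handles only 2x2 tables with the Yates correction. The Kruskal-Wallis and Dunn tests are left out.

```python
import math


def normalize_by_pages(count, pages):
    """Per-page rate of a numeric feature (hypothetical helper)."""
    return count / pages


def average_ranks(values):
    """Map each distinct value to its average rank; also return tie-group sizes."""
    order = sorted(values)
    ranks, tie_sizes = {}, []
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and order[j] == order[i]:
            j += 1
        # 0-based positions i..j-1 share the mean of ranks i+1..j
        ranks[order[i]] = (i + 1 + j) / 2
        tie_sizes.append(j - i)
        i = j
    return ranks, tie_sizes


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Assumption: tie-corrected variance, no continuity correction.
    Returns (U statistic of x, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in tie_sizes)
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        raise ValueError("all observations are identical; test is undefined")
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


def chi2_yates_2x2(table):
    """Chi-Squared test with Yates continuity correction for a 2x2 table.

    Returns (statistic, p-value) with one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for r, row in enumerate(table):
        for k, observed in enumerate(row):
            expected = row_totals[r] * col_totals[k] / n
            diff = max(0.0, abs(observed - expected) - 0.5)
            stat += diff**2 / expected
    # Survival function of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Made-up illustrative numbers, not data from the paper.
    reproduced = [normalize_by_pages(c, p) for c, p in [(12, 8), (20, 10), (9, 9)]]
    not_reproduced = [normalize_by_pages(c, p) for c, p in [(40, 8), (55, 10), (30, 6)]]
    print(mann_whitney_u(reproduced, not_reproduced))
    print(chi2_yates_2x2([[10, 20], [20, 10]]))
```

With samples this small, an exact test would normally be preferred over the normal approximation; the approximation is used here only to keep the sketch dependency-free.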