Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Step Toward Quantifying Independently Reproducible Machine Learning Research
Authors: Edward Raff
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results. |
| Researcher Affiliation | Collaboration | Edward Raff; Booz Allen Hamilton; University of Maryland, Baltimore County |
| Pseudocode | No | The paper discusses the presence of pseudocode in the papers it analyzes but does not provide pseudocode for its own research methodology. |
| Open Source Code | No | The paper provides a link to its dataset, but does not provide a link or explicit statement about the availability of the source code for its statistical analysis or methodology. |
| Open Datasets | Yes | An anonymized version of the data can be found at https://github.com/EdwardRaff/Quantifying-Independently-Reproducible-ML. |
| Dataset Splits | No | The paper does not provide specific details about dataset splits (e.g., training, validation, test splits) for its own study's data analysis. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (CPU, GPU models, memory) used for its own statistical analysis or study. |
| Software Dependencies | Yes | JASP was used to compute all statistical tests [16]. |
| Experiment Setup | Yes | For each numeric feature (except the number of pages and number of authors), we normalized the value by the number of pages in the paper. For numeric features we used the non-parametric Mann-Whitney U [10] test to determine significance... For all categorical features, we used a Chi-Squared test [12] with continuity correction [13]. In our analysis we will also examine relationships between some of our categorical features and other numeric features for suspected relationships. We will continue to use nonparametric tests for robustness/conservative estimates of significance, relying on the Kruskal-Wallis [14] test for ANOVA testing and the Dunn test [15] for post-hoc analysis. |
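
The experiment setup quoted above can be illustrated with a minimal sketch. The paper reports using JASP, so the Python below is not the author's code. It assumes a normal approximation for the Mann-Whitney U p-value and a 2x2 table for the Chi-Squared test with (Yates) continuity correction. All function names and the toy inputs are hypothetical. The Kruskal-Wallis and Dunn tests are not sketched here.

```python
import math
from itertools import chain


def normalize_by_pages(value, pages):
    """Express a numeric feature (e.g. number of equations) as a per-page rate."""
    return value / pages


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test.

    Assumption: the p-value uses the normal approximation with tie
    correction and no continuity correction; JASP may compute it differently.
    Returns (U statistic of sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(chain(x, y))
    ranks, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # midrank of tied positions i+1..j
        t = j - i
        tie_term += t**3 - t
        i = j
    u = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # every observation is tied, so there is no evidence either way
        return u, 1.0
    z = (u - n1 * n2 / 2) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


def chi2_yates_2x2(a, b, c, d):
    """Chi-Squared test with continuity correction for the table [[a, b], [c, d]].

    Returns (statistic, p-value) with one degree of freedom.
    """
    n = a + b + c + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    stat = n * corrected**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    # Toy values only; these are not taken from the paper's dataset.
    reproduced = [normalize_by_pages(v, p) for v, p in [(12, 8), (30, 10), (5, 9)]]
    failed = [normalize_by_pages(v, p) for v, p in [(40, 8), (55, 10), (61, 12)]]
    print(mann_whitney_u(reproduced, failed))
    print(chi2_yates_2x2(10, 20, 30, 40))
```

For the numeric features, each paper's count is first divided by its page count, and the resulting per-page rates for the two outcome groups are passed to `mann_whitney_u`. A categorical feature with two levels reduces to the four cell counts passed to `chi2_yates_2x2`.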