A Step Toward Quantifying Independently Reproducible Machine Learning Research

Authors: Edward Raff

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results.
Researcher Affiliation | Collaboration | Edward Raff; Booz Allen Hamilton (raff_edward@bah.com); University of Maryland, Baltimore County (raff.edward@umbc.edu)
Pseudocode | No | The paper discusses the presence of pseudocode in the papers it analyzes but does not provide pseudocode for its own research methodology.
Open Source Code | No | The paper links to its dataset but gives no link to, or explicit statement about, source code for its statistical analysis or methodology.
Open Datasets | Yes | An anonymized version of the data can be found at https://github.com/EdwardRaff/Quantifying-Independently-Reproducible-ML.
Dataset Splits | No | The paper does not describe dataset splits (e.g., training, validation, test) for its own data analysis.
Hardware Specification | No | The paper does not describe the hardware (CPU, GPU models, memory) used for its own statistical analysis.
Software Dependencies | Yes | JASP was used to compute all statistical tests [16].
Experiment Setup | Yes | For each numeric feature (except the number of pages and number of authors), we normalized the value by the number of pages in the paper. For numeric features we used the non-parametric Mann-Whitney U [10] test to determine significance... For all categorical features, we used a Chi-Squared test [12] with continuity correction [13]. In our analysis we will also examine relationships between some of our categorical features and other numeric features for suspected relationships. We will continue to use nonparametric tests for robustness/conservative estimates of significance, relying on the Kruskal-Wallis test [14] for ANOVA testing and the Dunn test [15] for post-hoc analysis.
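
The tests named in the Experiment Setup row can be illustrated with a short Python sketch. The paper reports using JASP, not Python, so SciPy is a swapped-in tool here. Every data value and variable name below is hypothetical and is not taken from the paper's dataset. The Dunn post-hoc test is not included in this sketch.

```python
from scipy import stats

# Hypothetical toy records (NOT the paper's data):
# (count of some numeric feature, number of pages, was the paper reproduced?)
papers = [
    (12, 8, True), (20, 10, True), (9, 6, True), (30, 12, True),
    (4, 8, False), (5, 10, False), (2, 9, False), (6, 12, False),
]


def per_page(count, pages):
    """Normalize a numeric feature by the paper's page count."""
    return count / pages


reproduced = [per_page(c, p) for c, p, ok in papers if ok]
failed = [per_page(c, p) for c, p, ok in papers if not ok]

# Numeric feature vs. reproduced / not reproduced: Mann-Whitney U test.
u_stat, u_p = stats.mannwhitneyu(reproduced, failed, alternative="two-sided")

# Categorical feature vs. reproduced / not reproduced: chi-squared test
# with Yates continuity correction on a hypothetical 2x2 count table.
# Rows: feature present / absent. Columns: reproduced / not reproduced.
table = [[30, 10], [15, 20]]
chi2, chi_p, dof, expected = stats.chi2_contingency(table, correction=True)

# Numeric feature across three levels of a categorical feature:
# Kruskal-Wallis H test as the nonparametric analogue of one-way ANOVA.
level_a = [1, 2, 3, 4]
level_b = [5, 6, 7, 8]
level_c = [9, 10, 11, 12]
h_stat, h_p = stats.kruskal(level_a, level_b, level_c)

print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4f}")
print(f"Chi-squared (corrected): chi2={chi2:.3f}, dof={dof}, p={chi_p:.4f}")
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={h_p:.4f}")
```

A significant Kruskal-Wallis result would normally be followed by a post-hoc comparison such as the Dunn test the paper cites, which needs a separate package or a manual implementation.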