Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Should We Really Use Post-Hoc Tests Based on Mean-Ranks?

Authors: Alessio Benavoli, Giorgio Corani, Francesca Mangili

JMLR 2016

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All examples refer to the analysis of the accuracy of different classifiers on multiple data sets. ... Example 3: Real Classifiers on UCI Data Sets. Finally, we compare the accuracies of seven classifiers on 54 datasets.
Researcher Affiliation	Academia	Alessio Benavoli EMAIL Giorgio Corani EMAIL Francesca Mangili EMAIL Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA) Scuola Universitaria Professionale della Svizzera italiana (SUPSI) Università della Svizzera italiana (USI) Manno, Switzerland
Pseudocode	No	The paper describes mathematical formulas and statistical tests but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip
Open Datasets	Yes	Example 3: Real Classifiers on UCI Data Sets. Finally, we compare the accuracies of seven classifiers on 54 datasets. The accuracies are reported in Table 2.
Dataset Splits	Yes	Each classifier has been assessed via 10 runs of 10-folds cross-validation.
Hardware Specification	No	The paper does not provide any specific hardware details such as GPU or CPU models used for running experiments.
Software Dependencies	No	We performed all the experiments using WEKA. ... The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip. No version numbers are specified for WEKA or MATLAB.
Experiment Setup	No	The paper details settings for statistical comparisons (e.g., Bonferroni correction, significance levels, p-values, 10 runs of 10-folds cross-validation for evaluation), but it does not specify hyperparameters (like learning rate, batch size, optimizer) or system-level training settings for the machine learning classifiers themselves.