Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Should We Really Use Post-Hoc Tests Based on Mean-Ranks?
Authors: Alessio Benavoli, Giorgio Corani, Francesca Mangili
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All examples refer to the analysis of the accuracy of different classifiers on multiple data sets. ... Example 3: Real Classifiers on UCI Data Sets. Finally, we compare the accuracies of seven classifiers on 54 datasets. |
| Researcher Affiliation | Academia | Alessio Benavoli, Giorgio Corani, Francesca Mangili; Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Scuola Universitaria Professionale della Svizzera italiana (SUPSI), Università della Svizzera italiana (USI), Manno, Switzerland |
| Pseudocode | No | The paper describes mathematical formulas and statistical tests but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip |
| Open Datasets | Yes | Example 3: Real Classifiers on UCI Data Sets. Finally, we compare the accuracies of seven classifiers on 54 datasets. The accuracies are reported in Table 2. |
| Dataset Splits | Yes | Each classifier has been assessed via 10 runs of 10-folds cross-validation. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models used for running experiments. |
| Software Dependencies | No | We performed all the experiments using WEKA. ... The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip. No version numbers are specified for WEKA or MATLAB. |
| Experiment Setup | No | The paper details the settings for its statistical comparisons (e.g., Bonferroni correction, significance levels, p-values, 10 runs of 10-fold cross-validation for evaluation), but it does not specify hyperparameters (such as learning rate, batch size, or optimizer) or system-level training settings for the machine learning classifiers themselves. |