Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Should We Really Use Post-Hoc Tests Based on Mean-Ranks?

Authors: Alessio Benavoli, Giorgio Corani, Francesca Mangili

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the inconsistencies of the mean-ranks test by presenting three examples. All examples refer to the analysis of the accuracy of different classifiers on multiple data sets. ... Example 3: Real Classifiers on UCI Data Sets. Finally, we compare the accuracies of seven classifiers on 54 datasets.
Researcher Affiliation | Academia | Alessio Benavoli EMAIL Giorgio Corani EMAIL Francesca Mangili EMAIL Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Scuola Universitaria Professionale della Svizzera italiana (SUPSI), Università della Svizzera italiana (USI), Manno, Switzerland
Pseudocode | No | The paper describes mathematical formulas and statistical tests but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip
Open Datasets | Yes | Example 3: Real Classifiers on UCI Data Sets. Finally, we compare the accuracies of seven classifiers on 54 datasets. The accuracies are reported in Table 2.
Dataset Splits | Yes | Each classifier has been assessed via 10 runs of 10-folds cross-validation.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models used for running experiments.
Software Dependencies | No | We performed all the experiments using WEKA. ... The MATLAB scripts of the above examples can be downloaded from ipg.idsia.ch/software/meanRanks/matlab.zip. No version numbers are specified for WEKA or MATLAB.
Experiment Setup | No | The paper details settings for the statistical comparisons (e.g., Bonferroni correction, significance levels, p-values) and the evaluation protocol (10 runs of 10-fold cross-validation), but it does not specify hyperparameters (such as learning rate, batch size, or optimizer) or system-level training settings for the machine learning classifiers themselves.
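
The evaluation protocol quoted in the Dataset Splits row (10 runs of 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' WEKA setup: the function name `repeated_kfold_indices`, the seed, and the unstratified shuffling are assumptions made only for illustration, and the 150-sample dataset size is hypothetical.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_runs=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: plain (unstratified) shuffling is assumed here;
    the paper does not state how WEKA was configured.
    """
    rng = random.Random(seed)
    for run in range(n_runs):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh permutation for every run
        for fold in range(n_folds):
            # Every n_folds-th item of the permutation forms the test fold.
            test_idx = order[fold::n_folds]
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            yield run, fold, train_idx, test_idx


# 10 runs x 10 folds gives 100 train/test splits per classifier and dataset.
splits = list(repeated_kfold_indices(n_samples=150))
print(len(splits))
```

Under this protocol, the accuracy reported for a classifier on a dataset would typically be the average over all 100 test folds.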