Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A Bayesian Wilcoxon signed-rank test based on the Dirichlet process

Authors: Alessio Benavoli, Giorgio Corani, Francesca Mangili, Marco Zaffalon, Fabrizio Ruggeri

ICML 2014 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show results dealing with the comparison of two classifiers using real and simulated data. By means of simulations on artificial and real world data, we use our test to decide if a certain classifier is significantly better than another.
Researcher Affiliation	Academia	IPG IDSIA, Manno, Switzerland and CNR IMATI, Milano, Italy
Pseudocode	No	The paper presents mathematical formulas and theorems but does not include structured pseudocode or algorithm blocks.
Open Source Code	Yes	The IDP test developed in this work can currently be used online (or downloaded as R or Matlab code) at http://ipg.idsia.ch/software/IDP.php.
Open Datasets	Yes	We run the WEKA implementation (Witten et al., 2011) of such classifiers on 70 data sets from the UCI repository: 54 classification data sets and 16 regression data sets, which we use for classification having discretized into 4 bins the target variable.
Dataset Splits	Yes	We evaluate via 10 folds cross-validation the accuracy of each classifier on each data set.
Hardware Specification	No	The paper describes experimental setup involving numerical simulations and the use of the WEKA tool, but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for these experiments.
Software Dependencies	No	The paper mentions 'WEKA implementation (Witten et al., 2011)' and states the IDP test can be downloaded as 'R or Matlab code,' but it does not specify concrete version numbers for WEKA, R, Matlab, or any other software dependencies.
Experiment Setup	Yes	Consider a Monte Carlo experiment in which paired values of accuracies Xi, Yi are generated for n = 30 multiple data sets based on the Gaussian models: Xi Yi for i = 1,...,n, with (difference in accuracy) ranging from 0.07 to 0.07 and σ = 0.12. ... The one-sided Wilcoxon test has been implemented according to the conventional decision criterion: p-value less than α = 0.05.
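
The Monte Carlo setup quoted in the Experiment Setup row can be illustrated with a short sketch. It is not the authors' code, and it covers only the frequentist baseline (the one-sided Wilcoxon signed-rank test at α = 0.05), not the IDP test. The quoted text does not spell out the Gaussian models, so the choice of independent draws X_i ~ N(0, σ²) and Y_i ~ N(delta, σ²) is an assumption, as are the function names `wilcoxon_one_sided_p` and `rejection_rate`, the number of trials, the seed, and the use of the normal approximation for the p-value. Only n = 30, σ = 0.12, and α = 0.05 come from the source.

```python
import math
import random


def wilcoxon_one_sided_p(x, y):
    """One-sided Wilcoxon signed-rank p-value for H1: y tends to exceed x.

    Uses the normal approximation without continuity or tie correction,
    which is an assumption of this sketch (reasonable for n around 30
    and continuous data).
    """
    diffs = [b - a for a, b in zip(x, y) if b != a]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Test statistic: sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail P(Z >= z)


def rejection_rate(delta, n=30, sigma=0.12, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated experiments in which the null is rejected.

    Assumed model (hypothetical, not taken from the source):
    X_i ~ N(0, sigma^2) and Y_i ~ N(delta, sigma^2), drawn independently.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        y = [rng.gauss(delta, sigma) for _ in range(n)]
        if wilcoxon_one_sided_p(x, y) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    # Illustrative grid of accuracy differences; the values are assumptions.
    for delta in (0.0, 0.035, 0.07):
        print(f"delta={delta:.3f}  rejection rate={rejection_rate(delta):.3f}")
```

Under these assumptions, the rejection rate at delta = 0 approximates the test's type I error, and the rates at positive delta approximate its power. Replacing the assumed models with the ones given in the paper would be needed before comparing against the reported results.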