Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Contamination-source based K-sample clustering

Authors: Xavier Milhaud, Denys Pommeret, Yahia Salhi, Pierre Vandekerkhove

JMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We prove the consistency of our approach under the assumption of the existence of true clusters and demonstrate the performances of our methodology through an extensive Monte Carlo study. Finally, we apply our methodology, implemented in the admix R package, to a European countries COVID-19 excess of mortality dataset, aiming to cluster countries similarly impacted by the pandemic across different age groups.
Researcher Affiliation	Academia	Xavier Milhaud EMAIL Aix Marseille Univ, CNRS, Centrale Marseille, I2M 13288 Marseille cedex 9, France; Denys Pommeret EMAIL Aix Marseille Univ, CNRS, Centrale Marseille, I2M 13288 Marseille cedex 9, France; Yahia Salhi EMAIL Université Claude Bernard Lyon 1, UCBL, ISFA LSAF EA2429 F-69007 Lyon, France; Pierre Vandekerkhove EMAIL Université Gustave Eiffel, LAMA (UMR 8050) 77420 Champs-sur-Marne, France
Pseudocode	Yes	Algorithm 1: Tuning of the parameter γ. Algorithm 2: Tuning of the parameter C. Algorithm 3: K-sample Contamination Model Clustering (KCMC).
Open Source Code	Yes	Finally, we apply our methodology, implemented in the admix R package, to a European countries COVID-19 excess of mortality dataset... See https://CRAN.R-project.org/package=admix for more information about the package on CRAN. ...All our numerical experiments were performed thanks to the R package admix...
Open Datasets	Yes	Finally, we apply our methodology... to a European countries COVID-19 excess of mortality dataset... The datasets of interest came from the Short-Term Mortality Fluctuations (STMF) data series compiled by the Human Mortality Database (HMD).
Dataset Splits	No	The paper describes methods to split simulated data or existing data for tuning purposes (e.g., "Split randomly the ith sample X(i) into two subpopulations...") to create artificial null hypotheses. However, it does not provide explicit train/validation/test splits for the main experimental evaluation or the real-world application data.
Hardware Specification	No	The paper states, "All our numerical experiments were performed thanks to the R package admix..." but does not specify any hardware details like CPU, GPU, or memory used for these experiments.
Software Dependencies	No	The paper mentions the use of "the admix R package" and "the R package admix" but does not provide specific version numbers for the R language or the packages themselves, which are necessary for reproducible software dependencies.
Experiment Setup	Yes	Unless otherwise stated, all our simulations were performed with fixed values ε0 = 0.99 and ε1 = 0.75 in (16) and (17)... We set B = 20. ...we set up the level α to 1%... The known distributions are multinomial ones with four categories here and we compare the unknown multinomial distributions caused by the COVID-19.
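
The Dataset Splits row quotes a tuning step in which a sample is split randomly into two subpopulations to create an artificial null hypothesis. The Python sketch below illustrates only that splitting idea; it is not the paper's algorithm and not code from the admix package. The function name `split_sample`, the near-equal split sizes, the seed, and the Gaussian toy data are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_sample(
    sample: Sequence[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Randomly split one sample into two disjoint subpopulations.

    Hypothetical helper: the near-equal sizes are an assumption,
    since the quoted passage does not state the split proportions.
    """
    shuffled = list(sample)   # copy so the caller's data is untouched
    rng.shuffle(shuffled)     # random permutation of the observations
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy data (assumption): 101 draws from a standard normal distribution.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(101)]

# Both halves come from the same sample, so by construction they share
# one underlying distribution, which is what makes the null "artificial".
first, second = split_sample(sample, rng)
```

Because the two halves are drawn from one sample, any test statistic computed between them reflects behaviour under a true null, which is the property the quoted tuning step relies on.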