Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Contamination-source based K-sample clustering

Authors: Xavier Milhaud, Denys Pommeret, Yahia Salhi, Pierre Vandekerkhove

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove the consistency of our approach under the assumption of the existence of true clusters and demonstrate the performances of our methodology through an extensive Monte Carlo study. Finally, we apply our methodology, implemented in the admix R package, to a European countries COVID-19 excess of mortality dataset, aiming to cluster countries similarly impacted by the pandemic across different age groups.
Researcher Affiliation | Academia | Xavier Milhaud EMAIL Aix Marseille Univ, CNRS, Centrale Marseille, I2M 13288 Marseille cedex 9, France; Denys Pommeret EMAIL Aix Marseille Univ, CNRS, Centrale Marseille, I2M 13288 Marseille cedex 9, France; Yahia Salhi EMAIL Université Claude Bernard Lyon 1, UCBL, ISFA LSAF EA2429 F-69007 Lyon, France; Pierre Vandekerkhove EMAIL Université Gustave Eiffel, LAMA (UMR 8050) 77420 Champs-sur-Marne, France
Pseudocode | Yes | Algorithm 1: Tuning of the parameter γ. Algorithm 2: Tuning of the parameter C. Algorithm 3: K-sample Contamination Model Clustering (KCMC).
Open Source Code | Yes | Finally, we apply our methodology, implemented in the admix R package, to a European countries COVID-19 excess of mortality dataset... See https://CRAN.R-project.org/package=admix for more information about the package on CRAN. ... All our numerical experiments were performed thanks to the R package admix...
Open Datasets | Yes | Finally, we apply our methodology... to a European countries COVID-19 excess of mortality dataset... The datasets of interest came from the Short-Term Mortality Fluctuations (STMF) data series compiled by the Human Mortality Database (HMD).
Dataset Splits | No | The paper describes methods to split simulated data or existing data for tuning purposes (e.g., "Split randomly the ith sample X(i) into two subpopulations...") to create artificial null hypotheses. However, it does not provide explicit train/validation/test splits for the main experimental evaluation or the real-world application data.
Hardware Specification | No | The paper states, "All our numerical experiments were performed thanks to the R package admix..." but does not specify any hardware details such as the CPU, GPU, or memory used for these experiments.
Software Dependencies | No | The paper mentions the admix R package but does not give version numbers for the R language or for the package itself, which are needed to reproduce the software environment.
Experiment Setup | Yes | Unless otherwise stated, all our simulations were performed with fixed values ε0 = 0.99 and ε1 = 0.75 in (16) and (17)... We set B = 20. ...we set up the level α to 1%... The known distributions are multinomial ones with four categories here and we compare the unknown multinomial distributions caused by the COVID-19.
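
The Dataset Splits row quotes a step that randomly splits one sample into two subpopulations to create an artificial null hypothesis. Because both halves come from the same sample, any test comparing them is run under a null that is true by construction. The sketch below illustrates only that idea; it is not the paper's procedure. The function name `split_sample`, the equal-size split, and the example data are assumptions, since the quoted step is truncated in the summary.

```python
import random


def split_sample(sample, seed=None):
    """Randomly split one sample into two disjoint subsamples.

    Hypothetical helper: the near-equal split sizes are an assumption,
    not something stated in the summary above.
    """
    rng = random.Random(seed)
    shuffled = list(sample)   # copy so the caller's data is left untouched
    rng.shuffle(shuffled)     # random permutation of the observations
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Illustrative observations (made up for this example).
observations = [0.3, 1.2, -0.7, 2.5, 0.9, -1.1, 0.4]
first, second = split_sample(observations, seed=0)
```

Every observation lands in exactly one of the two halves, so together they reproduce the original sample.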
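
The Experiment Setup row mentions known multinomial distributions with four categories, a level α of 1%, and B = 20. As a rough illustration of what a contaminated sample of this kind could look like, the sketch below draws each observation either from a known four-category distribution or from an unknown one. The mixture weight, the category probabilities, and the function name `sample_contaminated` are all hypothetical; the summary does not state the model's parametrization or the role of B.

```python
import random

ALPHA = 0.01  # test level quoted in the summary (1%)
B = 20        # value quoted in the summary; its role is not described there


def sample_contaminated(n, weight, unknown_probs, known_probs, seed=None):
    """Draw n category labels from a two-component mixture.

    With probability `weight` a draw comes from the unknown component,
    otherwise from the known one. Hypothetical parametrization.
    """
    rng = random.Random(seed)
    categories = list(range(len(known_probs)))
    draws = []
    for _ in range(n):
        probs = unknown_probs if rng.random() < weight else known_probs
        draws.append(rng.choices(categories, weights=probs, k=1)[0])
    return draws


# Made-up probabilities over four categories, for illustration only.
known = [0.25, 0.25, 0.25, 0.25]
unknown = [0.10, 0.20, 0.30, 0.40]
labels = sample_contaminated(500, weight=0.3, unknown_probs=unknown,
                             known_probs=known, seed=1)
```

Setting `weight` to 0 yields draws from the known component only, which is a simple check on the sampler.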