Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Efficient Multilevel Clustering via Wasserstein Distances

Authors: Viet Huynh, Nhat Ho, Nhan Dam, XuanLong Nguyen, Mikhail Yurochkin, Hung Bui, Dinh Phung

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, experimental results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach."
Researcher Affiliation | Collaboration | Viet Huynh (Faculty of Information Technology, Monash University); Nhat Ho (Department of Statistics and Data Sciences, University of Texas at Austin); Nhan Dam (Faculty of Information Technology, Monash University); XuanLong Nguyen (Department of Statistics, University of Michigan); Mikhail Yurochkin (IBM Research); Hung Bui (VinAI Research); Dinh Phung (Faculty of Information Technology, Monash University)
Pseudocode | Yes | Algorithm 1: Multilevel Wasserstein Means (MWM); Algorithm 2: Multilevel Wasserstein Means with Sharing (MWMS); Algorithm 3: Multilevel Wasserstein Means with Context (MWMC); Algorithm 4: Multilevel Wasserstein Geometric Median (MWGM); Algorithm 5: MapReduce for Multilevel Wasserstein Means (MWM); Algorithm 6: Wasserstein barycenter under the entropic version of the W1 metric; Algorithm 7: Smoothed Primal T^γ and Dual b^γ Optima; Algorithm 8: Fixed-support Wasserstein barycenter; Algorithm 9: Free-support Wasserstein barycenter
Open Source Code | Yes | Code is available at https://github.com/viethhuynh/wasserstein-means
Open Datasets | Yes | The LabelMe dataset consists of 2,688 annotated images... (http://labelme.csail.mit.edu); the StudentLife dataset is a large dataset... (https://studentlife.cs.dartmouth.edu/dataset.html)
Dataset Splits | No | The paper uses synthetic and real-world datasets for its empirical studies, but does not explicitly provide training/validation/test splits, percentages, or a cross-validation methodology for reproducing the experiments. It describes generating synthetic data and using filtered real-world datasets (1,800 images from LabelMe, 49 documents from StudentLife) without specifying how these are partitioned for model evaluation.
Hardware Specification | Yes | "All experiments are conducted on the same machine (Windows 10 64-bit, Core i7 3.4GHz CPU and 16GB RAM)."
Software Dependencies | No | The paper mentions using the "Apache Spark framework" for its parallel implementation and a "GPU implementation of Cuturi's algorithms", but does not provide version numbers for these or any other key software dependencies required for reproducibility.
Experiment Setup | Yes | In our experiments, we used a fixed value of all entropic regularization parameters, τ = 10. For the regularization weight λ, we heuristically choose it to balance the global and local terms, i.e., $\lambda \approx W_2^2\bigl(H, \frac{1}{m}\sum_{j=1}^{m}\delta_{G_j}\bigr) \big/ \sum_{j=1}^{m} W_2^2\bigl(G_j, P_{n_j}^{j}\bigr)$.
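The entropic-regularized Wasserstein computations referenced above (Algorithms 6–9, and the fixed regularization τ = 10 in the experiment setup) can be illustrated with a minimal Sinkhorn iteration. This is a hedged sketch for exposition, not the authors' released code: the function name `sinkhorn_w2`, the fixed iteration count, and the squared-Euclidean cost are assumptions made here.

```python
import numpy as np

def sinkhorn_w2(a, b, X, Y, tau=10.0, n_iter=200):
    """Entropic-regularized squared-W2 cost between two discrete measures.

    a, b : weight vectors summing to 1, shapes (n,) and (m,)
    X, Y : support points, shapes (n, d) and (m, d)
    tau  : entropic regularization (the paper fixes tau = 10)
    """
    # Pairwise squared Euclidean cost matrix, shape (n, m).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    # Gibbs kernel from the entropic penalty.
    K = np.exp(-C / tau)
    # Alternate scaling updates (Sinkhorn iterations).
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    # Transport plan and its regularized transport cost.
    T = u[:, None] * K * v[None, :]
    return float((T * C).sum())
```

For instance, two unit point masses at 0 and 3 give a transport cost of 9 (the squared Euclidean distance), since a 1×1 plan is fully constrained by its marginals regardless of τ.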