Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On Efficient Multilevel Clustering via Wasserstein Distances
Authors: Viet Huynh, Nhat Ho, Nhan Dam, XuanLong Nguyen, Mikhail Yurochkin, Hung Bui, Dinh Phung
JMLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experimental results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach. |
| Researcher Affiliation | Collaboration | Viet Huynh EMAIL Faculty of Information Technology, Monash University; Nhat Ho EMAIL Department of Statistics and Data Sciences, University of Texas, Austin; Nhan Dam EMAIL Faculty of Information Technology, Monash University; Xuan Long Nguyen EMAIL Department of Statistics, University of Michigan; Mikhail Yurochkin EMAIL IBM Research; Hung Bui EMAIL Vin AI Research; Dinh Phung EMAIL Faculty of Information Technology, Monash University |
| Pseudocode | Yes | Algorithm 1 Multilevel Wasserstein Means (MWM); Algorithm 2 Multilevel Wasserstein Means with Sharing (MWMS); Algorithm 3 Multilevel Wasserstein Means with Context (MWMC); Algorithm 4 Multilevel Wasserstein Geometric Median (MWGM); Algorithm 5 Map Reduce for Multilevel Wasserstein Means (MWM); Algorithm 6 Wasserstein barycenter under the entropic version of W1 metric; Algorithm 7 Smoothed Primal T γ and Dual b γ Optima; Algorithm 8 Fix-support Wasserstein barycenter; Algorithm 9 Free-support Wasserstein barycenter |
| Open Source Code | Yes | Code is available at https://github.com/viethhuynh/wasserstein-means |
| Open Datasets | Yes | LabelMe dataset consists of 2,688 annotated images... (http://labelme.csail.mit.edu); StudentLife dataset is a large dataset... (https://studentlife.cs.dartmouth.edu/dataset.html) |
| Dataset Splits | No | The paper uses synthetic and real-world datasets for empirical studies, but does not explicitly provide training/test/validation dataset splits, percentages, or cross-validation methodologies for reproducing experiments. It describes generating synthetic data and using filtered real-world datasets (1,800 images from Label Me, 49 documents from Student Life) for analysis without specifying how these are partitioned for model evaluation. |
| Hardware Specification | Yes | All experiments are conducted on the same machine (Windows 10 64-bit, core i7 3.4GHz CPU and 16GB RAM). |
| Software Dependencies | No | The paper mentions using "Apache Spark framework" for parallel implementation and "GPU implementation of Cuturi's algorithms", but does not provide specific version numbers for these or any other key software dependencies required for reproducibility. |
| Experiment Setup | Yes | In our experiments, we used a fixed value of all entropic regularization parameters τ = 10. For the regularized term λ, we heuristically choose to balance global and local terms, i.e., λ = W_2^2(H, (1/m) ∑_{j=1}^m δ_{G_j}) / ∑_{j=1}^m W_2^2(G_j, P_{n_j}^j). |
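The entropic regularization described in the setup row corresponds to Sinkhorn-style approximation of Wasserstein distances (Cuturi's algorithm, mentioned in the software-dependencies row). A minimal sketch of that idea, assuming the convention K = exp(−τC) in which larger τ gives a sharper approximation of the unregularized cost; the paper fixes τ = 10, though its exact parameterization may differ:

```python
import numpy as np

def sinkhorn(a, b, C, tau=10.0, n_iters=200):
    """Entropy-regularized optimal transport cost between histograms a and b.

    a, b : 1-D probability vectors (the two discrete measures' weights)
    C    : pairwise cost matrix between the supports of a and b
    tau  : entropic regularization strength (the paper fixes tau = 10)
    """
    K = np.exp(-tau * C)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):      # alternating marginal-scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return np.sum(P * C)             # transport cost under that plan

# Toy check: identical measures on the same support should cost ~0.
cost = sinkhorn(np.array([0.5, 0.5]),
                np.array([0.5, 0.5]),
                np.array([[0.0, 1.0], [1.0, 0.0]]))
```

This is only an illustrative sketch of the entropic-OT building block; the multilevel algorithms in the paper (MWM, MWMS, etc.) iterate barycenter and assignment steps on top of solvers of this kind.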