Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Achieving Optimal Clustering in Gaussian Mixture Models with Anisotropic Covariance Structures

Authors: Xin Chen, Anderson Ye Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Numerical Studies: In this section, we compare the performance of our methods with other popular clustering methods on synthetic and real datasets under different settings.
Researcher Affiliation | Academia | Xin Chen, Princeton University; Anderson Ye Zhang, University of Pennsylvania
Pseudocode | Yes | Algorithm 1: Adjusted Lloyd's Algorithm for Model 1. and Algorithm 2: Adjusted Lloyd's Algorithm for Model 2.
Open Source Code | No | The paper does not contain any explicit statement about making its source code available or a direct link to a code repository.
Open Datasets | Yes | To further demonstrate the effectiveness of our methods, we conduct experiments using the Fashion-MNIST dataset [23].
Dataset Splits | No | The paper conducts numerical studies on synthetic and real datasets (Fashion-MNIST) but does not specify the explicit training, validation, and test dataset splits used for these experiments.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to conduct the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, specific libraries).
Experiment Setup | Yes | In this section, we compare the performance of our methods with other popular clustering methods on synthetic and real datasets under different settings. ... We independently generate $n = 1200$ samples with dimension $d = 50$ from $k = 30$ clusters. Each cluster has 40 samples. We set $\Sigma = U^T \Lambda U$, where $\Lambda$ is a $50 \times 50$ diagonal matrix with diagonal elements selected from 0.5 to 8 with equal space and $U$ is a randomly generated orthogonal matrix. The centers $\{\theta_a\}_{a \in [n]}$ are orthogonal to each other with $\|\theta_1\| = \dots = \|\theta_{30}\| = 9$. and In this case, we take $n = 1200$, $k = 2$, and $d = 9$. We set $\Sigma_1 = I_d$ and $\Sigma_2 = \Lambda_2$, a diagonal matrix where the first diagonal entry is 0.5 and the remaining entries are 5. We set the cluster sizes to be 900 and 300, respectively. To simplify the calculation of $\mathrm{SNR}'$, we set $\theta_1 = 0$ and $\theta_2 = 5e_1$... and Additionally, the dashed lines in the left and right panels represent the optimal exponents $\mathrm{SNR}^2/8$ and $\mathrm{SNR}'^2/8$ of the minimax bounds, respectively. It is observed that both Algorithm 1 and Algorithm 2 meet these benchmarks after three iterations. and we apply PCA to reduce dimensionality from 784 to 50 by retaining the top 50 principal components.
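
Since no source code is released, the synthetic settings quoted under Experiment Setup can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (random_orthogonal, sample_model1, sample_model2, pca_reduce) are hypothetical. Drawing the orthogonal matrix via a QR decomposition, taking the centers as scaled rows of a second random orthogonal matrix, and computing PCA by SVD on centered data are all assumptions, because the quoted text does not say how these steps were carried out.

```python
import numpy as np


def random_orthogonal(d, rng):
    # Assumption: a random orthogonal matrix from the QR decomposition of a
    # Gaussian matrix; the quoted setup does not specify the construction.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal


def sample_model1(rng, n=1200, d=50, k=30, center_norm=9.0):
    # Shared covariance Sigma = U^T Lambda U, Lambda equally spaced in [0.5, 8].
    lam = np.linspace(0.5, 8.0, d)
    U = random_orthogonal(d, rng)
    sigma = U.T @ np.diag(lam) @ U
    # Assumption: mutually orthogonal centers of norm 9, taken as scaled rows
    # of another random orthogonal matrix.
    centers = center_norm * random_orthogonal(d, rng)[:k]
    labels = np.repeat(np.arange(k), n // k)  # 40 samples per cluster
    # Rows of (z * sqrt(lam)) @ U have covariance U^T Lambda U.
    z = rng.standard_normal((n, d))
    X = centers[labels] + (z * np.sqrt(lam)) @ U
    return X, labels, centers, sigma


def sample_model2(rng, n1=900, n2=300, d=9, shift=5.0):
    # Cluster 1: center 0, covariance I_d.
    # Cluster 2: center 5 * e_1, covariance diag(0.5, 5, ..., 5).
    var2 = np.concatenate(([0.5], np.full(d - 1, 5.0)))
    theta2 = np.zeros(d)
    theta2[0] = shift
    X1 = rng.standard_normal((n1, d))
    X2 = theta2 + rng.standard_normal((n2, d)) * np.sqrt(var2)
    X = np.vstack([X1, X2])
    labels = np.concatenate((np.zeros(n1, dtype=int), np.ones(n2, dtype=int)))
    return X, labels


def pca_reduce(X, n_components=50):
    # Assumption: PCA via SVD of the centered data, keeping the top components.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T
```

Calling sample_model1(np.random.default_rng(0)) returns the data matrix, the true labels, the centers, and the covariance, which is enough to score any clustering method against the ground truth. The seed is arbitrary and is not taken from the paper.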