Coresets for Nonparametric Estimation - the Case of DP-Means
Authors: Olivier Bachem, Mario Lucic, Andreas Krause
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. |
| Researcher Affiliation | Academia | ETH Zurich, Switzerland |
| Pseudocode | Yes | Algorithm 1 DP-Means++ |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | USGS (United States Geological Survey, 2010) locations of 59 209 earthquakes between 1972 and 2010 mapped to 3D space using WGS 84. CSN (Faulkner et al., 2011) 7GB of cellphone accelerometer data processed into 80 000 observations and 17 features. KDD (KDD Cup 2004, 2004) 145 751 samples with 74 features measuring the match between a protein and a native sequence. MSYP (Bertin-Mahieux et al., 2011) 90 features from 515 345 songs of the Million Song datasets used for predicting the year of songs. MNIST (Le Cun et al., 1998) 70 000 images of handwritten digits of size 28 × 28 pixels transformed using randomized PCA with whitening to 10 dimensions. KDD Cup 2004. Protein Homology Dataset. Available at http://osmot.cs.cornell.edu/kddcup/datasets.html, 2004. |
| Dataset Splits | No | The paper discusses using subsamples and solving the DP-Means problem on them, but does not explicitly describe train/validation/test dataset splits or cross-validation for the listed datasets. |
| Hardware Specification | Yes | All experiments were run on an Intel Xeon machine with 24 2.9GHz processors and 256GB RAM. |
| Software Dependencies | No | The paper mentions algorithms like K-Means++ and Lloyd's algorithm, but does not provide specific version numbers for any software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | We solve the K-Means clustering problem for different values of k chosen from a logarithmic grid of 20 points between $2^{-1} k$ and $2^{2} k$ (see Table 1 for values of k). For each dataset we then select a value λ from this range (see Table 1) and roughly estimate the number of clusters k in the optimal solution from the K-Means results. |
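
The logarithmic grid mentioned in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `log_grid` and `k_grid`, the reference value `k_ref`, and the default multipliers 0.5 and 4.0 are assumptions chosen for the example. Only the grid size of 20 points comes from the quoted text.

```python
import math


def log_grid(low, high, num_points=20):
    """Return num_points values evenly spaced on a log scale from low to high."""
    if low <= 0 or high <= 0:
        raise ValueError("log-spaced grids need strictly positive endpoints")
    if num_points < 2:
        raise ValueError("need at least two grid points")
    log_low = math.log(low)
    step = (math.log(high) - log_low) / (num_points - 1)
    return [math.exp(log_low + i * step) for i in range(num_points)]


def k_grid(k_ref, low_factor=0.5, high_factor=4.0, num_points=20):
    """Candidate cluster counts around a reference value k_ref.

    low_factor and high_factor are assumed multipliers for illustration.
    Values are rounded to integers, so duplicates are removed and the
    result may contain fewer than num_points entries.
    """
    values = log_grid(low_factor * k_ref, high_factor * k_ref, num_points)
    return sorted({max(1, round(v)) for v in values})


if __name__ == "__main__":
    # Hypothetical reference value; the paper's actual values are in its Table 1.
    print(k_grid(100))
```

Consecutive points of `log_grid` differ by a constant ratio, which is what distinguishes a logarithmic grid from a linear one. Rounding to integers is a design choice made here because k counts clusters.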