reproducibilityindex.ai

Coresets for Nonparametric Estimation - the Case of DP-Means

Authors: Olivier Bachem, Mario Lucic, Andreas Krause

ICML 2015

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets.
Researcher Affiliation	Academia	ETH Zurich, Switzerland
Pseudocode	Yes	Algorithm 1 DP-Means++
Open Source Code	No	The paper does not provide an explicit statement about releasing its source code, nor does it provide a link to a code repository for the methodology described.
Open Datasets	Yes	USGS (United States Geological Survey, 2010) locations of 59 209 earthquakes between 1972 and 2010 mapped to 3D space using WGS 84. CSN (Faulkner et al., 2011) 7GB of cellphone accelerometer data processed into 80 000 observations and 17 features. KDD (KDD Cup 2004, 2004) 145 751 samples with 74 features measuring the match between a protein and a native sequence. MSYP (Bertin-Mahieux et al., 2011) 90 features from 515 345 songs of the Million Song datasets used for predicting the year of songs. MNIST (Le Cun et al., 1998) 70 000 images of handwritten digits of size 28 × 28 pixels transformed using randomized PCA with whitening to 10 dimensions. KDD Cup 2004. Protein Homology Dataset. Available at http://osmot.cs.cornell.edu/kddcup/datasets.html, 2004.
Dataset Splits	No	The paper discusses using subsamples and solving the DP-Means problem on them, but does not explicitly describe train/validation/test dataset splits or cross-validation for the listed datasets.
Hardware Specification	Yes	All experiments were run on an Intel Xeon machine with 24 2.9GHz processors and 256GB RAM.
Software Dependencies	No	The paper mentions algorithms like K-Means++ and Lloyd's algorithm, but does not provide specific version numbers for any software dependencies or libraries used in the implementation.
Experiment Setup	Yes	We solve the K-Means clustering problem for different values of k chosen from a logarithmic grid of 20 points between 2^{-1} k and 2^{2} k (see Table 1 for values of k). For each dataset we then select a value λ from this range (see Table 1) and roughly estimate the number of clusters k in the optimal solution from the K-Means results.
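
The Experiment Setup row describes choosing k from a logarithmic grid of 20 points between two multiples of a reference cluster count. The sketch below shows one way such a grid could be built. The function name `log_grid`, the parameter names, and the default multipliers 0.5 and 4.0 are illustrative assumptions, not values or code confirmed by the paper, and rounding to integers is likewise a choice made here.

```python
import math


def log_grid(k_ref, low_factor=0.5, high_factor=4.0, num_points=20):
    """Return integer k values spaced evenly on a log scale.

    k_ref, low_factor and high_factor are hypothetical parameters:
    the grid spans [low_factor * k_ref, high_factor * k_ref].
    """
    log_low = math.log(low_factor * k_ref)
    log_high = math.log(high_factor * k_ref)
    step = (log_high - log_low) / (num_points - 1)

    # Evenly spaced in log space, then mapped back with exp.
    raw = [math.exp(log_low + i * step) for i in range(num_points)]

    # k must be a positive integer; duplicates created by rounding are dropped.
    return sorted({max(1, round(v)) for v in raw})


if __name__ == "__main__":
    print(log_grid(100))
```

For small reference values the rounding step can merge neighbouring grid points, so the returned list may contain fewer than `num_points` entries.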