Nested Mini-Batch K-Means

Authors: James Newling, François Fleuret

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100× earlier than the standard mini-batch algorithm. We have performed experiments on 3 dense datasets and the sparse dataset used in Sculley (2010).
Researcher Affiliation | Academia | James Newling, Idiap Research Institute & EPFL (james.newling@idiap.ch); François Fleuret, Idiap Research Institute & EPFL (francois.fleuret@idiap.ch).
Pseudocode | Yes | Algorithm 1 assignment-with-bounds(i); Algorithm 2 initialise-c-S-v; Algorithm 3 accumulate(i); Algorithm 4 mbatch; Algorithm 5 nmbatch. (A minimal sketch of the mini-batch update appears after this table.)
Open Source Code | Yes | Our C++ and Python code is available at https://github.com/idiap/eakmeans.
Open Datasets | Yes | The INFMNIST dataset (Loosli et al., 2007) is an extension of MNIST..., STL10P (Coates et al., 2011) consists of..., KDDC98 contains..., the RCV1 dataset of Lewis et al. (2004) consists of data...
Dataset Splits | Yes | We use 400,000 such digits for performing k-means and 40,000 for computing a validation energy E_V. We train with 960,000 patches and use 40,000 for validation. KDDC98 contains 75,000 training samples and 20,000 validation samples.
Hardware Specification | No | Experiments were all single-threaded. The paper does not provide specific hardware details such as CPU/GPU models or memory sizes used for the experiments.
Software Dependencies | No | The paper mentions implementations in C++ and Python and compares against the scikit-learn and sofia implementations, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | The batchsize for mbatch and the initial batchsize for nmbatch are 5,000, with k = 50 clusters. For 20 random seeds, the training dataset is shuffled and the first k datapoints are taken as initialising centroids. Then, for each of the algorithms, k-means is run on the shuffled training set, and at regular intervals a validation energy E_V is computed on the validation set. (A sketch of this protocol follows the table.)
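
The paper's mbatch baseline (Algorithm 4) is the standard mini-batch k-means of Sculley (2010), which nmbatch accelerates by nesting batches so that earlier distance computations can be reused. As a reference point for the listings named above, here is a minimal Python sketch of that standard update, using the initialisation described in the experiment setup (shuffle, take the first k points). Function and variable names are illustrative and are not taken from the eakmeans code.

    import numpy as np

    def mini_batch_kmeans(X, k, batch_size=5000, n_iters=100, seed=0):
        # Minimal sketch of standard mini-batch k-means (Sculley, 2010);
        # not the paper's nmbatch, which additionally nests batches.
        rng = np.random.default_rng(seed)
        # Initialisation as in the experiments: shuffle the training set
        # and take the first k datapoints as centroids.
        X = rng.permutation(X)
        C = X[:k].astype(float)
        counts = np.zeros(k)  # per-centroid assignment counts (v in the paper)
        for _ in range(n_iters):
            batch = X[rng.choice(len(X), size=batch_size, replace=False)]
            # Assign each batch point to its nearest centroid.
            d2 = ((batch[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            nearest = d2.argmin(1)
            # Each centroid moves toward the running mean of all samples
            # ever assigned to it (per-centroid learning rate 1/count).
            for j, x in zip(nearest, batch):
                counts[j] += 1
                C[j] += (x - C[j]) / counts[j]
        return C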
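The evaluation protocol quoted in the Experiment Setup row can likewise be sketched. This assumes E_V is the usual k-means energy (mean squared distance from each validation point to its nearest centroid) averaged over the validation set, which is the natural reading but not spelled out here; the data below is a random stand-in for the real datasets, and it reuses the mini_batch_kmeans sketch above.

    def validation_energy(V, C):
        # Assumed form of E_V: mean squared distance to nearest centroid.
        d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return d2.min(1).mean()

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(10000, 20))  # stand-in training data
    X_val = rng.normal(size=(1000, 20))     # stand-in validation data
    for seed in range(20):  # 20 random seeds, as in the paper's setup
        C = mini_batch_kmeans(X_train, k=50, batch_size=1000,
                              n_iters=20, seed=seed)
        print(seed, validation_energy(X_val, C))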