Nested Mini-Batch K-Means

Authors: James Newling, François Fleuret

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100× earlier than the standard mini-batch algorithm. We have performed experiments on 3 dense datasets and the sparse dataset used in Sculley (2010).
Researcher Affiliation | Academia | James Newling, Idiap Research Institute & EPFL (james.newling@idiap.ch); François Fleuret, Idiap Research Institute & EPFL (francois.fleuret@idiap.ch).
Pseudocode | Yes | Algorithm 1 assignment-with-bounds(i); Algorithm 2 initialise-c-S-v; Algorithm 3 accumulate(i); Algorithm 4 mbatch; Algorithm 5 nmbatch. (A minimal sketch of the mini-batch update appears after this table.)
Open Source Code | Yes | Our C++ and Python code is available at https://github.com/idiap/eakmeans.
Open Datasets | Yes | The INFMNIST dataset (Loosli et al., 2007) is an extension of MNIST..., STL10P (Coates et al., 2011) consists of..., KDDC98 contains..., the RCV1 dataset of Lewis et al. (2004) consists of data...
Dataset Splits | Yes | We use 400,000 such digits for performing k-means and 40,000 for computing a validation energy E_V. We train with 960,000 patches and use 40,000 for validation. KDDC98 contains 75,000 training samples and 20,000 validation samples.
Hardware Specification | No | Experiments were all single-threaded. The paper does not provide specific hardware details such as CPU/GPU models or memory sizes used for the experiments.
Software Dependencies | No | The paper mentions implementations in C++ and Python and compares against the scikit-learn and sofia implementations, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | The batchsize for mbatch and the initial batchsize for nmbatch are 5,000, with k = 50 clusters. For 20 random seeds, the training dataset is shuffled and the first k datapoints are taken as initialising centroids. Then, for each of the algorithms, k-means is run on the shuffled training set, and at regular intervals a validation energy E_V is computed on the validation set. (A sketch of this protocol follows the table.)
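
The paper's mbatch baseline (Algorithm 4) is the standard mini-batch k-means of Sculley (2010), which nmbatch accelerates by nesting batches so that earlier distance computations can be reused. As a reference point for the listings named above, here is a minimal Python sketch of that standard update, using the initialisation described in the experiment setup (shuffle, take the first k points). Function and variable names are illustrative and are not taken from the eakmeans code.

    import numpy as np

    def mini_batch_kmeans(X, k, batch_size=5000, n_iters=100, seed=0):
        # Minimal sketch of standard mini-batch k-means (Sculley, 2010);
        # not the paper's nmbatch, which additionally nests batches.
        rng = np.random.default_rng(seed)
        # Initialisation as in the experiments: shuffle the training set
        # and take the first k datapoints as centroids.
        X = rng.permutation(X)
        C = X[:k].astype(float)
        counts = np.zeros(k)  # per-centroid assignment counts (v in the paper)
        for _ in range(n_iters):
            batch = X[rng.choice(len(X), size=batch_size, replace=False)]
            # Assign each batch point to its nearest centroid.
            d2 = ((batch[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            nearest = d2.argmin(1)
            # Each centroid moves toward the running mean of all samples
            # ever assigned to it (per-centroid learning rate 1/count).
            for j, x in zip(nearest, batch):
                counts[j] += 1
                C[j] += (x - C[j]) / counts[j]
        return C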
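The evaluation protocol quoted in the Experiment Setup row can likewise be sketched. This assumes E_V is the usual k-means energy (mean squared distance from each validation point to its nearest centroid) averaged over the validation set, which is the natural reading but not spelled out here; the data below is a random stand-in for the real datasets, and it reuses the mini_batch_kmeans sketch above.

    def validation_energy(V, C):
        # Assumed form of E_V: mean squared distance to nearest centroid.
        d2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return d2.min(1).mean()

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(10000, 20))  # stand-in training data
    X_val = rng.normal(size=(1000, 20))     # stand-in validation data
    for seed in range(20):  # 20 random seeds, as in the paper's setup
        C = mini_batch_kmeans(X_train, k=50, batch_size=1000,
                              n_iters=20, seed=seed)
        print(seed, validation_energy(X_val, C))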