Statistical and Computational Trade-Offs in Kernel K-Means
Authors: Daniele Calandriello, Lorenzo Rosasco
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now evaluate experimentally the claims of Theorem 1, namely that sampling Õ(√n/γ) points increases the excess risk by an extra γ/√n factor, and that m = √n is sufficient to recover the optimal rate. We use the Nystroem and MiniBatchKMeans classes from the sklearn python library [28] to implement kernel k-means with Nyström embedding (Algorithm 1) and we compute the solution C^{++}_{n,m}. For our experiments we follow the same approach as Wang et al. [35], and test our algorithm on two variants of the MNIST digit dataset. |
| Researcher Affiliation | Academia | Daniele Calandriello (LCSL, IIT & MIT, Genoa, Italy); Lorenzo Rosasco (University of Genoa, LCSL, IIT & MIT) |
| Pseudocode | Yes | Algorithm 1: Nyström Kernel K-Means |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology or provide a link to a code repository. |
| Open Datasets | Yes | We use the Nystroem and MiniBatchKMeans classes from the sklearn python library [28] to implement kernel k-means with Nyström embedding (Algorithm 1) and we compute the solution C^{++}_{n,m}. For our experiments we follow the same approach as Wang et al. [35], and test our algorithm on two variants of the MNIST digit dataset. In particular, MNIST60K [20] is the original MNIST dataset containing pictures, each with d = 784 pixels. ... To test the scalability of our approach we also consider the MNIST8M dataset from the infinite MNIST project [23]... [20] Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/. [23] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007. |
| Dataset Splits | Yes | We split the dataset in two parts: n = 60000 samples are used to compute the centroids C^{++}_{n,m}, and we leave out 10000 unseen samples to compute W(C^{++}_{n,m}, µ_test), as a proxy for W(C^{++}_{n,m}, µ). To test the scalability of our approach we also consider the MNIST8M dataset from the infinite MNIST project [23], constructed using non-trivial transformations and corruptions of the original MNIST60K images. Here we compute C^{++}_{n,m} using n = 8000000 images, and compute W(C^{++}_{n,m}, µ_test) on 100000 unseen images. |
| Hardware Specification | Yes | MNIST60K: these experiments are small enough to run in less than a minute on a single laptop with 4 cores and 8GB of RAM. MNIST8M: to test the scalability of our approach, we run the same experiment on millions of points. Note that we carry out our MNIST8M experiment on a single 36-core machine with 128GB of RAM... We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla k40 GPU used for this research. |
| Software Dependencies | No | The paper mentions using the "Nystroem and Mini Batch Kmeans classes from the sklearn python library [28]" but does not specify a version number for scikit-learn or Python, which is required for reproducibility. |
| Experiment Setup | Yes | As in Wang et al. [35] we use a Gaussian kernel with bandwidth σ = (1/n²) √(Σ_{i,j} ‖x_i − x_j‖²). Finally, when k-means++ initialization is used, the expected number of iterations required for Lloyd's algorithm to converge is only logarithmic [1]. |
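
The bandwidth heuristic quoted in the Experiment Setup row can be written out as a small worked example. The sketch below is ours, not the authors' code: it evaluates σ = (1/n²)√(Σ_{i,j} ‖x_i − x_j‖²) with scikit-learn's pairwise-distance helper; the function name and the toy data are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of the quoted bandwidth heuristic
# sigma = (1/n^2) * sqrt( sum_{i,j} ||x_i - x_j||^2 ).
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances


def quoted_bandwidth(X: np.ndarray) -> float:
    """Return (1/n^2) * sqrt(sum of all squared pairwise distances between rows of X)."""
    n = len(X)
    sq_dists = euclidean_distances(X, squared=True)  # n x n matrix of ||x_i - x_j||^2
    return float(np.sqrt(sq_dists.sum()) / n**2)


# Toy usage on random data; for the MNIST-scale experiments one would evaluate this on
# a subsample, since the full 60000 x 60000 distance matrix does not fit in memory.
X = np.random.default_rng(0).normal(size=(1000, 784))
print(f"sigma = {quoted_bandwidth(X):.6f}")
```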
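
For the Research Type, Open Datasets, and Dataset Splits rows, a minimal end-to-end sketch of the described pipeline could look as follows. It is a hedged reconstruction under several assumptions, not the authors' implementation: MNIST is fetched via fetch_openml with the standard 60000/10000 split, pixels are rescaled to [0, 1], k = 10 clusters and m = √n landmarks are used, the RBF bandwidth is set with a root-mean-squared pairwise-distance heuristic on a subsample (the paper's exact choice follows Wang et al. [35]; see the Experiment Setup row), and the held-out quantization error stands in for W(C^{++}_{n,m}, µ_test).

```python
# Hedged reconstruction (not the authors' code): Nystroem embedding + MiniBatchKMeans
# on MNIST, with a held-out quantization error as a proxy for W(C, mu_test).
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import fetch_openml
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import euclidean_distances

# MNIST: 70000 images with d = 784 pixels; first 60000 for training, last 10000 held out.
X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X.astype(np.float64) / 255.0  # assumption: rescale pixels to [0, 1]
X_train, X_test = X[:60000], X[60000:]

# Assumed RBF parameterization exp(-gamma ||x - y||^2), with the bandwidth set by a
# root-mean-squared pairwise-distance heuristic on a subsample (illustrative choice).
rng = np.random.default_rng(0)
sub = X_train[rng.choice(len(X_train), size=2000, replace=False)]
sigma = np.sqrt(euclidean_distances(sub, squared=True).mean())
gamma = 1.0 / (2.0 * sigma**2)

# Nystroem embedding with m ~ sqrt(n) landmarks (the regime highlighted by Theorem 1),
# followed by k-means++-initialised mini-batch k-means with k = 10 clusters.
m = int(np.sqrt(len(X_train)))
embedding = Nystroem(kernel="rbf", gamma=gamma, n_components=m, random_state=0)
Z_train = embedding.fit_transform(X_train)
kmeans = MiniBatchKMeans(n_clusters=10, init="k-means++", n_init=3, random_state=0)
kmeans.fit(Z_train)

# Held-out quantization error: mean squared distance of each unseen embedded point
# to its nearest centroid.
Z_test = embedding.transform(X_test)
dist2 = euclidean_distances(Z_test, kmeans.cluster_centers_, squared=True)
W_test = dist2.min(axis=1).mean()
print(f"m = {m} landmarks, held-out quantization error = {W_test:.5f}")
```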