Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clustering

Authors: Romain Couillet, Florent Chatelain, Nicolas Le Bihan

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We notably prove, and empirically confirm on various real image databases, that it is possible to drastically puncture the data, thereby providing possibly huge computational and storage gains, for a virtually constant (clustering or PCA) performance. This preliminary study opens as such the path towards rethinking, from a large dimensional standpoint, computational and storage costs in elementary machine learning models." Also: "simulations on Fashion-MNIST and BigGAN-generated images qualitatively (and partially quantitatively) confirm our theoretical findings, justifying the possibility to drastically reduce computational cost with virtually no impairment on classification performance."
Researcher Affiliation | Academia | GIPSA-lab, CNRS, Grenoble-INP, University Grenoble-Alps; CentraleSupélec, University Paris-Saclay.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "All codes to reproduce our figures are available in the gitlab repository https://gricad-gitlab.univ-grenoble-alpes.fr/chatelaf/two-way-kernel-matrix-puncturing."
Open Datasets | Yes | Simulations use Fashion-MNIST and BigGAN-generated images; the paper cites Brock et al. (2018) for the BigGAN images.
Dataset Splits | No | The paper mentions synthetic data with specific n and p values, and real-world image datasets (Fashion-MNIST, BigGAN) with varying n values, but does not provide explicit training, validation, or test splits (e.g., percentages or counts).
Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | Two puncturing approaches are compared: (i) reducing the cost of the inner products x_iᵀx_j via a 5-fold random puncturing of the data vectors x_i (ε_S = 0.2 while ε_B = 1), versus (ii) a 25-fold puncturing of the matrix (1/p)XᵀX (ε_B = 0.04 while ε_S = 1). The reported scenario is interesting in that ε_B ε_S² c_0⁻¹ is purposely kept constant in both cases (ε_B ε_S² = 0.04 either way); as such, while the matrices K and their spectra differ dramatically, the eigenvector v̂_2 is essentially the same in both. For synthetic data: n = 4 000 and p = 2 000 (so c_0 = 1/2), with x_i ∼ 0.4 N(µ_1, I_p) + 0.6 N(µ_2, I_p) for [µ_1ᵀ, µ_2ᵀ]ᵀ ∼ N(0, (1/p) [[20, 12], [12, 30]] ⊗ I_p), and diagonal puncturing weight b = 0. For real-world images: p = 4 096 VGG features of randomly BigGAN-generated images, and p = 784-dimensional real-world images from the Fashion-MNIST dataset. (Minimal sketches of the synthetic and Fashion-MNIST setups follow the table.)
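
To make the Experiment Setup row concrete, here is a minimal NumPy sketch of the two puncturing schemes on the paper's synthetic two-class model. The function names (puncture_S, puncture_B), the mean-thresholding rule, and the reading of v̂_2 as the eigenvector of the second-largest eigenvalue are our illustration choices; exact normalizations may differ from the authors' gitlab code, which remains the reference implementation.

import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)

# Paper's synthetic setup: n = 4000 samples, p = 2000 features, c0 = p/n = 1/2.
n, p = 4000, 2000

# Class means: [mu1^T, mu2^T]^T ~ N(0, (1/p) [[20, 12], [12, 30]] (x) I_p),
# i.e. each coordinate pair (mu1[k], mu2[k]) is an i.i.d. 2-D Gaussian.
C = np.array([[20.0, 12.0], [12.0, 30.0]]) / p
M = rng.multivariate_normal(np.zeros(2), C, size=p)    # shape (p, 2)
mu1, mu2 = M[:, 0], M[:, 1]

# Mixture x_i ~ .4 N(mu1, I_p) + .6 N(mu2, I_p), stored columnwise in X.
labels = (rng.random(n) < 0.6).astype(int)             # class 2 with prob .6
X = rng.standard_normal((p, n))
X += np.where(labels == 1, mu2[:, None], mu1[:, None])

def puncture_S(X, eps_S, rng):
    """S-puncturing: zero out each data entry independently with prob 1 - eps_S."""
    return X * (rng.random(X.shape) < eps_S)

def puncture_B(K, eps_B, rng, b=0.0):
    """B-puncturing: keep each off-diagonal kernel entry with prob eps_B,
    symmetrically, and set the diagonal to b (the paper's setup uses b = 0)."""
    mask = np.triu(rng.random(K.shape) < eps_B, k=1)
    K = K * (mask | mask.T)
    np.fill_diagonal(K, b)
    return K

# Scenario (i): 5-fold data puncturing (eps_S = .2, eps_B = 1).
S = puncture_S(X, 0.2, rng)
K_i = (S.T @ S) / p

# Scenario (ii): 25-fold kernel puncturing (eps_S = 1, eps_B = .04).
K_ii = puncture_B((X.T @ X) / p, 0.04, rng)

# In both cases eps_B * eps_S^2 = 0.04: the invariant the paper keeps constant.
for name, K in [("(i) S-punctured", K_i), ("(ii) B-punctured", K_ii)]:
    vals, vecs = eigsh(K, k=2, which="LA")             # two largest eigenpairs
    v2 = vecs[:, np.argsort(vals)[-2]]                 # second-largest eigenvalue
    pred = (v2 > v2.mean()).astype(int)                # threshold the eigenvector
    acc = max(np.mean(pred == labels), np.mean(pred != labels))
    print(f"{name}: clustering accuracy from v2 = {acc:.3f}")

The point the table makes should be visible here: the spectra of K_i and K_ii differ strongly, yet thresholding v̂_2 yields essentially the same partition in both scenarios.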
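The real-data experiments follow the same pattern. Below is a hedged sketch applying the same two-way puncturing to a two-class subset of Fashion-MNIST (p = 784); the keras loader, the class pair (t-shirt vs trouser), the subsample size, and the centering step are our stand-ins, not necessarily what the authors' code does.

import numpy as np
from scipy.sparse.linalg import eigsh
from tensorflow.keras.datasets import fashion_mnist   # stand-in loader

rng = np.random.default_rng(0)

# Two-class subset of Fashion-MNIST, flattened to p = 784 as in the paper.
(x, y), _ = fashion_mnist.load_data()
idx = np.flatnonzero(np.isin(y, (0, 1)))          # hypothetical class pair
idx = rng.choice(idx, size=2000, replace=False)   # subsample to keep the demo light
X = x[idx].reshape(len(idx), -1).astype(float).T / 255.0   # (p, n) = (784, 2000)
labels = (y[idx] == 1).astype(int)
X -= X.mean(axis=1, keepdims=True)                # center each feature

p = X.shape[0]

# Scenario (i) from the table: 5-fold data puncturing (eps_S = .2, eps_B = 1).
eps_S, eps_B = 0.2, 1.0
S = X * (rng.random(X.shape) < eps_S)             # S-puncturing of the data entries
K = (S.T @ S) / p
mask = np.triu(rng.random(K.shape) < eps_B, k=1)
K = K * (mask | mask.T)                           # B-puncturing (no-op here, eps_B = 1)
np.fill_diagonal(K, 0.0)                          # b = 0, as in the paper's setup

# Cluster from a top eigenvector of K, as in the synthetic sketch above.
vals, vecs = eigsh(K, k=2, which="LA")
v2 = vecs[:, np.argsort(vals)[-2]]
pred = (v2 > v2.mean()).astype(int)
print("clustering accuracy:", max(np.mean(pred == labels), np.mean(pred != labels)))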