Learning Representation from Neural Fisher Kernel with Low-rank Approximation

Authors: Ruixiang Zhang, Shuangfei Zhai, Etai Littwin, Joshua M. Susskind

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate NFK in the following settings. We first evaluate the proposed low-rank kernel approximation algorithm (Sec. 3.2) in terms of both approximation accuracy and running-time efficiency. Next, we evaluate NFK on various representation learning tasks in supervised, semi-supervised, and unsupervised learning settings.
Researcher Affiliation | Collaboration | Ruixiang Zhang (Mila, Université de Montréal; ruixiang.zhang@umontreal.ca); Shuangfei Zhai, Etai Littwin, Josh Susskind (Apple Inc.; {szhai,elittwin,jsusskind}@apple.com)
Pseudocode | Yes | Algorithm 1, "Baseline method: compute low-rank NFK feature embedding" (a hedged sketch of this baseline appears after the table)
Open Source Code | No | The paper does not provide a specific repository link or an explicit statement about releasing the source code for the methodology described.
Open Datasets | Yes | We present our results on CIFAR-10 (Krizhevsky et al., 2009a) in Table 1. ... We evaluate our method on CIFAR-10 (Krizhevsky et al., 2009a) and SVHN datasets (Krizhevsky et al., 2009b).
Dataset Splits | No | The paper mentions using well-known datasets like CIFAR-10 and SVHN but does not explicitly state the train/validation/test dataset splits (e.g., percentages or sample counts) within the main text or appendices for reproducibility.
Hardware Specification | No | The paper mentions …
Software Dependencies | No | The paper mentions using "Jax (Bradbury et al., 2018)", the "neural-tangents (Novak et al., 2020) library", and "sklearn.decomposition.TruncatedSVD", but does not specify exact version numbers for these software dependencies, which is required for reproducibility.
Experiment Setup | Yes | For the Neural Fisher Kernel Distillation (NFKD) experiments, ... We run 10 power iterations to compute the SVD approximation of the NFK of the teacher model, to obtain the top-20 eigenvectors and eigenvalues. Then we train the student model with the additional NFKD distillation loss using mini-batch stochastic gradient descent with 0.9 momentum for 250 epochs. The initial learning rate is 0.1, and we decay it by a factor of 0.1 at the 150th epoch and again at the 200th epoch. (Hedged sketches of the power-iteration step and the optimizer schedule appear after the table.)
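
The Pseudocode row refers to Algorithm 1, the baseline method for computing a low-rank NFK feature embedding. Below is a minimal sketch of that baseline, assuming a scalar-output model apply_fn(params, x) applied to a single example and replacing the Fisher normalization with a simple 1/sqrt(n) scaling; the function names and scaling are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the baseline low-rank NFK embedding (cf. Algorithm 1).
# Assumptions: apply_fn(params, x) returns a scalar for a single example x,
# and the Fisher normalization is approximated by a 1/sqrt(n) scaling.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree


def per_example_grad_features(apply_fn, params, xs):
    """Stack the flattened per-example gradients d f(x; params) / d params."""
    flat_params, unravel = ravel_pytree(params)

    def f_single(flat_p, x):
        return apply_fn(unravel(flat_p), x)

    grad_fn = jax.grad(f_single)  # gradient w.r.t. the flat parameter vector
    return jax.vmap(grad_fn, in_axes=(None, 0))(flat_params, xs)  # (n, num_params)


def low_rank_nfk_embedding(apply_fn, params, xs, k=20):
    """Top-k NFK feature embedding via a truncated SVD of the feature matrix."""
    V = per_example_grad_features(apply_fn, params, xs) / jnp.sqrt(xs.shape[0])
    U, s, _ = jnp.linalg.svd(V, full_matrices=False)
    return U[:, :k] * s[:k]  # (n, k) embedding of the n examples
```

Materializing the full n × num_params feature matrix is exactly what the paper's low-rank approximation algorithm (Sec. 3.2) is designed to avoid; the next sketch shows the generic power-iteration idea behind such an approximation.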
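The experiment-setup row quotes "10 power iterations to compute the SVD approximation of the NFK of the teacher model." The snippet below is a generic randomized subspace (block power) iteration for the top-k eigenpairs of a symmetric positive semi-definite kernel matrix accessed only through matrix-vector products; it illustrates the idea but is not claimed to reproduce the exact procedure of Sec. 3.2.

```python
# Generic subspace (block power) iteration for the top-k eigenpairs of an
# n x n symmetric PSD kernel matrix K, accessed only through matvec(Q) = K @ Q.
import numpy as np


def top_k_eigenpairs(matvec, n, k=20, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # random orthonormal start
    for _ in range(iters):  # e.g. 10 iterations, as quoted above
        Q, _ = np.linalg.qr(matvec(Q))  # power step + re-orthonormalization
    # Rayleigh-Ritz: solve the small k x k projected eigenproblem.
    evals, evecs = np.linalg.eigh(Q.T @ matvec(Q))
    order = np.argsort(evals)[::-1]
    return evals[order], Q @ evecs[:, order]
```

With an explicit feature matrix V of shape (n, p), one would pass matvec = lambda Q: V @ (V.T @ Q); in the NFK setting the same product can be computed with Jacobian-vector and vector-Jacobian products so that V is never stored.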
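Finally, the quoted optimizer settings (SGD with 0.9 momentum, 250 epochs, learning rate 0.1 decayed by a factor of 0.1 at epochs 150 and 200) can be written as a step-indexed schedule. The sketch below uses optax and a hypothetical steps_per_epoch value; it illustrates the stated schedule only and omits the NFKD distillation loss.

```python
# Sketch of the quoted NFKD student-training schedule using optax.
import optax

steps_per_epoch = 391  # assumed: 50,000 CIFAR-10 training images / batch size 128

lr_schedule = optax.piecewise_constant_schedule(
    init_value=0.1,
    boundaries_and_scales={
        150 * steps_per_epoch: 0.1,  # multiply the learning rate by 0.1 at epoch 150
        200 * steps_per_epoch: 0.1,  # and again at epoch 200
    },
)
optimizer = optax.sgd(learning_rate=lr_schedule, momentum=0.9)
# Train for 250 epochs, i.e. 250 * steps_per_epoch optimizer steps.
```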