Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels

Authors: Stefani Karp, Ezra Winston, Yuanzhi Li, Aarti Singh

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We supplement our theoretical results by demonstrating this phenomenon empirically: in CIFAR-10 and MNIST images with various backgrounds, as the background noise increases in intensity, a CNN's performance stays relatively robust, whereas its corresponding neural tangent kernel sees a notable drop in performance.
Researcher Affiliation | Collaboration | Stefani Karp (Carnegie Mellon University and Google Research, shkarp@cs.cmu.edu); Ezra Winston (Carnegie Mellon University, ewinston@cs.cmu.edu); Yuanzhi Li (Carnegie Mellon University, yuanzhil@cs.cmu.edu); Aarti Singh (Carnegie Mellon University, aarti@cs.cmu.edu)
Pseudocode | Yes | Algorithm 1: Mini-batch SGD (see the mini-batch SGD sketch after the table)
Open Source Code | Yes | Code for experiments is available at https://github.com/skarp/local-signal-adaptivity.
Open Datasets | Yes | We create new datasets by embedding CIFAR-10 and MNIST images within either random Gaussian or ImageNet backgrounds. ... CIFAR-10 [Krizhevsky, 2009] ... MNIST [LeCun et al., 2010] ... ImageNet backgrounds [Deng et al., 2009]. (See the dataset-construction sketch after the table.)
Dataset Splits | No | The paper does not explicitly provide training, validation, or test split percentages or counts in the main text.
Hardware Specification | No | The paper describes the models used (e.g., '10-layer Wide ResNet', 'small CNN') but does not specify the hardware used for training or inference, such as GPU or CPU models.
Software Dependencies | No | The paper mentions software such as Neural Tangents and JAX but does not provide version numbers for these or other dependencies. (See the NTK sketch after the table.)
Experiment Setup | Yes | We initialize b deterministically at 0. We initialize w randomly by drawing from N(0, σ_0^2 I_{d×d}), where σ_0 is 1/poly(k). We train the above CNN using mini-batch stochastic gradient descent (SGD) with the logistic loss... We adopt a 1/poly(k) learning rate for w, and we set η_b/η_w = 1/k. ... Sample a mini-batch of examples of size n = poly(k)... T = poly(k) iterations. (See the mini-batch SGD sketch after the table.)
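
The dataset construction quoted in the Open Datasets row can be illustrated with a short sketch. This is not the authors' released code (that lives in the linked repository); the canvas size, noise standard deviation, and random-placement policy below are assumptions chosen for illustration.

```python
# Minimal sketch of the planted-signal dataset: a CIFAR-10/MNIST image is pasted
# at a random location inside a larger background that is either Gaussian noise
# or a crop of an ImageNet image. Canvas size, noise scale, and placement are
# assumptions, not values taken from the paper.
import numpy as np

def embed_in_background(img, canvas_size=64, noise_std=1.0, background=None, rng=None):
    """img: (H, W, C) float array; returns a (canvas_size, canvas_size, C) array."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = img.shape
    if background is None:
        # Gaussian background; noise_std plays the role of "background intensity".
        canvas = rng.normal(0.0, noise_std, size=(canvas_size, canvas_size, c))
    else:
        # ImageNet background: assumed pre-resized/cropped to at least canvas_size.
        canvas = np.array(background, dtype=float)[:canvas_size, :canvas_size, :c]
    # Paste the signal patch at a uniformly random location.
    top = rng.integers(0, canvas_size - h + 1)
    left = rng.integers(0, canvas_size - w + 1)
    canvas[top:top + h, left:left + w, :] = img
    return canvas
```

Sweeping noise_std corresponds to the "background noise increases in intensity" axis along which the CNN-versus-NTK comparison is reported.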
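The Experiment Setup row describes the training procedure (Algorithm 1: mini-batch SGD with logistic loss, b initialized at 0, w drawn from N(0, σ_0^2 I), and η_b/η_w = 1/k). The sketch below illustrates only that update rule, using a plain linear predictor in place of the paper's CNN; the poly(k)-scaled constants (σ_0, η_w, n, T) are left as arguments.

```python
# Illustration of the mini-batch SGD update rule under the reported setup.
# The linear model is a stand-in: the paper trains a CNN, so treat this as a
# sketch of the optimizer, not of the architecture.
import numpy as np

def minibatch_sgd(X, y, k, sigma_0, eta_w, T, batch_size, rng=None):
    """X: (N, d) inputs; y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = X.shape
    w = rng.normal(0.0, sigma_0, size=d)   # w ~ N(0, sigma_0^2 I_d)
    b = 0.0                                # b initialized deterministically at 0
    eta_b = eta_w / k                      # eta_b / eta_w = 1/k
    for _ in range(T):
        idx = rng.choice(N, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        margins = yb * (Xb @ w + b)
        # Gradient of the logistic loss log(1 + exp(-margin)) w.r.t. the margin.
        coeff = -yb / (1.0 + np.exp(margins))
        grad_w = (coeff[:, None] * Xb).mean(axis=0)
        grad_b = coeff.mean()
        w -= eta_w * grad_w
        b -= eta_b * grad_b
    return w, b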
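The NTK baseline is reported to use Neural Tangents on top of JAX, with no versions given. The sketch below shows how such a baseline is typically set up with the neural_tangents stax API; the small Conv-ReLU-Dense architecture here is a stand-in assumption, not the paper's 10-layer Wide ResNet or its small CNN, and the exact API may differ across library versions.

```python
# Hypothetical NTK baseline via the Neural Tangents library (architecture and
# hyperparameters are illustrative assumptions, not the paper's configuration).
import neural_tangents as nt
from neural_tangents import stax

# Infinite-width CNN: Conv -> ReLU -> Flatten -> Dense readout.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Conv(64, (3, 3), padding='SAME'),
    stax.Relu(),
    stax.Flatten(),
    stax.Dense(10),
)

def ntk_predictions(x_train, y_train, x_test):
    """Closed-form NTK regression (gradient flow on MSE) on NHWC image batches."""
    predict_fn = nt.predict.gradient_descent_mse_ensemble(
        kernel_fn, x_train, y_train, diag_reg=1e-4)
    return predict_fn(x_test=x_test, get='ntk')
```

Evaluating such a kernel predictor on the embedded-background test sets is the kind of comparison behind the reported CNN-versus-NTK gap as background intensity grows.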