Neural Networks as Kernel Learners: The Silent Alignment Effect

Authors: Alexander Atanasov, Blake Bordelon, Cengiz Pehlevan

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that such an effect takes place in homogeneous neural networks with small initialization and whitened data. We provide an analytical treatment of this effect in the fully connected linear network case. In general, we find that the kernel develops a low-rank contribution in the early phase of training, and then evolves in overall scale, yielding a function equivalent to a kernel regression solution with the final network's tangent kernel. The early spectral learning of the kernel depends on the depth. We also demonstrate that non-whitened data can weaken the silent alignment effect. (See the kernel-target alignment sketch after the table.)
Researcher Affiliation | Academia | Alexander Atanasov, Blake Bordelon & Cengiz Pehlevan, Harvard University, Cambridge, MA 02138, USA. {atanasov,blake_bordelon,cpehlevan}@g.harvard.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We trained a 2-layer ReLU MLP on P = 1000 MNIST images of handwritten 0s and 1s which were whitened. Early in training, around t ≈ 50, the NTK aligns to the target function and stays fixed (green). The kernel's overall scale (orange) and the loss (blue) begin to move at around t = 300. The analytic solution for the maximal final alignment value in linear networks is overlaid (dashed green); see Appendix E.2. (b) We compare the predictions of the NTK and the trained network on MNIST test points. Due to silent alignment, the final learned function is well described as a kernel regression solution with the final NTK K. However, regression with the initial NTK is not a good model of the network's predictions. (c) The same experiment on P = 1000 whitened CIFAR-10 images from the first two classes. Here we use MSE loss on a width-100 network with initialization scale σ = 0.1. (See the whitening sketch after the table.)
Dataset Splits | No | The paper provides specific training and test set details, but does not explicitly mention or detail a separate validation dataset split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software like the Adam optimizer and the Neural Tangents API, but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We trained a 2-layer ReLU MLP on P = 1000 MNIST images of handwritten 0s and 1s which were whitened. The same experiment was run on P = 1000 whitened CIFAR-10 images from the first two classes, using MSE loss on a width-100 network with initialization scale σ = 0.1. (See the training and final-NTK regression sketch after the table.)
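
The "alignment" referenced in the Research Type row is a similarity between the NTK Gram matrix and the training targets. Below is a minimal sketch of the standard kernel-target alignment measure (in the style of Cristianini et al.); the function name and the exact normalization are illustrative assumptions and may differ in detail from the quantity plotted in the paper.

```python
import jax.numpy as jnp

def kernel_target_alignment(K, y):
    """Cosine similarity between the Gram matrix K (shape P x P) and the
    rank-one target matrix y y^T, where y holds the P training labels."""
    yKy = y @ K @ y                              # <K, y y^T>_Frobenius
    return yKy / (jnp.linalg.norm(K) * (y @ y))  # ||y y^T||_F = y^T y
```

An alignment near 1 means the kernel's top eigendirection is dominated by the target, which is the regime in which kernel regression with that kernel reproduces the learned function.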
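The Open Datasets row quotes experiments on whitened MNIST and CIFAR-10 subsets. A minimal ZCA-style whitening sketch is below, assuming the P = 1000 selected images are already loaded into a (P, d) array; the loading step, variable names, and the eps regularizer are assumptions, not taken from the paper.

```python
import jax.numpy as jnp

def zca_whiten(X, eps=1e-5):
    """Return a zero-mean copy of X (shape (P, d)) with identity covariance.
    eps regularizes near-zero eigenvalues (an implementation choice, not
    specified in the paper)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    evals, evecs = jnp.linalg.eigh(cov)
    W = evecs @ jnp.diag(1.0 / jnp.sqrt(evals + eps)) @ evecs.T
    return Xc @ W

# Assumed to exist: X_mnist, a (1000, 784) array of the selected 0s and 1s.
# X_white = zca_whiten(X_mnist)
```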
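The Experiment Setup row describes a 2-layer ReLU MLP of width 100 with small initialization scale σ = 0.1, trained with MSE loss, whose final predictions are compared against kernel regression with the final empirical NTK. Below is a minimal JAX sketch of that comparison; the parameterization, learning rate, ridge term, and helper names are assumptions, not the authors' code.

```python
import jax
import jax.numpy as jnp

def init_params(key, d_in, width=100, sigma=0.1):
    # 2-layer ReLU MLP with small initialization scale sigma; the exact
    # parameterization (standard, 1/sqrt(fan-in)) is an assumption.
    k1, k2 = jax.random.split(key)
    W1 = sigma * jax.random.normal(k1, (width, d_in)) / jnp.sqrt(d_in)
    W2 = sigma * jax.random.normal(k2, (1, width)) / jnp.sqrt(width)
    return (W1, W2)

def f(params, X):
    W1, W2 = params
    return (jax.nn.relu(X @ W1.T) @ W2.T).squeeze(-1)

def mse(params, X, y):
    return 0.5 * jnp.mean((f(params, X) - y) ** 2)

def empirical_ntk(params, X1, X2):
    # K(x, x') = <df/dtheta(x), df/dtheta(x')>, from per-example Jacobians.
    grad_fn = lambda x: jax.grad(lambda p: f(p, x[None]).squeeze())(params)
    J1, J2 = jax.vmap(grad_fn)(X1), jax.vmap(grad_fn)(X2)
    flat = lambda J: jnp.concatenate(
        [j.reshape(j.shape[0], -1) for j in jax.tree_util.tree_leaves(J)], axis=1)
    return flat(J1) @ flat(J2).T

@jax.jit
def step(params, X, y, lr=1.0):
    # One full-batch gradient descent step (learning rate is a placeholder).
    grads = jax.grad(mse)(params, X, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

def final_ntk_regression(params_final, X_train, y_train, X_test, ridge=1e-8):
    # Kernel regression with the *final* NTK, to compare against the trained net.
    K = empirical_ntk(params_final, X_train, X_train)
    k_star = empirical_ntk(params_final, X_test, X_train)
    alpha = jnp.linalg.solve(K + ridge * jnp.eye(K.shape[0]), y_train)
    return k_star @ alpha
```

A driver loop would call step repeatedly until the loss is small, then compare f(params, X_test) with final_ntk_regression(params, X_train, y_train, X_test); under silent alignment the two should nearly coincide, while regression with the initial NTK should not.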