Asymptotics of Learning with Deep Structured (Random) Features

Authors: Dominik Schröder, Daniil Dmitriev, Hugo Cui, Bruno Loureiro

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically find that our theoretical characterization captures well the learning curves of some networks trained by gradient descent in the lazy regime. Fig. 2 contrasts the test error achieved by linear regression (red), and regression on the feature map associated to a three-layer student at initialization (green) and after 3000 epochs of end-to-end training using full-batch Adam (Kingma & Ba, 2014) at learning rate 10⁻⁴ and weight decay 10⁻³ over n₀ = 1400 training samples (blue). Real data: We observe that the theoretical predictions of Theorem 3.1 also capture the learning curves of trained networks on some real datasets, when retraining the readout only using ridge regression, provided the feature covariances Ω, Φ, Ψ are estimated from data. Fig. 3 contrasts the theoretical characterization of Theorem 3.1 with numerical experiments on MNIST (LeCun et al., 1998). (A hedged sketch of a synthetic random-feature learning-curve comparison in the spirit of Fig. 2 is given after the table.)
Researcher Affiliation | Academia | ¹Department of Mathematics, ETH Zurich, 8006 Zürich, Switzerland; ²Department of Mathematics, ETH Zurich and ETH AI Center, 8092 Zürich, Switzerland; ³Statistical Physics of Computation Lab., Institute of Physics, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland; ⁴Département d'Informatique, École Normale Supérieure (ENS) PSL & CNRS, F-75230 Paris cedex 05, France.
Pseudocode | No | The paper focuses on mathematical derivations and theoretical analysis, and does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code: The code for the numerical experiments described in Appendix C is openly available in this repository.
Open Datasets | Yes | We use the MNIST data set which we normalize by pixel-wise centering and global scaling to ensure unit variance. For each normalized image x_i ∈ ℝ^{784} we define a label y_i = ±1 according to whether x_i is an even or an odd digit. We split the data set into four parts: 10% test data I_test, 25% training data for the Adam optimizer I_Adam, 25% training data for regression I_reg, 40% data for approximating the (empirical) population covariance I_emp. (A hedged preprocessing and splitting sketch is given after the table.)
Dataset Splits | Yes | We split the data set into four parts: 10% test data I_test, 25% training data for the Adam optimizer I_Adam, 25% training data for regression I_reg, 40% data for approximating the (empirical) population covariance I_emp.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies | No | Neural network: We then train a simple neural network of the form x ↦ θᵀφ(x), φ(x) := relu(W₂ relu(W₁x)), with W₁ ∈ ℝ^{2352×784}, W₂ ∈ ℝ^{2352×2352}, θ ∈ ℝ^{2352}, using the Adam optimizer over 120 epochs with a batch size of 128 using only the I_Adam split. The paper mentions the 'Tensorflow implementation of the Adam' optimizer but does not specify version numbers for TensorFlow or any other libraries.
Experiment Setup | Yes | Neural network: We then train a simple neural network of the form x ↦ θᵀφ(x), φ(x) := relu(W₂ relu(W₁x)), with W₁ ∈ ℝ^{2352×784}, W₂ ∈ ℝ^{2352×2352}, θ ∈ ℝ^{2352}, using the Adam optimizer over 120 epochs with a batch size of 128 using only the I_Adam split. During training we save the feature maps φ_t at various time steps t in order to study the training dynamics. ... using the Adam (Kingma & Ba, 2014) optimizer, over 120 epochs with batch size 128 at learning rate 10⁻⁴ and weight decay 10⁻³ over n₀ = 1400 training samples (blue). (Hedged sketches of this network, its training, and the ridge-regression readout retraining follow the table.)
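
To make the synthetic setting behind Fig. 2 concrete, the following is a minimal sketch of ridge regression on a deep structured random feature map at initialization, φ(x) = relu(W₂ relu(W₁x)), compared against plain linear ridge regression on Gaussian data with a linear teacher. The dimensions, widths, ridge penalty `lam`, and teacher are illustrative choices, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, widths = 200, (300, 300)      # input dimension and hidden widths (illustrative)
lam = 1e-3                       # ridge penalty (illustrative)

# Deep structured random feature map at initialization: phi(x) = relu(W2 relu(W1 x)).
W1 = rng.standard_normal((widths[0], d)) / np.sqrt(d)
W2 = rng.standard_normal((widths[1], widths[0])) / np.sqrt(widths[0])
phi = lambda X: np.maximum(W2 @ np.maximum(W1 @ X.T, 0.0), 0.0).T

# Gaussian inputs with a noiseless linear teacher.
theta_star = rng.standard_normal(d) / np.sqrt(d)
def sample(n):
    X = rng.standard_normal((n, d))
    return X, X @ theta_star

X_test, y_test = sample(4000)

def ridge_fit_predict(F_train, y_train, F_test):
    """Ridge-regression readout trained on features F_train, evaluated on F_test."""
    p = F_train.shape[1]
    w = np.linalg.solve(F_train.T @ F_train + lam * np.eye(p), F_train.T @ y_train)
    return F_test @ w

# Learning curve: test error as a function of the number of training samples.
for n in (100, 200, 400, 800, 1600):
    X_train, y_train = sample(n)
    err_lin = np.mean((ridge_fit_predict(X_train, y_train, X_test) - y_test) ** 2)
    err_rf = np.mean((ridge_fit_predict(phi(X_train), y_train, phi(X_test)) - y_test) ** 2)
    print(f"n={n:5d}  linear ridge: {err_lin:.3f}  deep random features: {err_rf:.3f}")
```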
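
For the real-data experiments, the quoted preprocessing and 10/25/25/40 split could be reproduced roughly as below. This is a sketch only: the MNIST loader, the random seed, the ±1 sign convention for even versus odd digits, and the variable names are assumptions, not taken from the authors' repository.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

# Load all 70,000 MNIST images and flatten them to vectors in R^784.
(x_tr, y_tr), (x_te, y_te) = mnist.load_data()
X = np.concatenate([x_tr, x_te]).reshape(-1, 784).astype(np.float64)
digits = np.concatenate([y_tr, y_te])

# Pixel-wise centering followed by a single global rescaling to unit variance.
X -= X.mean(axis=0, keepdims=True)
X /= X.std()

# Parity labels in {-1, +1}; the sign convention here is an assumption.
y = np.where(digits % 2 == 1, 1.0, -1.0)

# 10% / 25% / 25% / 40% split: I_test, I_Adam, I_reg, I_emp.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
cuts = np.cumsum([int(0.10 * len(X)), int(0.25 * len(X)), int(0.25 * len(X))])
I_test, I_Adam, I_reg, I_emp = np.split(perm, cuts)
```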
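
A minimal TensorFlow/Keras sketch of the quoted two-hidden-layer ReLU network x ↦ θᵀ relu(W₂ relu(W₁x)), trained with Adam for 120 epochs at batch size 128 on the I_Adam split. The paper does not report library versions; this sketch assumes TensorFlow ≥ 2.11 (for tf.keras.optimizers.AdamW), reads "weight decay 10⁻³" as decoupled weight decay, and reuses X, y, and the index splits from the preprocessing sketch above.

```python
import tensorflow as tf

# Two bias-free ReLU hidden layers of width 2352 and a linear readout theta,
# matching the quoted form x -> theta^T relu(W2 relu(W1 x)).
inputs = tf.keras.Input(shape=(784,))
h1 = tf.keras.layers.Dense(2352, activation="relu", use_bias=False)(inputs)    # W1
features = tf.keras.layers.Dense(2352, activation="relu", use_bias=False)(h1)  # W2
outputs = tf.keras.layers.Dense(1, use_bias=False)(features)                   # theta
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-3),
    loss="mse",
)

# Train end to end on the I_Adam split only (X, y, I_Adam from the sketch above).
model.fit(X[I_Adam], y[I_Adam], epochs=120, batch_size=128, verbose=0)

# The feature map phi_t(x) = relu(W2 relu(W1 x)) is the output of the second hidden
# layer; snapshots at intermediate epochs would be saved via a callback in practice.
feature_map = tf.keras.Model(inputs, features)
```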
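
The "retraining the readout only using ridge regression" step then amounts to a linear ridge fit on the frozen features, with the covariances estimated on the held-out 40% split. This continues the two sketches above (it reuses feature_map, X, y, and the index splits) and uses an illustrative ridge penalty; it is not the authors' exact evaluation pipeline from Appendix C.

```python
import numpy as np

def ridge_readout(Phi, targets, lam=1e-3):
    """Ridge-regression readout theta on frozen features Phi (lam is illustrative)."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ targets)

# Frozen features of the trained network on the regression, test, and covariance splits.
Phi_reg = feature_map.predict(X[I_reg], verbose=0)
Phi_test = feature_map.predict(X[I_test], verbose=0)
Phi_emp = feature_map.predict(X[I_emp], verbose=0)

# Readout-only retraining and its test error.
theta = ridge_readout(Phi_reg, y[I_reg])
test_mse = np.mean((Phi_test @ theta - y[I_test]) ** 2)
print("readout-only ridge test error:", test_mse)

# Empirical estimate of the feature covariance (the analogue of Omega in Theorem 3.1),
# computed on the dedicated 40% split I_emp; the remaining covariances Phi and Psi
# from the theorem would be estimated analogously.
centered = Phi_emp - Phi_emp.mean(axis=0, keepdims=True)
Omega_hat = centered.T @ centered / len(Phi_emp)
```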