Bayes-optimal Learning of Deep Random Networks of Extensive Width

Authors: Hugo Cui, Florent Krzakala, Lenka Zdeborová

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples. We provide a numerical exploration of the regime where the number of samples n tends to infinity faster than linearly with the input dimension d... Fig. 1 shows the Bayes MSE, eq. (12)... This is contrasted to the MSE achieved by an expressive neural network (NN)... Fig. 4 contrasts the MSE of an Adam-optimized neural network, optimally regularized ridge regression, and optimally regularized arc-cosine kernel regression... (A hedged ridge-baseline sketch follows the table.)
Researcher Affiliation | Academia | 1. Statistical Physics of Computation lab, Institute of Physics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland; 2. Information Learning and Physics lab, Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland.
Pseudocode | No | The paper contains no structured pseudocode or algorithm blocks.
Open Source Code | Yes | A repository with the code employed in the present work can be found here.
Open Datasets | No | We consider the problem of learning from a train set $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^n$, with $n$ independently sampled Gaussian covariates $x^\mu \in \mathbb{R}^d \sim \mathcal{N}(0, \Sigma)$. The paper uses synthetically generated data rather than a publicly available dataset with a specific name or access information. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper does not explicitly describe training/test/validation splits or a cross-validation setup. It mentions a 'train set' and 'test sample' but no further breakdown or methodology for splitting.
Hardware Specification | No | The paper performs numerical simulations but does not provide any specific hardware details such as CPU or GPU models, or cloud computing specifications.
Software Dependencies | No | The paper mentions 'full-batch gradient descent' and the 'Adam' optimizer but does not specify any software libraries or their version numbers (e.g., PyTorch, TensorFlow, scikit-learn).
Experiment Setup | Yes | Green dots represent simulations for a neural network with one (top) or two (bottom) hidden layers of width 1500, optimized with full-batch GD, learning rate η = 8·10⁻³ and weight decay λ = 0.1. Green dots show the test error of a three-layer fully connected network trained end-to-end with full-batch Adam, learning rate 0.003 and weight decay 0.01, after 2000 epochs. Purple dots indicate the MSE of a two-layer fully connected neural network of width k = 30 trained end-to-end using Adam, batch size n/3 and learning rate η = 3·10⁻³, over 2000 epochs. (A hedged training-loop sketch follows the table.)
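
The Open Datasets row quotes the paper's synthetic setup: Gaussian covariates labeled by a deep random network target. The sketch below is a minimal, hedged reconstruction of that generative process; the depth, tanh activation, and 1/sqrt(fan-in) weight scaling are illustrative assumptions, not values taken from the paper, while width proportional to d reflects the extensive-width regime in the title.

```python
import numpy as np

def sample_dataset(n, d, depth=2, width=None, seed=0):
    """Draw n Gaussian covariates x^mu ~ N(0, I_d) and label them with a
    random fully connected network target, mimicking D = {x^mu, y^mu}.
    Depth and activation are assumptions; width ~ d is the extensive-width
    regime."""
    rng = np.random.default_rng(seed)
    width = width or d                        # hidden width proportional to d
    X = rng.standard_normal((n, d))           # covariates x^mu in R^d
    h, fan_in = X, d
    for _ in range(depth):
        W = rng.standard_normal((fan_in, width)) / np.sqrt(fan_in)
        h = np.tanh(h @ W)                    # random hidden layer
        fan_in = width
    a = rng.standard_normal(width) / np.sqrt(width)
    y = h @ a                                 # scalar labels y^mu
    return X, y

X, y = sample_dataset(n=3000, d=100)
```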
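The Research Type row contrasts neural networks with optimally regularized ridge regression. Reusing X, y from the sketch above, the following is a minimal sketch of such a ridge baseline with its regularization strength selected on a held-out set; the train/test split and the regularization grid are assumptions, not the paper's protocol.

```python
# Closed-form ridge: w = (X^T X + lam*I)^{-1} X^T y, with lam chosen by
# held-out MSE (split and grid are illustrative assumptions).
Xtr, ytr, Xte, yte = X[:2000], y[:2000], X[2000:], y[2000:]
d = Xtr.shape[1]
best_mse = None
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)
    mse = np.mean((Xte @ w - yte) ** 2)
    best_mse = mse if best_mse is None else min(best_mse, mse)
print("optimally regularized ridge test MSE:", best_mse)
```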
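Finally, the Experiment Setup row quotes a two-layer network of width k = 30 trained with Adam, batch size n/3, learning rate 3·10⁻³, and weight decay 0.01 over 2000 epochs. A hedged sketch of that training loop follows; the paper names no library, so the use of PyTorch, the MSE loss, and the tanh activation are assumptions.

```python
import torch
import torch.nn as nn

# Two-layer fully connected network of width k = 30, trained end-to-end
# with Adam at the quoted hyperparameters; PyTorch and the loss choice
# are assumptions.
Xt = torch.tensor(X, dtype=torch.float32)
yt = torch.tensor(y, dtype=torch.float32)
model = nn.Sequential(nn.Linear(Xt.shape[1], 30), nn.Tanh(), nn.Linear(30, 1))
opt = torch.optim.Adam(model.parameters(), lr=3e-3, weight_decay=0.01)
loss_fn = nn.MSELoss()

n = Xt.shape[0]
batch = n // 3                                # quoted batch size n/3
for epoch in range(2000):                     # quoted 2000 epochs
    perm = torch.randperm(n)
    for i in range(0, n, batch):
        idx = perm[i:i + batch]
        opt.zero_grad()
        loss = loss_fn(model(Xt[idx]).squeeze(-1), yt[idx])
        loss.backward()
        opt.step()
```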