Bayes-optimal Learning of Deep Random Networks of Extensive-width
Authors: Hugo Cui, Florent Krzakala, Lenka Zdeborova
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further show numerically that when the number of samples grows faster than the dimension, ridge & kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples. We provide a numerical exploration of the regime where the number of samples n tends to infinity faster than linearly with the input dimension d... Fig. 1 shows the Bayes MSE, eq. (12)... This is contrasted to the MSE achieved by an expressive neural network (NN)... Fig. 4 contrasts the MSE of an Adam-optimized neural network, optimally regularized ridge regression, and optimally regularized arc-cosine kernel regression... (An illustrative ridge/kernel comparison sketch appears after the table.) |
| Researcher Affiliation | Academia | (1) Statistical Physics of Computation lab, Institute of Physics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland; (2) Information, Learning and Physics lab, Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. |
| Pseudocode | No | The paper contains no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | A repository with the code employed in the present work can be found here. |
| Open Datasets | No | We consider the problem of learning from a train set $\mathcal{D} = \{x^\mu, y^\mu\}_{\mu=1}^{n}$, with $n$ independently sampled Gaussian covariates $x^\mu \in \mathbb{R}^d \sim \mathcal{N}(0, \Sigma)$. The paper uses synthetically generated data rather than a publicly available dataset with a specific name or access information. (A data-generation sketch appears after the table.) |
| Dataset Splits | No | The paper does not explicitly describe training/test/validation splits or cross-validation setup. It mentions 'train set' and 'test sample' but no further breakdown or methodology for splitting. |
| Hardware Specification | No | The paper performs numerical simulations but does not provide any specific hardware details such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions 'full-batch gradient descent' and 'Adam' optimizer but does not specify any software libraries or their version numbers (e.g., PyTorch, TensorFlow, scikit-learn). |
| Experiment Setup | Yes | Green dots represent simulations for a one- (top) and two-hidden-layer (bottom) neural network of width 1500, optimized with full-batch GD, learning rate η = 8·10⁻³ and weight decay λ = 0.1. Green dots show the test error of a three-layer fully connected network trained end-to-end with full-batch Adam, learning rate 0.003 and weight decay 0.01, after 2000 epochs. Purple dots indicate the MSE of a two-layer fully connected neural network of width k = 30 trained end-to-end using Adam, with batch size n/3 and learning rate η = 3·10⁻³, over 2000 epochs. (A training-loop sketch for the two-layer Adam run appears after the table.) |
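
The synthetic data protocol quoted under "Open Datasets" can be made concrete with a minimal sketch. The identity covariance Σ = I_d, the tanh activation, and the specific hidden widths below are illustrative assumptions; the paper's random deep-network target is defined precisely in the text, and this code is only a stand-in for it.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_deep_target(d, widths, activation=np.tanh):
    """Random fully connected network with i.i.d. Gaussian weights.

    `widths` plays the role of the extensive hidden widths (proportional
    to d in the paper's setting); tanh is an assumed, illustrative activation.
    """
    dims = [d] + list(widths) + [1]
    weights = [rng.normal(size=(dims[i], dims[i + 1])) / np.sqrt(dims[i])
               for i in range(len(dims) - 1)]

    def f(x):
        h = x
        for W in weights[:-1]:
            h = activation(h @ W)
        return (h @ weights[-1]).ravel()

    return f

# Train set D = {x^mu, y^mu}_{mu=1}^n with Gaussian covariates x^mu ~ N(0, Sigma);
# Sigma = I_d is assumed here for simplicity.
d, n = 100, 500
target = random_deep_target(d, widths=[2 * d, 2 * d])
X = rng.normal(size=(n, d))   # covariates ~ N(0, I_d)
y = target(X)                 # labels from the random deep target (noiseless)
```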
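
The ridge-versus-kernel comparison referenced under "Research Type" can be sketched as follows, reusing `X` and `y` from the data-generation sketch above. The order-1 arc-cosine kernel (Cho & Saul, 2009) is used here as one plausible reading of "arc-cosine kernel regression", and "optimally regularized" is approximated by a grid search over the regularization strength; both are assumptions about details the quoted passage does not pin down.

```python
import numpy as np

def arccos_kernel(X1, X2):
    """Order-1 arc-cosine kernel (Cho & Saul, 2009)."""
    n1 = np.linalg.norm(X1, axis=1, keepdims=True)
    n2 = np.linalg.norm(X2, axis=1, keepdims=True)
    cos = np.clip((X1 @ X2.T) / (n1 * n2.T), -1.0, 1.0)
    theta = np.arccos(cos)
    return (n1 * n2.T) / np.pi * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def kernel_ridge_mse(K_tr, K_te, y_tr, y_te, lam):
    """Test MSE of kernel ridge regression at regularization strength lam."""
    alpha = np.linalg.solve(K_tr + lam * np.eye(len(y_tr)), y_tr)
    return np.mean((K_te @ alpha - y_te) ** 2)

# Split the synthetic data from the previous sketch into train/test parts.
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]
lams = np.logspace(-6, 2, 20)

# Ridge regression corresponds to the linear kernel K = X X^T.
K_lin_tr, K_lin_te = X_tr @ X_tr.T, X_te @ X_tr.T
K_arc_tr, K_arc_te = arccos_kernel(X_tr, X_tr), arccos_kernel(X_te, X_tr)

best_ridge = min(kernel_ridge_mse(K_lin_tr, K_lin_te, y_tr, y_te, l) for l in lams)
best_arccos = min(kernel_ridge_mse(K_arc_tr, K_arc_te, y_tr, y_te, l) for l in lams)
print(f"ridge MSE: {best_ridge:.3f}  arc-cosine kernel MSE: {best_arccos:.3f}")
```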
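
The two-layer, width-30 Adam run quoted in the "Experiment Setup" row can be outlined as below. PyTorch is an assumption (the paper does not name its framework, per the "Software Dependencies" row), as are the ReLU student activation and the placeholder synthetic teacher; the learning rate 3·10⁻³, width k = 30, batch size n/3, and 2000 epochs follow the quoted description.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, n, width = 100, 600, 30

# Placeholder synthetic teacher standing in for the paper's random-network target.
w_teacher = torch.randn(d, 1) / d ** 0.5
X = torch.randn(n, d)
y = torch.tanh(X @ w_teacher)

# Two-layer fully connected network of width k = 30; ReLU is an assumed choice.
model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

# Quoted hyperparameters: Adam, learning rate 3e-3, batch size n/3, 2000 epochs.
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
loss_fn = nn.MSELoss()
batch = n // 3

for epoch in range(2000):
    perm = torch.randperm(n)
    for i in range(0, n, batch):
        idx = perm[i:i + batch]
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        opt.step()

# Evaluate on a fresh test set drawn from the same placeholder teacher.
with torch.no_grad():
    X_test = torch.randn(2 * n, d)
    y_test = torch.tanh(X_test @ w_teacher)
    print("test MSE:", loss_fn(model(X_test), y_test).item())
```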