Bayes-optimal learning of an extensive-width neural network from quadratically many samples
Authors: Antoine Maillard, Emanuele Troiani, Simon Martin, Florent Krzakala, Lenka Zdeborová
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further show empirically that, in the absence of noise, randomly-initialized gradient descent seems to sample the space of weights, leading to zero training loss, and averaging over initialization leads to a test error equal to the Bayes-optimal one. Our experiments are reproducible, and accessible freely in a public repository [Maillard et al., 2024]. |
| Researcher Affiliation | Academia | Antoine Maillard, Department of Mathematics, ETH Zürich, Switzerland; Emanuele Troiani, Statistical Physics of Computation Laboratory, EPFL, Switzerland; Simon Martin, INRIA & Laboratoire de Physique, ENS, Université PSL, France; Lenka Zdeborová, Statistical Physics of Computation Laboratory, EPFL, Switzerland; Florent Krzakala, Information Learning and Physics Laboratory, EPFL, Switzerland |
| Pseudocode | Yes | Algorithm 1: GAMP-RIE |
| Open Source Code | Yes | Our experiments are reproducible, and accessible freely in a public repository [Maillard et al., 2024]. [Maillard et al., 2024] Antoine Maillard, Emanuele Troiani, Simon Martin, Florent Krzakala, and Lenka Zdeborová. Numerical code used for experimental results. https://github.com/SPOC-group/ExtensiveWidthQuadraticSamples, 2024. |
| Open Datasets | No | More concretely, we consider a dataset of $n$ samples $\mathcal{D} = \{y_i, x_i\}_{i=1}^{n}$ where the input data is standard Gaussian of dimension $d$: $(x_i)_{i=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$. We then draw i.i.d. $d$-dimensional teacher-weight vectors $(w_k^\star)_{k=1}^{m} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$, and noise $(z_i)_{i=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_m)$. The paper generates synthetic data from these specified distributions rather than using a publicly available dataset with a direct access link or citation; a hedged sketch of this generation step is given after the table. |
| Dataset Splits | No | The paper states 'We consider a dataset of $n$ samples $\mathcal{D} = \{y_i, x_i\}_{i=1}^{n}$' but does not explicitly define or specify training, validation, or test dataset splits (percentages or absolute counts) for their empirical evaluations, as the data is generated synthetically for each experiment. |
| Hardware Specification | No | A single run of vanilla GD for the models we display can be completed in at most 30 minutes on an average machine without using GPUs. For producing our figures we used around 30 000 hours of computing time. The paper mentions 'an average machine' and 'without using GPUs' but does not provide specific hardware details such as CPU model, memory, or GPU model used for the main compute. |
| Software Dependencies | No | All the simulations are done in PyTorch with the student weights initialized in the prior. The paper mentions 'PyTorch' but does not specify a version number or list other software dependencies with versions. |
| Experiment Setup | Yes | The learning rate is chosen to be suitably large, as it is typically better to train a network with giant steps [Dandi et al., 2023]. We used the learning rates 0.2 for d = 200 and 0.07 for d = 100. In Figure 2 the gradient descent is run with zero regularization, λ = 0. A hedged PyTorch gradient-descent sketch using these settings is given after the table. |
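The "Open Datasets" row describes how the synthetic teacher-student data is generated. The sketch below reproduces that generation step; only the distributions of the inputs, teacher weights, and noise are taken from the quoted text, while the quadratic activation, the $1/\sqrt{m}$ readout normalization, and the per-neuron placement of the noise are assumptions that should be checked against the paper and the public repository.

```python
# Minimal sketch (not the authors' code) of the synthetic data generation quoted in
# the "Open Datasets" row. The distributions of x_i, w*_k, z_i follow the quote; the
# quadratic activation, the 1/sqrt(m) readout normalization, and the noise placement
# are assumptions.
import torch

d, m = 100, 50            # input dimension and hidden width (illustrative values)
n = d * (d + 1) // 2      # "quadratically many" samples: n of order d^2
noise_std = 0.1           # label-noise level sqrt(Delta) (illustrative value)

torch.manual_seed(0)
X = torch.randn(n, d)          # inputs          x_i  ~ N(0, I_d)
W_star = torch.randn(m, d)     # teacher weights w*_k ~ N(0, I_d)
Z = torch.randn(n, m)          # noise           z_i  ~ N(0, I_m)

pre = X @ W_star.T / d**0.5                       # pre-activations (w*_k . x_i) / sqrt(d)
y = (pre**2 + noise_std * Z).sum(dim=1) / m**0.5  # assumed quadratic readout with per-neuron noise
```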
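The "Software Dependencies" and "Experiment Setup" rows quote PyTorch simulations with student weights initialized from the prior, learning rates of 0.2 (d = 200) and 0.07 (d = 100), and zero regularization. A minimal full-batch gradient-descent loop under those settings, continuing from the data-generation sketch above, might look as follows; the squared loss, the number of steps, and the student architecture are illustrative assumptions rather than specifications from the paper.

```python
# Minimal sketch (not the authors' code): vanilla full-batch gradient descent in PyTorch,
# with student weights initialized from the prior N(0, I_d), zero regularization
# (lambda = 0), and the learning rate quoted for d = 100. Reuses X, y, d, m from the
# data-generation sketch above.
import torch

lr, steps = 0.07, 5_000                    # paper quotes lr = 0.07 for d = 100 (0.2 for d = 200)
W = torch.randn(m, d, requires_grad=True)  # student weights drawn from the prior N(0, I_d)

for step in range(steps):
    pre_student = X @ W.T / d**0.5                # student pre-activations
    y_hat = (pre_student**2).sum(dim=1) / m**0.5  # same assumed quadratic readout as the teacher
    loss = 0.5 * ((y_hat - y) ** 2).mean()        # plain squared loss, no regularization term
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad                          # vanilla gradient-descent update
        W.grad.zero_()                            # reset accumulated gradients
```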