Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization
Authors: Simone Bombari, Mohammad Hossein Amani, Marco Mondelli
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 1, we consider a 3-layer neural network with d = n_1 = n_2, and we plot λ_min(K) as a function of d^2, for three different values of N. The inputs are sampled from a standard Gaussian distribution, the activation function is the sigmoid σ(x) = (1 + e^{-x})^{-1}, and we set β_l = 1 for all l ∈ [L]. We repeat the experiment 10 times, and report average and confidence interval at 1 standard deviation. The linear scaling of λ_min(K) in d^2 is in agreement with the result of Theorem 3.1. The code used to obtain the results of Figure 1 (and Figure 2 as well) is available at https://github.com/simone-bombari/smallest-eigenvalue-NTK/. In Figure 2, we give an illustrative example that 4-layer networks achieve 0 loss when the number of parameters is at least linear in the number of training samples, i.e., under minimum over-parameterization. To ease the experimental setup, we use a ReLU activation, with Adam optimizer. We initialize the network as in the setting of Theorem 3.1, picking β_l = 1 for all l ∈ [L]. The inputs, as well as the targets, are sampled from a standard Gaussian distribution. The plot is averaged over 10 independent trials. (Minimal reproduction sketches of the Figure 1 and Figure 2 setups follow the table.) |
| Researcher Affiliation | Academia | Institute of Science and Technology Austria (ISTA). Emails: {simone.bombari, marco.mondelli}@ist.ac.at. EPFL, Switzerland. Email: mh.amani1998@gmail.com. |
| Pseudocode | No | The paper does not contain any sections explicitly labeled “Pseudocode” or “Algorithm”, nor does it present any structured algorithm blocks. |
| Open Source Code | Yes | The code used to obtain the results of Figure 1 (and Figure 2 as well) is available at https://github.com/simone-bombari/smallest-eigenvalue-NTK/. |
| Open Datasets | Yes | CIFAR-10 has N = 50000 images and roughly 10^6 parameters suffice to fit random labels [72]; furthermore, in order to fit random labels to a subset of 1.2 × 10^6 ImageNet data points, 2.4 × 10^7 parameters are enough [72]. [...] (e.g., data with a Gaussian distribution, uniform on the sphere/hypercube, or obtained via a Generative Adversarial Network) |
| Dataset Splits | No | The paper mentions using only synthetically generated data (standard Gaussian inputs and targets) in its experiments and does not describe any train/validation/test splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments. It only describes the general experimental setup. |
| Software Dependencies | No | The paper mentions using an Adam optimizer but does not list specific software libraries, frameworks, or version numbers. |
| Experiment Setup | Yes | In Figure 1, we consider a 3-layer neural network with d = n_1 = n_2, and we plot λ_min(K) as a function of d^2, for three different values of N. The inputs are sampled from a standard Gaussian distribution, the activation function is the sigmoid σ(x) = (1 + e^{-x})^{-1}, and we set β_l = 1 for all l ∈ [L]. We repeat the experiment 10 times, and report average and confidence interval at 1 standard deviation. [...] In Figure 2, we give an illustrative example that 4-layer networks achieve 0 loss [...] we use a ReLU activation, with Adam optimizer. We initialize the network as in the setting of Theorem 3.1, picking β_l = 1 for all l ∈ [L]. The inputs, as well as the targets, are sampled from a standard Gaussian distribution. The plot is averaged over 10 independent trials. [...] the initialization θ_0 is defined in (18) with γ = d^3 N^2 and η ≤ C(γ N d n_{L-1})^{-1}. (See the sketches below the table.) |
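
The Figure 1 setup lends itself to a short reproduction. Below is a minimal sketch, not the authors' released code, that builds the empirical NTK Gram matrix K = J J^T from the network Jacobian and returns its smallest eigenvalue. The widths, the value of N, and the use of PyTorch's default initialization in place of the β_l-scaled initialization of Theorem 3.1 are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of the Figure 1 experiment:
# estimate lambda_min(K) of the empirical NTK for a 3-layer sigmoid network
# on standard Gaussian inputs. Widths, N, and the default PyTorch
# initialization (instead of the paper's beta_l-scaled one) are assumptions.
import torch


def ntk_smallest_eigenvalue(d: int, N: int, seed: int = 0) -> float:
    torch.manual_seed(seed)
    X = torch.randn(N, d)  # inputs sampled from a standard Gaussian
    # 3-layer network with d = n1 = n2 and scalar output, sigmoid activations
    model = torch.nn.Sequential(
        torch.nn.Linear(d, d), torch.nn.Sigmoid(),
        torch.nn.Linear(d, d), torch.nn.Sigmoid(),
        torch.nn.Linear(d, 1),
    )
    params = list(model.parameters())
    rows = []
    for i in range(N):
        out = model(X[i : i + 1]).squeeze()       # scalar output for sample i
        grads = torch.autograd.grad(out, params)  # gradient w.r.t. all weights
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)  # Jacobian of the outputs, shape (N, num_parameters)
    K = J @ J.T            # empirical NTK Gram matrix K = J J^T
    return torch.linalg.eigvalsh(K).min().item()


if __name__ == "__main__":
    # The paper reports linear scaling of lambda_min(K) in d^2 for this setup.
    for d in (16, 32, 64):
        print(d, ntk_smallest_eigenvalue(d, N=100))
```

Averaging the returned value over several seeds would mirror the paper's averaging over 10 repetitions.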
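
The Figure 2 setup can likewise be sketched as a small training loop: a 4-layer ReLU network fitted to Gaussian inputs and targets with Adam. The width, sample size, learning rate, number of steps, and default initialization below are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch (not the authors' released code) of the Figure 2 experiment:
# a 4-layer ReLU network trained with Adam on standard Gaussian inputs and
# targets. Width, N, learning rate, step count, and the default PyTorch
# initialization (instead of the paper's beta_l-scaled one) are assumptions.
import torch

torch.manual_seed(0)
d, width, N = 32, 64, 200
X = torch.randn(N, d)  # Gaussian inputs
y = torch.randn(N, 1)  # Gaussian (random) targets

model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(),
    torch.nn.Linear(width, width), torch.nn.ReLU(),
    torch.nn.Linear(width, width), torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Full-batch training; the loss should approach zero since the parameter
# count is (well) above linear in the number of training samples.
for step in range(5001):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()
    if step % 1000 == 0:
        print(f"step {step:5d}  train loss {loss.item():.6f}")
```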