Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization

Authors: Simone Bombari, Mohammad Hossein Amani, Marco Mondelli

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Figure 1, we consider a 3-layer neural network with d = n1 = n2, and we plot λ_min(K) as a function of d^2, for three different values of N. The inputs are sampled from a standard Gaussian distribution, the activation function is the sigmoid σ(x) = (1 + e^{-x})^{-1}, and we set β_l = 1 for all l ∈ [L]. We repeat the experiment 10 times, and report average and confidence interval at 1 standard deviation. The linear scaling of λ_min(K) in d^2 is in agreement with the result of Theorem 3.1. The code used to obtain the results of Figure 1 (and Figure 2 as well) is available at https://github.com/simone-bombari/smallest-eigenvalue-NTK/. In Figure 2, we give an illustrative example that 4-layer networks achieve 0 loss when the number of parameters is at least linear in the number of training samples, i.e., under minimum over-parameterization. To ease the experimental setup, we use a ReLU activation, with Adam optimizer. We initialize the network as in the setting of Theorem 3.1, picking β_l = 1 for all l ∈ [L]. The inputs, as well as the targets, are sampled from a standard Gaussian distribution. The plot is averaged over 10 independent trials.
Researcher Affiliation | Academia | Institute of Science and Technology Austria (ISTA). Emails: {simone.bombari, marco.mondelli}@ist.ac.at. EPFL, Switzerland. Email: mh.amani1998@gmail.com.
Pseudocode | No | The paper does not contain any sections explicitly labeled “Pseudocode” or “Algorithm”, nor does it present any structured algorithm blocks.
Open Source Code | Yes | The code used to obtain the results of Figure 1 (and Figure 2 as well) is available at https://github.com/simone-bombari/smallest-eigenvalue-NTK/.
Open Datasets | Yes | CIFAR-10 has N = 50000 images and roughly 10^6 parameters suffice to fit random labels [72]; furthermore, in order to fit random labels to a subset of 1.2 × 10^6 ImageNet data points, 2.4 × 10^7 parameters are enough [72]. [...] (e.g., data with a Gaussian distribution, uniform on the sphere/hypercube, or obtained via a Generative Adversarial Network)
Dataset Splits | No | The paper mentions using [...]
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments. It only mentions general setups like [...]
Software Dependencies | No | The paper mentions using an [...]
Experiment Setup | Yes | In Figure 1, we consider a 3-layer neural network with d = n1 = n2, and we plot λ_min(K) as a function of d^2, for three different values of N. The inputs are sampled from a standard Gaussian distribution, the activation function is the sigmoid σ(x) = (1 + e^{-x})^{-1}, and we set β_l = 1 for all l ∈ [L]. We repeat the experiment 10 times, and report average and confidence interval at 1 standard deviation. [...] In Figure 2, we give an illustrative example that 4-layer networks achieve 0 loss [...] we use a ReLU activation, with Adam optimizer. We initialize the network as in the setting of Theorem 3.1, picking β_l = 1 for all l ∈ [L]. The inputs, as well as the targets, are sampled from a standard Gaussian distribution. The plot is averaged over 10 independent trials. [...] the initialization θ_0 is defined in (18) with γ = d^3 N^2 and η ≤ C (γ N d n_{L-1})^{-1}. (Minimal illustrative sketches of the Figure 1 and Figure 2 setups are given below the table.)
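The Figure 1 setup quoted above can be sketched in a few lines. This is a minimal sketch, not the authors' released code (see the GitHub link in the table): it forms the empirical NTK Gram matrix as K = J J^T, with J the Jacobian of the scalar network output with respect to all parameters at initialization, uses PyTorch's default initialization rather than the paper's β_l = 1 parameterization, and the widths, sample count, and helper name ntk_gram are illustrative.

```python
# Minimal sketch (not the authors' code): lambda_min of the empirical NTK Gram
# matrix for a 3-layer sigmoid network at initialization, Gaussian inputs, d = n1 = n2.
import torch

def ntk_gram(net, X):
    """K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> at the current parameters."""
    params = [p for p in net.parameters() if p.requires_grad]
    rows = []
    for i in range(X.shape[0]):
        out = net(X[i:i + 1]).squeeze()            # scalar output f(x_i)
        grads = torch.autograd.grad(out, params)   # per-parameter gradients
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                          # (N, num_params) Jacobian
    return J @ J.T                                 # (N, N) Gram matrix

d, N = 100, 50                                     # input dim (= hidden widths), sample count
torch.manual_seed(0)

# 3-layer network with d = n1 = n2: two hidden sigmoid layers plus a linear output.
# NOTE: default PyTorch initialization, not the paper's beta_l = 1 parameterization.
net = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.Sigmoid(),
    torch.nn.Linear(d, d), torch.nn.Sigmoid(),
    torch.nn.Linear(d, 1),
)

X = torch.randn(N, d)                              # standard Gaussian inputs
K = ntk_gram(net, X)
print(f"lambda_min(K) = {torch.linalg.eigvalsh(K).min().item():.3e}")
```

Repeating this over several values of d (and averaging over seeds, as in the paper) is what would produce a plot of λ_min(K) against d^2.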
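A hedged sketch of the Figure 2 experiment follows: fitting standard Gaussian inputs and targets with a 4-layer ReLU network trained by Adam until the squared loss is near zero. The widths, learning rate, step count, and layer-counting convention (four weight matrices) are assumptions for illustration, and the initialization is PyTorch's default rather than the paper's scaled θ_0.

```python
# Minimal sketch (illustrative hyperparameters, not taken from the paper):
# train a 4-layer ReLU network with Adam on Gaussian data and check that the
# loss reaches (near) zero when the parameter count is at least linear in N.
import torch

d, n, N = 64, 64, 200                  # input dim, hidden width, sample count (illustrative)
torch.manual_seed(0)

X = torch.randn(N, d)                  # standard Gaussian inputs
y = torch.randn(N, 1)                  # standard Gaussian targets

# "4-layer" is read here as four weight matrices (three hidden ReLU layers);
# default PyTorch initialization, not the paper's scaled theta_0.
net = torch.nn.Sequential(
    torch.nn.Linear(d, n), torch.nn.ReLU(),
    torch.nn.Linear(n, n), torch.nn.ReLU(),
    torch.nn.Linear(n, n), torch.nn.ReLU(),
    torch.nn.Linear(n, 1),
)
print(f"{sum(p.numel() for p in net.parameters())} parameters for N = {N} samples")

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:5d}   loss {loss.item():.3e}")
print(f"final loss: {loss_fn(net(X), y).item():.3e}")
```

Averaging the resulting loss curves over independent trials, as the quoted setup describes, would give a plot analogous to Figure 2.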