The asymptotic spectrum of the Hessian of DNN throughout training
Authors: Arthur Jacot, Franck Gabriel, Clément Hongler
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | All our numerical experiments are done with rectangular networks (with n_1 = ... = n_{L-1}) and match closely the predictions for the sequential limit. Figure 1: Comparison of the theoretical prediction of Corollary 1 for the expectation of the first 4 moments (colored lines) to the empirical average over 250 trials (black crosses) for a rectangular network with two hidden layers of finite widths n_1 = n_2 = 5000 (L = 3) with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled-down 14x14 MNIST with N = 256. (A moment-estimation sketch follows the table.) |
| Researcher Affiliation | Academia | Arthur Jacot, Franck Gabriel & Clément Hongler, Chair of Statistical Field Theory, École Polytechnique Fédérale de Lausanne {arthur.jacot,franck.grabriel,clement.hongler}@epfl.ch |
| Pseudocode | No | The paper contains mathematical derivations and proofs but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | Figure 1: Comparison of the theoretical prediction of Corollary 1 for the expectation of the first 4 moments (colored lines) to the empirical average over 250 trials (black crosses) for a rectangular network with two hidden layers of finite widths n_1 = n_2 = 5000 (L = 3) with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled-down 14x14 MNIST with N = 256. |
| Dataset Splits | No | The paper mentions using a dataset (MNIST with N=256) but does not provide specific training, validation, or test split percentages or sample counts. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running its numerical experiments. |
| Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | All parameters are initialized as iid N(0, 1) Gaussians. In our experiments, we take β = 0.1. The network is trained with respect to the cost functional C(f) = Σ_{i=1}^N c_i(f(x_i)) for strictly convex c_i, summing over a finite dataset x_1, ..., x_N ∈ R^{n_0} of size N. The parameters are then trained with gradient descent on the composition C ∘ F^(L), which defines the usual loss surface of neural networks. Figure 1: ...for a rectangular network with two hidden layers of finite widths n_1 = n_2 = 5000 (L = 3) with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled-down 14x14 MNIST with N = 256. (A training-setup sketch follows the table.) |
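
The quoted setup can be pieced together into a runnable sketch. The snippet below is a minimal reconstruction, not the authors' code: it assumes the NTK-style parameterization from the authors' earlier work (1/√n preactivation scaling, β = 0.1 as a bias scale), uses softplus as a stand-in for the smooth ReLU, substitutes random toy data for the scaled-down 14x14 MNIST subset, and shrinks the hidden widths so the example runs quickly.

```python
# Hedged sketch of the quoted setup: rectangular fully connected network,
# iid N(0, 1) parameters, full-batch gradient descent on an MSE loss.
# Assumptions (not stated in the excerpt): NTK-style 1/sqrt(n) preactivation
# scaling, beta = 0.1 as a bias scale, softplus as the "smooth ReLU",
# random data in place of the 14x14 MNIST subset.
import torch
import torch.nn.functional as F

def init_params(widths):
    # widths = [n0, n1, ..., nL]; equal hidden widths give a rectangular network
    return [(torch.randn(n_out, n_in, requires_grad=True),   # W^(l) ~ N(0, 1)
             torch.randn(n_out, requires_grad=True))         # b^(l) ~ N(0, 1)
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(x, params, beta=0.1):
    a = x
    for l, (W, b) in enumerate(params):
        pre = a @ W.t() / W.shape[1] ** 0.5 + beta * b        # NTK scaling (assumed)
        a = F.softplus(pre) if l < len(params) - 1 else pre   # smooth-ReLU stand-in
    return a

# Toy stand-in for the scaled-down 14x14 MNIST subset of size N = 256
X = torch.randn(256, 14 * 14)
Y = F.one_hot(torch.randint(0, 10, (256,)), 10).float()

# The paper uses n1 = n2 = 5000 (L = 3); smaller widths keep the sketch light
params = init_params([14 * 14, 500, 500, 10])
flat = [p for Wb in params for p in Wb]
lr = 1.0  # placeholder learning rate, not taken from the paper
for step in range(100):
    loss = F.mse_loss(forward(X, params), Y)
    grads = torch.autograd.grad(loss, flat)
    with torch.no_grad():
        for p, g in zip(flat, grads):
            p -= lr * g
```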
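For the moment comparison in Figure 1, the paper averages the first four moments of the Hessian spectrum over 250 trials. The excerpt does not say how those moments were computed; one generic way to estimate them, shown purely as an illustration, is a Hutchinson-style estimator built from repeated Hessian-vector products. The function name and probe count below are made up for the sketch.

```python
# Hedged illustration: estimate the spectral moments (1/P) tr(H^k) of the loss
# Hessian H with Hutchinson probes and repeated Hessian-vector products.
# This is a generic estimator, not necessarily the procedure used in the paper.
import torch

def hessian_spectral_moments(loss, params, num_moments=4, num_probes=10):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    num_params = flat_grad.numel()
    moments = torch.zeros(num_moments)
    for _ in range(num_probes):
        v = torch.randn(num_params)
        hv = v
        for k in range(num_moments):
            # Hessian-vector product: differentiate (grad . hv) w.r.t. the parameters
            hv = torch.autograd.grad(flat_grad @ hv, params, retain_graph=True)
            hv = torch.cat([h.reshape(-1) for h in hv])
            # v^T H^k v estimates tr(H^k); divide by P for the spectral moment
            moments[k] += (v @ hv) / (num_params * num_probes)
    return moments

# Usage with the training sketch above:
# loss = F.mse_loss(forward(X, params), Y)
# print(hessian_spectral_moments(loss, flat))
```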