The asymptotic spectrum of the Hessian of DNN throughout training

Authors: Arthur Jacot, Franck Gabriel, Clément Hongler

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All our numerical experiments are done with rectangular networks (with n1 = ... = n_{L-1}) and match closely the predictions for the sequential limit. Figure 1: Comparison of the theoretical prediction of Corollary 1 for the expectation of the first 4 moments (colored lines) to the empirical average over 250 trials (black crosses) for a rectangular network with two hidden layers of finite widths n1 = n2 = 5000 (L = 3) with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled down 14x14 MNIST with N = 256.
Researcher Affiliation | Academia | Arthur Jacot, Franck Gabriel & Clément Hongler, Chair of Statistical Field Theory, École Polytechnique Fédérale de Lausanne, {arthur.jacot,franck.grabriel,clement.hongler}@epfl.ch
Pseudocode | No | The paper contains mathematical derivations and proofs but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code for the described methodology or a link to a code repository.
Open Datasets | Yes | Figure 1: Comparison of the theoretical prediction of Corollary 1 for the expectation of the first 4 moments (colored lines) to the empirical average over 250 trials (black crosses) for a rectangular network with two hidden layers of finite widths n1 = n2 = 5000 (L = 3) with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled down 14x14 MNIST with N = 256.
Dataset Splits | No | The paper mentions using a dataset (MNIST with N = 256) but does not provide specific training, validation, or test split percentages or sample counts.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running its numerical experiments.
Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | All parameters are initialized as iid N(0, 1) Gaussians. In our experiments, we take β = 0.1. The network is trained with respect to the cost functional C(f) = Σ_{i=1}^{N} c_i(f(x_i)) for strictly convex c_i, summing over a finite dataset x_1, ..., x_N ∈ R^{n_0} of size N. The parameters are then trained with gradient descent on the composition C ∘ F^{(L)}, which defines the usual loss surface of neural networks. Figure 1: ...for a rectangular network with two hidden layers of finite widths n1 = n2 = 5000 (L = 3) with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled down 14x14 MNIST with N = 256.
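
The moment comparison quoted in the Research Type row can be illustrated with a short numerical sketch. The code below is not the authors' implementation: the toy sizes, the synthetic data standing in for the 14x14 MNIST subset, the NTK-style 1/sqrt(fan-in) scaling, and the use of softplus as a stand-in for the paper's smooth ReLU are all assumptions made for illustration. It estimates the first four spectral moments tr(H^k)/P of the loss Hessian at initialization and averages them over random trials, the quantity plotted in Figure 1 (normalization conventions may differ from the paper's).

```python
import jax
import jax.numpy as jnp

L, WIDTH, N, D_IN = 3, 32, 32, 16   # toy sizes; the paper uses widths of 5000 and N = 256
BETA = 0.1                          # bias scaling beta = 0.1, as quoted in the table above

def init_params(key):
    """Flat parameter vector for an L-layer rectangular network, iid N(0, 1)."""
    dims = [D_IN] + [WIDTH] * (L - 1) + [1]
    num_params = sum(m * n + n for m, n in zip(dims[:-1], dims[1:]))
    return jax.random.normal(key, (num_params,)), dims

def forward(theta, dims, x):
    """Assumed NTK-style parameterization: pre-activations scaled by 1/sqrt(fan_in)."""
    idx, h = 0, x
    for layer, (m, n) in enumerate(zip(dims[:-1], dims[1:])):
        W = theta[idx:idx + m * n].reshape(m, n); idx += m * n
        b = theta[idx:idx + n]; idx += n
        h = h @ W / jnp.sqrt(m) + BETA * b
        if layer < len(dims) - 2:
            h = jax.nn.softplus(h)  # softplus as a stand-in for the paper's smooth ReLU
    return h

def loss(theta, dims, x, y):
    return 0.5 * jnp.mean((forward(theta, dims, x) - y) ** 2)  # MSE loss

def hessian_moments(key, x, y, n_moments=4):
    """Spectral moments tr(H^k)/P of the loss Hessian at initialization."""
    theta, dims = init_params(key)
    H = jax.hessian(lambda t: loss(t, dims, x, y))(theta)
    eigs = jnp.linalg.eigvalsh(H)
    return jnp.array([jnp.mean(eigs ** k) for k in range(1, n_moments + 1)])

# Average over random initializations (20 trials here vs. 250 in the figure),
# with synthetic data standing in for the 14x14 MNIST subset.
key = jax.random.PRNGKey(0)
kx, ky, *trial_keys = jax.random.split(key, 22)
x = jax.random.normal(kx, (N, D_IN))
y = jax.random.normal(ky, (N, 1))
avg_moments = jnp.mean(jnp.stack([hessian_moments(k, x, y) for k in trial_keys]), axis=0)
print(avg_moments)  # empirical averages of the first 4 moments
```

With the paper's widths (n1 = n2 = 5000) the full Hessian is far too large to form explicitly; a stochastic trace estimator built on Hessian-vector products would be the practical route there.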
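
For the Open Datasets row, the paper only states that the input is scaled down 14x14 MNIST with N = 256. One plausible construction, sketched below, is 2x2 average pooling of the raw 28x28 images followed by taking the first 256 examples; the pooling choice, the normalization, and the one-hot MSE targets are assumptions, and `images28` / `labels` stand for arrays obtained from any standard MNIST loader.

```python
import jax.numpy as jnp

def downscale_14x14(images28):
    """2x2 average pooling: (n, 28, 28) -> (n, 196)."""
    n = images28.shape[0]
    pooled = images28.reshape(n, 14, 2, 14, 2).mean(axis=(2, 4))
    return pooled.reshape(n, 14 * 14)

def make_dataset(images28, labels, N=256, num_classes=10):
    x = downscale_14x14(images28[:N].astype(jnp.float32) / 255.0)
    y = jnp.eye(num_classes)[labels[:N]]  # one-hot targets for the MSE loss
    return x, y
```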
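
Finally, the Experiment Setup row describes plain gradient descent on the composition C ∘ F^{(L)}. A minimal sketch, reusing `init_params` and `loss` from the first sketch above, might look as follows; the learning rate and step count are illustrative and not taken from the paper.

```python
import jax

def train(key, x, y, lr=1.0, steps=1000):
    theta, dims = init_params(key)  # all parameters iid N(0, 1)
    grad_loss = jax.grad(loss)      # gradient of C o F^(L) w.r.t. the parameters
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta, dims, x, y)  # full-batch gradient descent
    return theta, dims
```

The Hessian spectrum "throughout training" can then be tracked by recomputing the moments along the optimization trajectory, e.g. by calling jax.hessian on the current parameters at selected steps.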