Curvature-corrected learning dynamics in deep neural networks

Authors: Dongsung Huh

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To test the main theoretical results, we conducted a simple synthetic data experiment."
Researcher Affiliation | Collaboration | MIT-IBM Watson AI Lab, Cambridge, Massachusetts, USA.
Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | "To test the main theoretical results, we conducted a simple synthetic data experiment, in which the training and the testing datasets are generated from a random teacher network as $y^\mu = w_{\mathrm{teacher}} x^\mu + z^\mu$, where $x^\mu \in \mathbb{R}^N$ is the whitened input data, $y^\mu \in \mathbb{R}^N$ is the output, and $z^\mu \in \mathbb{R}^N$ is the noise" (Lampinen & Ganguli, 2018). (See the data-generation sketch below.)
Dataset Splits | No | The paper mentions training and testing datasets but does not specify exact split percentages or absolute sample counts, and it does not explicitly mention a validation set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., library or solver names and versions).
Experiment Setup | Yes | The student network is trained from small random initial weights. Hessian$^+$ blocks are computed as described in Bernacchia et al. (2018) and Botev et al. (2017) and combined to obtain the full Hessian$^+$. NGD-d and NGD$^{1/2}$-d only used the diagonal blocks. Numerical pseudo-inverses (and sqrt-inverses) are computed via singular value decomposition (SVD). For numerical stability, NGD and NGD-d used Levenberg-Marquardt damping of $\epsilon = 10^{-5}$ and update-speed clipping. The input-output map of the teacher network $w_{\mathrm{teacher}} \in \mathbb{R}^{N \times N}$ has a low-rank structure (rank 3, Fig. 4A), and the student is a depth $d = 4$ linear network of constant width $N = 16$. The training set $\{x^\mu, y^\mu\}_{\mu=1}^{P}$ has $P = N$ examples. (See the illustrative sketches below.)
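
As an illustration of the data-generation recipe quoted above, here is a minimal NumPy sketch of the random low-rank teacher setup. The function name `make_teacher_data`, the noise level, and the seed are our own assumptions for the sake of a runnable example, not details taken from the paper.

```python
import numpy as np

def make_teacher_data(N=16, P=16, rank=3, noise_std=0.1, seed=0):
    """Draw {x^mu, y^mu}, mu = 1..P, from a random low-rank teacher:
    y^mu = w_teacher @ x^mu + z^mu (cf. Lampinen & Ganguli, 2018)."""
    rng = np.random.default_rng(seed)
    # Rank-3 teacher map w_teacher in R^{N x N}, as in Fig. 4A.
    w_teacher = rng.standard_normal((N, rank)) @ rng.standard_normal((rank, N))
    # Whitened inputs: with P = N, scale an orthogonal basis so (1/P) X X^T = I.
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
    X = np.sqrt(P) * Q[:, :P]                    # inputs x^mu as columns
    Z = noise_std * rng.standard_normal((N, P))  # noise z^mu (std is our guess)
    Y = w_teacher @ X + Z                        # outputs y^mu as columns
    return w_teacher, X, Y
```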
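
Likewise, the SVD-based pseudo-inverse (and sqrt-inverse) with Levenberg-Marquardt damping described in the setup can be sketched as follows. `damped_pinv` is a hypothetical helper name, and this is one plausible reading of the procedure, not the author's released code.

```python
import numpy as np

def damped_pinv(H, eps=1e-5, power=-1.0):
    """(H + eps * I)^power for a symmetric PSD curvature block H, via SVD.
    power=-1.0: damped pseudo-inverse (NGD); power=-0.5: damped
    inverse square root (the sqrt-inverse variant)."""
    U, s, Vt = np.linalg.svd(H, hermitian=True)
    # Levenberg-Marquardt damping: shift every singular value by eps
    # before taking the (possibly fractional) negative power.
    return (U * (s + eps) ** power) @ Vt
```

A per-block update would then look something like `delta = damped_pinv(H_block) @ grad_block`, with the step size clipped to cap the update speed, matching the stability measures listed in the setup row above.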