Curvature-corrected learning dynamics in deep neural networks
Authors: Dongsung Huh
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test the main theoretical results, we conducted a simple synthetic data experiment. |
| Researcher Affiliation | Collaboration | MIT-IBM Watson AI Lab, Cambridge, Massachusetts, USA. |
| Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | To test the main theoretical results, we conducted a simple synthetic data experiment, in which the training and testing datasets are generated from a random teacher network as $y^\mu = w_{\text{teacher}} x^\mu + z^\mu$, where $x^\mu \in \mathbb{R}^N$ is the whitened input data, $y^\mu \in \mathbb{R}^N$ is the output, and $z^\mu \in \mathbb{R}^N$ is the noise (Lampinen & Ganguli, 2018). (See the data-generation sketch below.) |
| Dataset Splits | No | The paper mentions 'training and the testing datasets' but does not specify exact split percentages, absolute sample counts, or explicit mention of a validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). |
| Experiment Setup | Yes | The student network is trained from small random initial weights. Hessian⁺ blocks are computed as described in Bernacchia et al. (2018) and Botev et al. (2017) and combined to obtain the full Hessian⁺; the diagonal variants (NGD-d and its square-root counterpart) use only the diagonal blocks. Numerical pseudo-inverses (and sqrt-inverses) are computed via singular value decomposition (SVD). For numerical stability, NGD and NGD-d used Levenberg-Marquardt damping of $\epsilon = 10^{-5}$ and update-speed clipping. The input-output map of the teacher network $w_{\text{teacher}} \in \mathbb{R}^{N \times N}$ has a low-rank structure (rank 3, Fig. 4A), and the student is a depth $d = 4$ linear network of constant width $N = 16$. The size of the training dataset $\{x^\mu, y^\mu\}_{\mu=1}^{P}$ is set to $P = N$. (See the numerical sketch below.) |
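
The synthetic data generation quoted in the Open Datasets row can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's code: the noise scale, random seed, and variable names are assumptions, while $N = 16$, $P = N$, and the rank-3 teacher follow the Experiment Setup row.

```python
import numpy as np

# Minimal sketch of the synthetic data generation:
#   y^mu = w_teacher x^mu + z^mu
# The noise scale (0.1) and seed are assumptions; N = 16, P = N, and the
# rank-3 teacher follow the paper's Experiment Setup description.
rng = np.random.default_rng(0)
N, P, rank = 16, 16, 3

# Low-rank teacher map w_teacher in R^{N x N} (rank 3).
w_teacher = rng.standard_normal((N, rank)) @ rng.standard_normal((rank, N))

# Whitened inputs: orthonormal columns, so X @ X.T = I.
X, _ = np.linalg.qr(rng.standard_normal((N, P)))

Z = 0.1 * rng.standard_normal((N, P))  # noise z^mu (scale assumed)
Y = w_teacher @ X + Z                  # targets y^mu
```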
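
Similarly, the numerical recipe in the Experiment Setup row (SVD-based pseudo-inverses and sqrt-inverses, Levenberg-Marquardt damping with $\epsilon = 10^{-5}$, and update-speed clipping) admits a short sketch. Two readings here are assumptions: damping is applied by adding $\epsilon$ to the singular values before inversion, and update-speed clipping is interpreted as norm clipping of the step; `damped_pinv` and `clipped_step` are hypothetical helper names, not the author's.

```python
import numpy as np

def damped_pinv(M, eps=1e-5, sqrt=False):
    """SVD-based (sqrt-)pseudo-inverse with Levenberg-Marquardt damping.

    Singular values s are inverted as 1 / (s + eps); with sqrt=True the
    square-root inverse 1 / sqrt(s + eps) is returned instead. Adding eps
    to the singular values is one common reading of LM damping (assumption).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    inv = 1.0 / (s + eps)
    if sqrt:
        inv = np.sqrt(inv)
    # Reassemble the pseudo-inverse: V diag(inv) U^T.
    return Vt.T @ (inv[:, None] * U.T)

def clipped_step(grad, precond, lr=0.05, max_norm=1.0):
    """Preconditioned update with update-speed clipping (norm clipping assumed)."""
    step = lr * (precond @ grad)
    speed = np.linalg.norm(step)
    if speed > max_norm:
        step *= max_norm / speed
    return step
```

In this reading, full NGD would apply `damped_pinv` to the combined Hessian⁺ and precondition the vectorized gradient with it, while the diagonal variants would apply it blockwise to the per-layer diagonal blocks.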