Neglected Hessian component explains mysteries in sharpness regularization

Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide empirical and theoretical evidence that the NME is important to the performance of gradient penalties and explains their sensitivity to activation functions." and "We conducted experimental studies to answer this question in the context of curvature regularization algorithms which seek to promote convergence to flat areas of the loss landscape." (A minimal JAX sketch of such a gradient penalty appears after the table.)
Researcher Affiliation | Industry | Yann N. Dauphin, Google DeepMind, ynd@google.com; Atish Agarwala, Google DeepMind, thetish@google.com; Hossein Mobahi, Google DeepMind, hmobahi@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Appendix C.2 contains actual code snippets in JAX, not pseudocode.
Open Source Code | No | "Publicly available datasets are used, but the code is not open source." (from NeurIPS Checklist Q5)
Open Datasets | Yes | Fashion MNIST: "We also include results on Fashion MNIST [24]." CIFAR-10: "We provide results on the CIFAR-10 dataset [25]." ImageNet: "We conduct experiments on the popular Imagenet dataset [27]."
Dataset Splits | No | The paper mentions using standard setups for CIFAR-10 and ImageNet, which imply standard splits, but does not explicitly state the training, validation, and test splits with percentages or sample counts.
Hardware Specification | Yes | "Models are trained on 8 Nvidia Volta GPUs." and "Models are trained using TPU V3 chips."
Software Dependencies | No | The paper mentions using JAX (Appendix C.2, reference [40]) but does not provide specific version numbers for JAX or any other software dependencies.
Experiment Setup | Yes | "All experiments use the Wide Resnet 28-10 architecture with the same setup and hyperparameters as [26], except for the use of cosine learning rate decay. Batch size is 128." and "All experiments use the Resnet-50 architecture with the same setup and hyperparameters as [28], except that we use cosine learning rate decay [29] over 350 epochs. Batch size is set to 1024." (A hedged configuration sketch appears after the table.)
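
The rows above refer to gradient penalties and curvature (sharpness) regularization. As a point of reference only, here is a minimal JAX sketch of a squared-gradient-norm penalty, the kind of regularizer whose own gradient involves Hessian-vector products and therefore touches both the Gauss-Newton term and the NME. The function names, the toy loss, and the value of rho are illustrative assumptions; the paper's actual JAX snippets in Appendix C.2 may differ.

```python
# Minimal sketch (not the paper's code): regularized loss = loss + rho * ||grad loss||^2.
import jax
import jax.numpy as jnp


def penalized_loss(params, batch, loss_fn, rho=0.1):
    """Task loss plus a squared gradient-norm penalty (rho is illustrative)."""
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Squared L2 norm of the gradient, summed over all parameter leaves.
    grad_sq = sum(jnp.vdot(g, g) for g in jax.tree_util.tree_leaves(grads))
    return loss + rho * grad_sq


# Toy usage: differentiating through the penalty requires second-order terms,
# i.e. Hessian-vector products, which is where the NME enters.
def mse_loss(params, batch):
    x, y = batch
    return jnp.mean((x @ params["w"] - y) ** 2)


params = {"w": jnp.ones((3,))}
batch = (jnp.ones((4, 3)), jnp.zeros((4,)))
value, grads = jax.value_and_grad(penalized_loss)(params, batch, mse_loss)
```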
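
For the experiment-setup row, the sketch below spells out the stated ImageNet configuration (ResNet-50, batch size 1024, cosine learning-rate decay over 350 epochs) using optax. The peak learning rate and momentum are assumptions for illustration; the paper only specifies JAX, not optax.

```python
# Hedged configuration sketch; peak LR and momentum are assumed, not from the paper.
import optax

NUM_TRAIN_EXAMPLES = 1_281_167   # ImageNet-1k training-set size
BATCH_SIZE = 1024                # stated in the paper's ResNet-50 setup
NUM_EPOCHS = 350                 # stated in the paper's ResNet-50 setup
STEPS_PER_EPOCH = NUM_TRAIN_EXAMPLES // BATCH_SIZE

schedule = optax.cosine_decay_schedule(
    init_value=0.4,                           # assumed peak learning rate
    decay_steps=NUM_EPOCHS * STEPS_PER_EPOCH,
)
optimizer = optax.sgd(learning_rate=schedule, momentum=0.9)  # momentum assumed
```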