Neglected Hessian component explains mysteries in sharpness regularization
Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We provide empirical and theoretical evidence that the NME is important to the performance of gradient penalties and explains their sensitivity to activation functions." and "We conducted experimental studies to answer this question in the context of curvature regularization algorithms which seek to promote convergence to flat areas of the loss landscape." (Gradient penalties of this kind are illustrated in the sketch after this table.) |
| Researcher Affiliation | Industry | Yann N. Dauphin, Google DeepMind, ynd@google.com; Atish Agarwala, Google DeepMind, thetish@google.com; Hossein Mobahi, Google DeepMind, hmobahi@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Appendix C.2 contains actual JAX code snippets, not pseudocode. |
| Open Source Code | No | Publicly available datasets are used, but the code is not open source. (from NeurIPS Checklist Q5) |
| Open Datasets | Yes | Fashion MNIST: "We also include results on Fashion MNIST [24]." CIFAR-10: "We provide results on the CIFAR-10 dataset [25]." ImageNet: "We conduct experiments on the popular Imagenet dataset [27]." |
| Dataset Splits | No | The paper references standard setups for CIFAR-10 and ImageNet, which imply standard splits, but it does not explicitly state training, validation, and test splits with percentages or sample counts. |
| Hardware Specification | Yes | "Models are trained on 8 Nvidia Volta GPUs" and "Models are trained using TPU V3 chips." |
| Software Dependencies | No | The paper mentions using JAX (Appendix C.2, reference [40]) but does not provide specific version numbers for JAX or any other software dependencies. |
| Experiment Setup | Yes | "All experiments use the Wide Resnet 28-10 architecture with the same setup and hyperparameters as [26], except for the use of cosine learning rate decay. Batch size is 128." and "All experiments use the Resnet-50 architecture with the same setup and hyperparameters as [28], except that we use cosine learning rate decay [29] over 350 epochs. Batch size is set to 1024." |
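
The paper studies gradient penalties for sharpness regularization and provides JAX snippets in Appendix C.2. As a minimal sketch of the general idea only, and not the authors' implementation, the code below adds the squared norm of the loss gradient to a toy training objective; `base_loss`, `penalized_loss`, and the linear model are hypothetical names introduced here for illustration.

```python
# Illustrative sketch only -- NOT the paper's Appendix C.2 code.
# All names (base_loss, penalized_loss, the toy linear model) are hypothetical.
import jax
import jax.numpy as jnp


def base_loss(params, batch):
    # Toy task loss: mean squared error of a linear model.
    inputs, targets = batch
    preds = inputs @ params["w"] + params["b"]
    return jnp.mean((preds - targets) ** 2)


def penalized_loss(params, batch, rho=0.1):
    # Task loss plus a penalty on the squared gradient norm, a common way
    # to bias training toward flatter regions of the loss landscape.
    loss, grads = jax.value_and_grad(base_loss)(params, batch)
    grad_sq_norm = sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))
    return loss + rho * grad_sq_norm


# Example usage with random data; differentiating penalized_loss requires
# second-order autodiff, which JAX supports.
key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (4, 1)), "b": jnp.zeros((1,))}
batch = (jax.random.normal(key, (8, 4)), jnp.ones((8, 1)))
total_grads = jax.grad(penalized_loss)(params, batch)
```

In a sketch like this, differentiating the penalty term involves the Hessian of the loss, which is where second-order structure such as the NME component studied in the paper enters training.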