Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
Authors: Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, Michael W. Mahoney
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple networks show that saddle-points are not the cause of the generalization gap of large batch size training, and the results consistently show that large batch training converges to points with noticeably higher Hessian spectrum. We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10, and CIFAR-100 datasets. |
| Researcher Affiliation | Academia | Zhewei Yao1 Amir Gholami1 Qi Lei2 Kurt Keutzer1 Michael W. Mahoney1 1 University of California at Berkeley, {zheweiy, amirgh, keutzer and mahoneymw}@berkeley.edu 2 University of Texas at Austin, leiqi@ices.utexas.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions a follow-up paper ([33]) that designed a new algorithm, but does not provide any link or explicit statement about the source code for the methodology presented in this paper. |
| Open Datasets | Yes | We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10, and CIFAR-100 datasets. |
| Dataset Splits | Yes | We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10, and CIFAR-100 datasets. For the original training, we set the learning rate to 0.01 and momentum to 0.9, and decay the learning rate by half after every 5 epochs, for a total of 100 epochs. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For the original training, we set the learning rate to 0.01 and momentum to 0.9, and decay the learning rate by half after every 5 epochs, for a total of 100 epochs. Then we perform an additional five epochs of adversarial training with a learning rate of 0.01. The perturbation magnitude, ε, is set to 0.1 for the L1 attack and 2.8 for the L2 attack. We also present results for the C3 model [4] on CIFAR-10, using the same hyper-parameters, except that the training is performed for 100 epochs. Afterwards, adversarial training is performed for a subsequent 10 epochs with a learning rate of 0.01 and momentum of 0.9 (the learning rate is decayed by half after five epochs). Furthermore, the adversarial perturbation magnitude is set to ε = 0.02 for the L1 attack and 1.2 for the L2 attack [27]. |
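
For concreteness, the schedule quoted in the Experiment Setup row (SGD with learning rate 0.01, momentum 0.9, learning rate halved every 5 epochs, 100 epochs of standard training followed by a short adversarial fine-tune) can be sketched as below. This is a minimal sketch, assuming PyTorch: the tiny model, random data, and the FGSM-style sign-of-gradient attack are illustrative placeholders, not the authors' released networks, datasets, or attack implementation.

```python
import torch
import torch.nn as nn

# Sketch of the quoted schedule: SGD, lr=0.01, momentum=0.9,
# learning rate halved every 5 epochs, 100 epochs of training.
# Model and data below are placeholders (not the paper's setup).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(64, 1, 28, 28)              # stand-in for an MNIST batch
targets = torch.randint(0, 10, (64,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(100):                         # original training
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                             # halves lr every 5 epochs

# Five extra epochs of adversarial training at lr=0.01. A sign-of-gradient
# (FGSM-style) perturbation is used purely as an illustrative attack; the
# paper's exact attack construction may differ.
eps = 0.1
adv_opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
for epoch in range(5):
    inputs.requires_grad_(True)
    grad, = torch.autograd.grad(criterion(model(inputs), targets), inputs)
    adv_inputs = (inputs + eps * grad.sign()).detach()
    inputs = inputs.detach()
    adv_opt.zero_grad()
    criterion(model(adv_inputs), targets).backward()
    adv_opt.step()
```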
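
The "higher Hessian spectrum" finding quoted in the Research Type row concerns the dominant eigenvalues of the training-loss Hessian. A common matrix-free way to estimate the top eigenvalue is power iteration on Hessian-vector products computed by double backpropagation; the sketch below illustrates that general technique under placeholder assumptions (small model, random data) and is not the authors' own tooling, which is not linked from this entry.

```python
import torch
import torch.nn as nn

# Sketch: estimate the top eigenvalue of the loss Hessian with power
# iteration on Hessian-vector products (matrix-free, double backprop).
# Model and data are placeholders; this shows the general technique only.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
inputs, targets = torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,))

params = [p for p in model.parameters() if p.requires_grad]
loss = criterion(model(inputs), targets)
grads = torch.autograd.grad(loss, params, create_graph=True)

v = [torch.randn_like(p) for p in params]        # random starting vector
for _ in range(20):                              # power iteration steps
    # Hessian-vector product: gradient of (grad . v) w.r.t. the parameters
    hv = torch.autograd.grad(
        sum((g * u).sum() for g, u in zip(grads, v)),
        params, retain_graph=True)
    norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
    v = [h / norm for h in hv]

# Rayleigh quotient with the (approximately) converged eigenvector
hv = torch.autograd.grad(
    sum((g * u).sum() for g, u in zip(grads, v)), params, retain_graph=True)
top_eig = sum((h * u).sum() for h, u in zip(hv, v))
print(f"estimated top Hessian eigenvalue: {top_eig.item():.4f}")
```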