BN-invariant Sharpness Regularizes the Training Model to Better Generalization

Authors: Mingyang Yi, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu

IJCAI 2019

Reproducibility Variable Result LLM Response
Research Type | Experimental | Our algorithm achieves considerably better performance than vanilla SGD over various experiment settings. [...] We test our algorithm on CIFAR dataset [Krizhevsky et al., 2012], it has preferable results under large batch size compared with baselines (SGD, Entropy SGD).
Researcher Affiliation | Collaboration | Mingyang Yi (1,2), Huishuai Zhang (3), Wei Chen (3), Zhi-Ming Ma (1,2) and Tie-Yan Liu (3); 1 University of Chinese Academy of Sciences; 2 Academy of Mathematics and Systems Science; 3 Microsoft Research; yimingyang17@mails.ucas.edu.cn, mazm@amt.ac.cn, {huzhang, wche, tie-yan.liu}@microsoft.com
Pseudocode | Yes | Algorithm 1: SGD with BN-Sharpness regularization
Open Source Code | No | The paper does not provide an explicit statement or link for open-source code availability.
Open Datasets | Yes | First we test the algorithm with fully batch normalized LeNet [LeCun et al., 1998] to test the performance for CIFAR10 [Krizhevsky et al., 2012].
Dataset Splits | Yes | First we test the algorithm with fully batch normalized LeNet [LeCun et al., 1998] to test the performance for CIFAR10 [Krizhevsky et al., 2012]. [...] For SGDS, the δ in CIFAR10 is 5e-4 and in CIFAR100 is 1e-3; the learning rate is 0.2 and decays by a factor of 0.1 at epochs 60, 120, 160.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | The update rule is SGD with momentum, with the learning rate set to 0.2 and decayed by a factor of 0.1 at epochs 60, 120, and 160, and the momentum parameter set to 0.9. We use a batch size of 10000 and a weight decay ratio of 5e-4 for all three experiments. [...] For the experiments with regularized BN-Sharpness, we choose λ as 1e-4, which increases by a factor of 1.02 each epoch. We set δ = 0.001, and p is chosen as 2.
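The reported schedules can be sketched as two small helpers. This is a minimal illustration of the stated hyperparameters only (the paper releases no code, and the function names here are hypothetical): a step learning-rate decay of 0.1 at epochs 60/120/160 from a base of 0.2, and a BN-Sharpness weight λ starting at 1e-4 and grown by 1.02 per epoch.

```python
def lr_at_epoch(epoch, base_lr=0.2, milestones=(60, 120, 160), gamma=0.1):
    """Step schedule: multiply base_lr by gamma at each milestone epoch passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr


def bn_sharpness_lambda(epoch, base_lambda=1e-4, growth=1.02):
    """BN-Sharpness regularization weight, grown by a fixed factor each epoch."""
    return base_lambda * growth ** epoch
```

For example, `lr_at_epoch(59)` still returns 0.2, while `lr_at_epoch(60)` returns 0.02; by epoch 160 the rate has dropped to 2e-4.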