Stochastic Normalization

Authors: Zhi Kou, Kaichao You, Mingsheng Long, Jianmin Wang

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical experiments show that StochNorm is a powerful tool to avoid over-fitting in fine-tuning with small datasets. Besides, StochNorm is readily pluggable into modern CNN backbones. It is complementary to other fine-tuning methods and can work together with them to achieve a stronger regularization effect.
Researcher Affiliation | Academia | Zhi Kou, Kaichao You, Mingsheng Long, Jianmin Wang. School of Software, BNRist, Research Center for Big Data, Tsinghua University, China. {kz19,ykc20}@mails.tsinghua.edu.cn, {mingsheng,jimwang}@tsinghua.edu.cn
Pseudocode | Yes | StochNorm is intuitively described in Figure 1 and summarized in detail by Algorithm 1. (A hedged sketch of such a layer appears after this table.)
Open Source Code | Yes | The code is available at https://github.com/thuml/StochNorm.
Open Datasets | Yes | The evaluation is conducted on four standard datasets. CUB-200-2011 (Welinder et al., 2010)... Stanford Cars (Krause et al., 2013)... FGVC Aircraft (Maji et al., 2013)... NIH Chest X-ray (Wang et al., 2017)
Dataset Splits | Yes | We follow the train/validation/test partition of each dataset. For datasets without validation data, we use 20% of the training data for validation and use the same validation data for all methods. (A minimal split sketch appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper states 'Experiments are implemented based on PyTorch (Benoit et al., 2019)' but does not provide a specific version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | The learning rate for the last layer is set to 10 times that of the fine-tuned layers because the parameters in the last layer are randomly initialized. We adopt SGD with momentum of 0.9 together with the progressive training strategies in Li et al. (2018). Experiments are repeated five times to obtain the mean and standard deviation. Hyper-parameters for each method are selected on validation data. We follow the train/validation/test partition of each dataset. For datasets without validation data, we use 20% of the training data for validation and use the same validation data for all methods. The selection probability p = 0.5 works well for most experiments. (A hedged optimizer sketch appears after this table.)
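
The Pseudocode row points to Figure 1 and Algorithm 1 rather than reproducing them. For orientation, below is a minimal, hedged sketch of a StochNorm-style 2D layer, assuming the channel-wise selection the paper describes: during training, each channel is normalized with either the mini-batch statistics or the moving-average statistics, chosen at random with probability p (p = 0.5 in most experiments), and at evaluation time the layer behaves like standard batch normalization. The class name StochNorm2d, the selection convention, and the defaults are illustrative; the authors' released code at https://github.com/thuml/StochNorm is the reference implementation.

```python
import torch
import torch.nn as nn


class StochNorm2d(nn.Module):
    """Hedged sketch of a StochNorm-style layer (not the authors' released code).

    During training, each channel is normalized with either the mini-batch
    statistics or the moving-average statistics, selected per channel with
    probability p. At evaluation time it reduces to standard BatchNorm.
    """

    def __init__(self, num_features, p=0.5, eps=1e-5, momentum=0.1):
        super().__init__()
        self.p = p
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim=(0, 2, 3))
            batch_var = x.var(dim=(0, 2, 3), unbiased=False)
            with torch.no_grad():
                # Standard BatchNorm-style update of the moving statistics.
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * batch_var)
            # Per-channel Bernoulli mask: 1 -> moving statistics, 0 -> batch statistics.
            # (Which branch gets probability p is an assumption; with p = 0.5 it is symmetric.)
            mask = (torch.rand(x.size(1), device=x.device) < self.p).float()
            mean = mask * self.running_mean + (1 - mask) * batch_mean
            var = mask * self.running_var + (1 - mask) * batch_var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean.view(1, -1, 1, 1)) / torch.sqrt(var.view(1, -1, 1, 1) + self.eps)
        return self.weight.view(1, -1, 1, 1) * x_hat + self.bias.view(1, -1, 1, 1)
```

Because such a layer keeps the same affine parameters and buffers as nn.BatchNorm2d, it can be swapped into a pretrained backbone and initialized from the existing BatchNorm statistics, which is consistent with the "readily pluggable" claim in the Research Type row.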
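
The Dataset Splits row states that 20% of the training data is held out for validation and that the same validation data is used for all methods, but not how the split is drawn. A minimal sketch, assuming a seeded random split so every compared method sees the identical validation set (the function name, fraction argument, and seed are illustrative):

```python
import torch
from torch.utils.data import random_split


def make_fixed_split(train_dataset, val_fraction=0.2, seed=0):
    """Carve a fixed validation split out of a training set.

    A fixed generator seed guarantees the same split for every method.
    """
    n_val = int(len(train_dataset) * val_fraction)
    n_train = len(train_dataset) - n_val
    generator = torch.Generator().manual_seed(seed)
    return random_split(train_dataset, [n_train, n_val], generator=generator)
```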
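
The Experiment Setup row describes SGD with momentum 0.9 and a learning rate for the randomly initialized last layer that is 10 times that of the fine-tuned layers. A minimal sketch of that parameter grouping, assuming a torchvision ResNet-50 backbone and an illustrative base learning rate (neither the backbone choice nor base_lr is stated in the quoted text):

```python
import torch
import torchvision

base_lr = 0.001  # illustrative value, not taken from the paper
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 200)  # e.g. 200 classes for CUB-200-2011

# Fine-tuned backbone layers at base_lr; the new last layer at 10x base_lr.
backbone_params = [p for name, p in model.named_parameters() if not name.startswith("fc.")]
optimizer = torch.optim.SGD(
    [
        {"params": backbone_params, "lr": base_lr},
        {"params": model.fc.parameters(), "lr": 10 * base_lr},
    ],
    momentum=0.9,
)
```

The progressive training strategy of Li et al. (2018) and the learning-rate schedule are not shown here, since the review quotes no details about them.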