Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

Authors: Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally verified the sharpness-reduction phenomenon predicted by our theorem and its benefits to generalization on CIFAR-10 with VGG-11 and ResNet-20, as well as matrix completion with BN (Appendix P).
Researcher Affiliation | Academia | Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora; Department of Computer Science, Princeton University; {klyu,zhiyuanli,arora}@cs.princeton.edu
Pseudocode | No | The paper describes algorithms and theoretical steps in prose, and includes mathematical equations, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See our supplementary material.
Open Datasets | Yes | We experimentally verified the sharpness-reduction phenomenon predicted by our theorem and its benefits to generalization on CIFAR-10 with VGG-11 and ResNet-20, as well as matrix completion with BN (Appendix P).
Dataset Splits | No | The paper states 'Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix Q.' However, the main text provided does not explicitly detail the training/validation/test dataset splits with percentages, counts, or a specific methodology.
Hardware Specification | No | The paper states 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix Q.' However, the main text provided does not contain specific hardware details such as GPU/CPU models or memory amounts.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'CUDA 11.1') in the provided main text.
Experiment Setup | No | The paper mentions a constant learning rate η̂ and weight decay λ̂ with full-batch GD, and refers to hyperparameters in Appendix Q, but it does not provide specific numerical values for hyperparameters or detailed system-level training settings in the main text.
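The last row notes that the main text only sketches the training setup: full-batch gradient descent with a constant learning rate η̂ and weight decay λ̂ on normalized networks, with concrete values deferred to Appendix Q. For readers who want a concrete picture of that kind of setup, the snippet below is a minimal, hypothetical PyTorch-style sketch, not the authors' released code: the toy data, architecture, and hyperparameter values are placeholders, and "sharpness" is estimated as the top Hessian eigenvalue via power iteration, a common proxy for the quantity the paper studies.

```python
# Hypothetical illustration only (not the authors' released code): full-batch
# gradient descent with a constant learning rate and weight decay on a small
# batch-normalized network, plus a power-iteration estimate of the top Hessian
# eigenvalue ("sharpness"). Data, architecture, and hyperparameters are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data standing in for a real dataset such as CIFAR-10.
X = torch.randn(256, 32)
y = torch.randint(0, 10, (256,))

# Small normalized network; BatchNorm plays the role of the normalization layer.
model = nn.Sequential(
    nn.Linear(32, 64, bias=False),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
loss_fn = nn.CrossEntropyLoss()

lr = 0.1             # constant learning rate (placeholder for the paper's η̂)
weight_decay = 5e-4  # weight decay (placeholder for the paper's λ̂)
opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)


def full_batch_loss():
    # Full-batch GD: the loss is computed over the entire dataset at once.
    return loss_fn(model(X), y)


def sharpness_estimate(n_iter=20):
    """Rough estimate of the top eigenvalue of the loss Hessian via power
    iteration with Hessian-vector products (a common proxy for sharpness)."""
    params = [p for p in model.parameters() if p.requires_grad]
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iter):
        loss = full_batch_loss()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        hv = torch.autograd.grad(
            sum((g * u).sum() for g, u in zip(grads, v)), params
        )
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        eig = norm.item()
        v = [h / (norm + 1e-12) for h in hv]
    return eig


# Full-batch gradient descent with a constant step size.
for step in range(200):
    opt.zero_grad()
    loss = full_batch_loss()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}  sharpness ≈ {sharpness_estimate():.2f}")
```

Power iteration with Hessian-vector products is used here only because it needs nothing beyond autograd; the paper's actual experiments, hyperparameters, and measurement procedure are described in its appendices and supplementary code.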