Fixup Initialization: Residual Learning Without Normalization

Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply Fixup to replace batch normalization on image classification benchmarks CIFAR-10 (with Wide ResNet) and ImageNet (with ResNet), and find Fixup with proper regularization matches the well-tuned baseline trained with normalization." (Section 4.2) "Machine translation. We apply Fixup to replace layer normalization on machine translation benchmarks IWSLT and WMT using the Transformer model, and find it outperforms the baseline and achieves new state-of-the-art results on the same architecture." (Section 4.3)
Researcher Affiliation | Collaboration | Hongyi Zhang (MIT, hongyiz@mit.edu), Yann N. Dauphin (Google Brain, yann@dauphin.io), Tengyu Ma (Stanford University, tengyuma@stanford.edu). Work done at Facebook; Zhang and Dauphin contributed equally.
Pseudocode | No | The 'Fixup initialization' steps are presented as a numbered list within a paragraph, not in a clearly labeled 'Pseudocode' or 'Algorithm' block (a hedged sketch of these steps appears after this table).
Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the methodology described in the paper.
Open Datasets | Yes | "We apply Fixup to replace batch normalization on image classification benchmarks CIFAR-10 (with Wide ResNet) and ImageNet (with ResNet)..."
Dataset Splits | Yes | "Best Mixup coefficients are found through cross-validation: they are 0.2, 0.1 and 0.7 for Batch Norm, Group Norm (Wu & He, 2018) and Fixup respectively."
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided.
Software Dependencies | No | "Specifically, we use the fairseq library (Gehring et al., 2017) and follow the Fixup template in Section 3 to modify the baseline model." (No version is specified for fairseq or other libraries.)
Experiment Setup | Yes | "We use the default batch size of 128 up to 1000 layers, with a batch size of 64 for 10,000 layers."
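
Because the paper presents Fixup initialization only as a numbered list and no code release is noted above, the following is a minimal PyTorch sketch of those steps for a basic residual block, assuming m = 2 convolutions per residual branch. The class and argument names (FixupBasicBlock, num_branches) are illustrative, not taken from the authors' code. Step 2 rescales the first convolution of each branch by L^(-1/(2m-2)), where L is the total number of residual branches in the network.

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Sketch of a residual block using Fixup initialization (no normalization)."""

    def __init__(self, channels, num_branches, m=2):
        super().__init__()
        # Step 3: scalar biases before each convolution/activation and a
        # scalar multiplier on the residual branch, all learnable.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.relu = nn.ReLU(inplace=True)
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))

        # Step 2: standard (He) init, then rescale the non-final branch
        # weights by L^(-1/(2m-2)), with L = num_branches.
        nn.init.kaiming_normal_(self.conv1.weight, mode='fan_out',
                                nonlinearity='relu')
        with torch.no_grad():
            self.conv1.weight.mul_(num_branches ** (-1.0 / (2 * m - 2)))
        # Step 1: the last layer of the residual branch starts at zero.
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x):
        out = self.conv1(x + self.bias1a)
        out = self.relu(out + self.bias1b)
        out = self.conv2(out + self.bias2a)
        return self.relu(out * self.scale + self.bias2b + x)

# Example: one block in a hypothetical network with 8 residual branches.
block = FixupBasicBlock(channels=64, num_branches=8)
```

In a full network, the final classification layer's weight and bias would also be zero-initialized (step 1), and no BatchNorm or LayerNorm layers are used anywhere.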