Fixup Initialization: Residual Learning Without Normalization
Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply Fixup to replace batch normalization on image classification benchmarks CIFAR-10 (with Wide-ResNet) and ImageNet (with ResNet), and find Fixup with proper regularization matches the well-tuned baseline trained with normalization. (Section 4.2) Machine translation. We apply Fixup to replace layer normalization on machine translation benchmarks IWSLT and WMT using the Transformer model, and find it outperforms the baseline and achieves new state-of-the-art results on the same architecture. (Section 4.3) |
| Researcher Affiliation | Collaboration | Hongyi Zhang, MIT, hongyiz@mit.edu; Yann N. Dauphin, Google Brain, yann@dauphin.io; Tengyu Ma, Stanford University, tengyuma@stanford.edu. Footnotes: equal contribution; work done at Facebook. |
| Pseudocode | No | The 'Fixup initialization' steps are presented as a numbered list within a paragraph, not in a clearly labeled 'Pseudocode' or 'Algorithm' block (a hedged sketch of these steps appears after this table). |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the methodology described in the paper. |
| Open Datasets | Yes | We apply Fixup to replace batch normalization on image classification benchmarks CIFAR-10 (with Wide-ResNet) and ImageNet (with ResNet)... |
| Dataset Splits | Yes | Best Mixup coefficients are found through cross-validation: they are 0.2, 0.1 and 0.7 for Batch Norm, Group Norm (Wu & He, 2018) and Fixup respectively. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) are provided. |
| Software Dependencies | No | Specifically, we use the fairseq library (Gehring et al., 2017) and follow the Fixup template in Section 3 to modify the baseline model. (No version specified for fairseq or other libraries). |
| Experiment Setup | Yes | We use the default batch size of 128 up to 1000 layers, with a batch size of 64 for 10,000 layers. |
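
Since the paper presents Fixup only as a numbered list of steps and no code release is noted above, the following is a minimal, hedged sketch of those three rules for a basic residual block in PyTorch. The class and function names (`FixupBasicBlock`, `fixup_initialize`) and the exact placement of the scalar biases are illustrative assumptions based on the paper's description, not the authors' implementation.

```python
# Hypothetical sketch of the three Fixup rules described in Section 3 of the paper;
# module and attribute names are assumptions, not the authors' released code.
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Residual branch with m = 2 convolutions, scalar biases, and a multiplier."""
    def __init__(self, channels):
        super().__init__()
        # Rule 3: a scalar bias (init 0) before each convolution / activation,
        # and a single scalar multiplier (init 1) per residual branch.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        out = self.conv1(x + self.bias1a)
        out = torch.relu(out + self.bias1b)
        out = self.conv2(out + self.bias2a)
        out = out * self.scale + self.bias2b
        return torch.relu(out + x)

def fixup_initialize(blocks, classifier, m=2):
    """Apply Fixup rules 1 and 2, given the L residual blocks and the classifier."""
    L = len(blocks)
    for block in blocks:
        # Rule 2: standard He init, rescaled by L^(-1/(2m-2)) inside the branch.
        nn.init.kaiming_normal_(block.conv1.weight, mode='fan_out', nonlinearity='relu')
        block.conv1.weight.data.mul_(L ** (-1.0 / (2 * m - 2)))
        # Rule 1: zero-initialize the last layer of each residual branch.
        nn.init.zeros_(block.conv2.weight)
    # Rule 1: zero-initialize the classification layer as well.
    nn.init.zeros_(classifier.weight)
    nn.init.zeros_(classifier.bias)
```

With two weight layers per branch (m = 2), the rescaling factor in rule 2 reduces to L^(-1/2), where L is the total number of residual branches in the network.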