Improving Transformer Optimization Through Better Initialization

Authors: Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on public machine translation benchmarks show that our approach achieves leading accuracy, allowing to train deep Transformer models with 200 layers in both encoder and decoder (over 1000 attention/MLP blocks) without difficulty.
Researcher Affiliation | Collaboration | Layer 6 AI, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada; Vector Institute, Toronto, ON, Canada.
Pseudocode | No | The paper includes mathematical derivations and equations but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code for this work is available here: https://github.com/layer6ai-labs/T-Fixup (a hedged initialization sketch follows this table).
Open Datasets | Yes | We compare Transformer trained with our initialization against leading models on multiple public NMT benchmarks including IWSLT 14 De-En, WMT 17 En-De and low resource language pair WMT 18 Fi-En.
Dataset Splits | No | The paper mentions 'public NMT benchmarks' and shows 'validation curves' but does not provide specific percentages, sample counts, or citations for the train/validation/test splits used.
Hardware Specification | Yes | Training is done on an IBM server with 160 POWER9 CPUs, 600GB RAM and 4 Tesla V100 GPUs.
Software Dependencies | No | All experiments are done using the Fairseq library (Gehring et al., 2017). No version number is given for Fairseq or any other software dependency.
Experiment Setup | Yes | To stay consistent with previous work we train three model sizes: small 512-1024-4, base 512-2048-8 and big 1024-4096-16; where the numbers correspond to embedding dimension, MLP layer size and number of attention heads respectively. All models have 6 layers in both encoder and decoder... Hyper-parameters for each model are chosen through grid search and are listed in Appendix D. To demonstrate that our initialization works well with the Adam optimizer we use Adam for all experiments. (A configuration sketch follows this table.)
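
As a quick illustration of the quoted experiment setup, the minimal sketch below builds the three model sizes (small 512-1024-4, base 512-2048-8, big 1024-4096-16, each with 6 encoder and 6 decoder layers) and an Adam optimizer. It is a hedged stand-in only: it uses plain torch.nn.Transformer rather than the Fairseq models the paper trains, and the Adam hyper-parameters shown are placeholders, not the grid-searched values from Appendix D.

```python
# Sketch: the three model sizes named in the experiment setup, built with
# plain PyTorch (the paper itself uses Fairseq).
import torch
from torch import nn

# (embedding dimension, MLP/feed-forward size, attention heads); 6 layers each
MODEL_SIZES = {
    "small": (512, 1024, 4),
    "base": (512, 2048, 8),
    "big": (1024, 4096, 16),
}

def build_transformer(name: str) -> nn.Transformer:
    d_model, d_ff, n_heads = MODEL_SIZES[name]
    return nn.Transformer(
        d_model=d_model,
        nhead=n_heads,
        num_encoder_layers=6,   # "All models have 6 layers in both encoder and decoder"
        num_decoder_layers=6,
        dim_feedforward=d_ff,
    )

model = build_transformer("base")
# The paper trains with Adam; the learning rate and betas here are placeholders,
# not the grid-searched values listed in Appendix D.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
```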
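
The paper's actual contribution, the T-Fixup weight initialization, is not reproduced in this summary; the released rules are in the repository linked in the table. Purely to illustrate the general shape of such a scheme (Xavier initialization followed by a depth-dependent rescaling of weight matrices), here is a hedged sketch in which the scale factor is an explicit placeholder, not the paper's formula.

```python
# Illustrative only: Xavier init plus a depth-dependent rescaling of weight
# matrices. The placeholder scale is NOT the T-Fixup rule; see
# https://github.com/layer6ai-labs/T-Fixup for the actual initialization.
import torch
from torch import nn

def scaled_xavier_init_(model: nn.Module, num_layers: int) -> None:
    # Assumption: one uniform factor stands in for the paper's per-parameter
    # scaling; it only conveys that deeper models get smaller initial weights.
    placeholder_scale = num_layers ** -0.25
    for param in model.parameters():
        if param.dim() > 1:                     # weight matrices only
            nn.init.xavier_uniform_(param)      # baseline Xavier initialization
            with torch.no_grad():
                param.mul_(placeholder_scale)   # depth-dependent shrinkage (placeholder)

# Example usage on a 6-layer Transformer of the "base" width.
scaled_xavier_init_(nn.Transformer(d_model=512, nhead=8, dim_feedforward=2048), num_layers=6)
```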