Improving Transformer Optimization Through Better Initialization
Authors: Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on public machine translation benchmarks show that our approach achieves leading accuracy, allowing us to train deep Transformer models with 200 layers in both encoder and decoder (over 1000 attention/MLP blocks) without difficulty. |
| Researcher Affiliation | Collaboration | 1Layer 6 AI, Toronto, ON, Canada 2University of Toronto, Toronto, ON, Canada 3Vector Institute, Toronto, ON, Canada. |
| Pseudocode | No | The paper includes mathematical derivations and equations but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for this work is available here: https://github.com/layer6ai-labs/T-Fixup. |
| Open Datasets | Yes | We compare Transformer trained with our initialization against leading models on multiple public NMT benchmarks including IWSLT'14 De-En, WMT'17 En-De and the low-resource language pair WMT'18 Fi-En. |
| Dataset Splits | No | The paper mentions 'public NMT benchmarks' and shows 'validation curves' but does not explicitly provide specific percentages, sample counts, or clear citations for the train/validation/test dataset splits used. |
| Hardware Specification | Yes | Training is done on an IBM server with 160 POWER9 CPUs, 600GB RAM and 4 Tesla V100 GPUs. |
| Software Dependencies | No | All experiments are done using the Fairseq library (Gehring et al., 2017). No version number is given for Fairseq or for any other software dependency. |
| Experiment Setup | Yes | To stay consistent with previous work we train three model sizes: small 512-1024-4, base 512-2048-8 and big 1024-4096-16; where the numbers correspond to embedding dimension, MLP layer size and number of attention heads respectively. All models have 6 layers in both encoder and decoder... Hyper-parameters for each model are chosen through grid search and are listed in Appendix D. To demonstrate that our initialization works well with the Adam optimizer we use Adam for all experiments. |
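
To make the three configurations in the Experiment Setup row concrete, below is a minimal Python sketch that spells them out as Fairseq-style hyperparameter dictionaries. The embedding dimension, MLP (FFN) size, attention heads, and 6-layer encoder/decoder come from the quoted text; the optimizer settings shown are placeholders, since the paper states that the actual hyperparameters were chosen by grid search and listed in Appendix D.

```python
# Minimal sketch (not from the paper's released code): the small/base/big
# Transformer configurations described in the Experiment Setup row,
# written as Fairseq-style hyperparameter dictionaries.

MODEL_SIZES = {
    #         embed dim                   FFN dim                       attention heads
    "small": dict(encoder_embed_dim=512,  encoder_ffn_embed_dim=1024, encoder_attention_heads=4),
    "base":  dict(encoder_embed_dim=512,  encoder_ffn_embed_dim=2048, encoder_attention_heads=8),
    "big":   dict(encoder_embed_dim=1024, encoder_ffn_embed_dim=4096, encoder_attention_heads=16),
}

def build_config(size: str) -> dict:
    """Assemble an illustrative training configuration for one model size."""
    cfg = dict(MODEL_SIZES[size])
    cfg.update(
        encoder_layers=6,        # 6 layers in both encoder and decoder
        decoder_layers=6,
        optimizer="adam",        # the paper uses Adam for all experiments
        adam_betas=(0.9, 0.98),  # placeholder; actual values are in Appendix D
        lr=1e-3,                 # placeholder; actual values are in Appendix D
    )
    return cfg

if __name__ == "__main__":
    for size in MODEL_SIZES:
        print(size, build_config(size))
```

The dictionary keys mirror common Fairseq argument names only for readability; they are not a claim about which flags the authors actually passed to `fairseq-train`.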