Improving Transformer Optimization Through Better Initialization

Authors: Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on public machine translation benchmarks show that our approach achieves leading accuracy, allowing to train deep Transformer models with 200 layers in both encoder and decoder (over 1000 attention/MLP blocks) without difficulty.
Researcher Affiliation | Collaboration | Layer 6 AI, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada; Vector Institute, Toronto, ON, Canada.
Pseudocode | No | The paper includes mathematical derivations and equations but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code for this work is available here: https://github.com/layer6ai-labs/T-Fixup (a hedged initialization sketch follows this table).
Open Datasets | Yes | We compare Transformer trained with our initialization against leading models on multiple public NMT benchmarks including IWSLT 14 De-En, WMT 17 En-De and low resource language pair WMT 18 Fi-En.
Dataset Splits | No | The paper mentions 'public NMT benchmarks' and shows 'validation curves' but does not provide specific percentages, sample counts, or citations for the train/validation/test splits used.
Hardware Specification | Yes | Training is done on an IBM server with 160 POWER9 CPUs, 600GB RAM and 4 Tesla V100 GPUs.
Software Dependencies | No | All experiments are done using the Fairseq library (Gehring et al., 2017). No version number is given for Fairseq or any other software dependency.
Experiment Setup | Yes | To stay consistent with previous work we train three model sizes: small 512-1024-4, base 512-2048-8 and big 1024-4096-16; where the numbers correspond to embedding dimension, MLP layer size and number of attention heads respectively. All models have 6 layers in both encoder and decoder... Hyper-parameters for each model are chosen through grid search and are listed in Appendix D. To demonstrate that our initialization works well with the Adam optimizer we use Adam for all experiments. (A configuration sketch follows this table.)
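
As a quick illustration of the quoted experiment setup, the minimal sketch below builds the three model sizes (small 512-1024-4, base 512-2048-8, big 1024-4096-16, each with 6 encoder and 6 decoder layers) and an Adam optimizer. It is a hedged stand-in only: it uses plain torch.nn.Transformer rather than the Fairseq models the paper trains, and the Adam hyper-parameters shown are placeholders, not the grid-searched values from Appendix D.

```python
# Sketch: the three model sizes named in the experiment setup, built with
# plain PyTorch (the paper itself uses Fairseq).
import torch
from torch import nn

# (embedding dimension, MLP/feed-forward size, attention heads); 6 layers each
MODEL_SIZES = {
    "small": (512, 1024, 4),
    "base": (512, 2048, 8),
    "big": (1024, 4096, 16),
}

def build_transformer(name: str) -> nn.Transformer:
    d_model, d_ff, n_heads = MODEL_SIZES[name]
    return nn.Transformer(
        d_model=d_model,
        nhead=n_heads,
        num_encoder_layers=6,   # "All models have 6 layers in both encoder and decoder"
        num_decoder_layers=6,
        dim_feedforward=d_ff,
    )

model = build_transformer("base")
# The paper trains with Adam; the learning rate and betas here are placeholders,
# not the grid-searched values listed in Appendix D.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
```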
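
The paper's actual contribution, the T-Fixup weight initialization, is not reproduced in this summary; the released rules are in the repository linked in the table. Purely to illustrate the general shape of such a scheme (Xavier initialization followed by a depth-dependent rescaling of weight matrices), here is a hedged sketch in which the scale factor is an explicit placeholder, not the paper's formula.

```python
# Illustrative only: Xavier init plus a depth-dependent rescaling of weight
# matrices. The placeholder scale is NOT the T-Fixup rule; see
# https://github.com/layer6ai-labs/T-Fixup for the actual initialization.
import torch
from torch import nn

def scaled_xavier_init_(model: nn.Module, num_layers: int) -> None:
    # Assumption: one uniform factor stands in for the paper's per-parameter
    # scaling; it only conveys that deeper models get smaller initial weights.
    placeholder_scale = num_layers ** -0.25
    for param in model.parameters():
        if param.dim() > 1:                     # weight matrices only
            nn.init.xavier_uniform_(param)      # baseline Xavier initialization
            with torch.no_grad():
                param.mul_(placeholder_scale)   # depth-dependent shrinkage (placeholder)

# Example usage on a 6-layer Transformer of the "base" width.
scaled_xavier_init_(nn.Transformer(d_model=512, nhead=8, dim_feedforward=2048), num_layers=6)
```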