Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

Authors: Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models.
Researcher Affiliation | Collaboration | Carnegie Mellon University and Google AI
Pseudocode | Yes | Algorithm 1: GradVac Update Rule (a hedged code sketch of this update follows the table).
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described method.
Open Datasets | Yes | We also experiment on publicly available dataset of WMT and obtain similar observations in Appendix C. The NER task is from the WikiAnn (Pan et al., 2017) dataset, which is built automatically from Wikipedia. In particular, the dataset we used is from the Universal Dependencies treebanks (Nivre et al., 2018).
Dataset Splits | No | The paper mentions using validation sets for specific analyses (e.g., on the WMT data) but does not describe how the main datasets are split into training, validation, and test sets, nor give explicit split percentages or counts needed for reproduction.
Hardware Specification | Yes | utilize data parallelism to train all models over 64 TPUv3 chips.
Software Dependencies | No | The paper mentions using an Adam optimizer and a SentencePiece model but does not specify software versions for the libraries or frameworks used in the implementation.
Experiment Setup | Yes | We use the Transformer-Big (Vaswani et al., 2017) architecture containing 375M parameters described in (Chen et al., 2018a)... We use an effective batch size of 500k tokens... We use a single Adam optimizer (Kingma & Ba, 2014) with default decay hyper-parameters. We warm up linearly for 30K steps to a learning rate of 1e-3, which is then decayed with the inverse square root of the number of training steps after warm-up... We set T=5 for most of our experiments. (A learning-rate schedule sketch also follows the table.)
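For reference, below is a minimal sketch of a GradVac-style update between two task gradients, in the spirit of the "Algorithm 1 GradVac Update Rule" cited above: an exponential moving average (EMA) of the pairwise cosine similarity serves as a target, and one gradient is nudged toward the other whenever the observed similarity falls below that target. The function name, the default beta, and the exact coefficient formula are our reconstruction for illustration, not code released by the authors.

```python
import numpy as np

def gradvac_pair_update(g_i, g_j, phi_hat, beta=0.01, eps=1e-12):
    """One GradVac-style step for a pair of task gradients (reconstruction, not the authors' code).

    g_i, g_j : flattened gradient vectors for tasks i and j
    phi_hat  : EMA estimate of the target cosine similarity for this task pair
    beta     : EMA decay rate (hypothetical default; the paper treats this as a hyper-parameter)

    Returns the adjusted g_i and the updated EMA target.
    """
    norm_i, norm_j = np.linalg.norm(g_i), np.linalg.norm(g_j)
    phi = float(np.dot(g_i, g_j) / (norm_i * norm_j + eps))  # observed cosine similarity

    if phi < phi_hat:
        # Add a multiple of g_j so that cos(g_i', g_j) moves up toward the EMA target phi_hat.
        coef = (norm_i * (phi_hat * np.sqrt(1.0 - phi ** 2)
                          - phi * np.sqrt(1.0 - phi_hat ** 2))) \
               / (norm_j * np.sqrt(1.0 - phi_hat ** 2) + eps)
        g_i = g_i + coef * g_j

    # Track the similarity actually observed at this step in the EMA target.
    phi_hat = (1.0 - beta) * phi_hat + beta * phi
    return g_i, phi_hat
```

In our reading of the paper, each task pair keeps its own EMA target (optionally per parameter group), and the adjusted per-task gradients are summed before the optimizer step.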
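The quoted learning-rate schedule (linear warm-up for 30K steps to a peak of 1e-3, then inverse-square-root decay) can be written down directly. This is a sketch of that schedule only, not the authors' training code, and the exact form of the post-warm-up decay is our interpretation of the quoted sentence.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=30_000):
    """Linear warm-up to peak_lr, then inverse-square-root decay (one reading of the quoted setup)."""
    if step < warmup_steps:
        # Linear warm-up from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Inverse-square-root decay after warm-up; continuous with the peak at step == warmup_steps.
    return peak_lr * (warmup_steps / step) ** 0.5
```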