Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models
Authors: Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. |
| Researcher Affiliation | Collaboration | Carnegie Mellon University; Google AI |
| Pseudocode | Yes | Algorithm 1: GradVac Update Rule (a hedged sketch of this update rule is given below the table). |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology. |
| Open Datasets | Yes | We also experiment on the publicly available WMT dataset and obtain similar observations in Appendix C. The NER task is from the WikiAnn (Pan et al., 2017) dataset, which is built automatically from Wikipedia. In particular, the dataset we used is from the Universal Dependencies treebanks (Nivre et al., 2018). |
| Dataset Splits | No | The paper mentions using 'validation sets' for specific analyses (e.g., WMT dataset) but does not provide details on how the main datasets are split into training, validation, and testing sets, or explicit split percentages/counts for reproduction. |
| Hardware Specification | Yes | utilize data parallelism to train all models over 64 TPUv3 chips. |
| Software Dependencies | No | The paper mentions using an Adam optimizer and SentencePiece Model, but does not specify software versions for libraries or frameworks used in implementation. |
| Experiment Setup | Yes | We use the Transformer-Big (Vaswani et al., 2017) architecture containing 375M parameters described in (Chen et al., 2018a)... We use an effective batch size of 500k tokens... We use a single Adam optimizer (Kingma & Ba, 2014) with default decay hyper-parameters. We warm up linearly for 30K steps to a learning rate of 1e-3, which is then decayed with the inverse square root of the number of training steps after warm-up... We set T=5 for most of our experiments. (A sketch of this learning-rate schedule follows the table.) |
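
The GradVac update rule referenced in the Pseudocode row works pairwise on task gradients: for each pair, the method tracks an exponential moving average (EMA) of their cosine similarity and, whenever the observed similarity falls below that EMA target, rotates one gradient toward the other until the target similarity is reached. The NumPy sketch below illustrates this logic under stated assumptions; the function name `gradvac_update`, the flattened-gradient representation, and the default `beta` value are illustrative choices rather than the authors' released code.

```python
import numpy as np

def gradvac_update(grads, phi_ema, beta=0.01, eps=1e-8):
    """Hedged sketch of a GradVac-style update (cf. Algorithm 1).

    grads:   list of flattened per-task gradient vectors (np.ndarray)
    phi_ema: (num_tasks x num_tasks) array of EMA cosine-similarity targets,
             updated in place
    beta:    EMA decay rate for the similarity targets (hyper-parameter)
    Returns the aggregated gradient after pairwise adjustment.
    """
    num_tasks = len(grads)
    adjusted = [g.copy() for g in grads]
    for i in range(num_tasks):
        for j in np.random.permutation(num_tasks):
            if i == j:
                continue
            gi, gj = adjusted[i], grads[j]
            phi = gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj) + eps)
            target = phi_ema[i, j]
            if phi < target:
                # Rotate g_i toward g_j so that cos(g_i', g_j) matches the EMA target.
                coef = (np.linalg.norm(gi)
                        * (target * np.sqrt(1 - phi ** 2) - phi * np.sqrt(1 - target ** 2))
                        / (np.linalg.norm(gj) * np.sqrt(1 - target ** 2) + eps))
                adjusted[i] = gi + coef * gj
            # Update the exponential moving average of the observed similarity.
            phi_ema[i, j] = (1 - beta) * phi_ema[i, j] + beta * phi
    return np.sum(adjusted, axis=0)
```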
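
For the Experiment Setup row, the described optimizer schedule (linear warm-up over 30K steps to a peak learning rate of 1e-3, then decay with the inverse square root of the step count) corresponds to a schedule like the one sketched below. The function name and the exact post-warm-up scaling are assumptions based on the standard Transformer schedule, not taken verbatim from the paper.

```python
def learning_rate(step: int, warmup_steps: int = 30_000, peak_lr: float = 1e-3) -> float:
    """Linear warm-up to peak_lr, then inverse-square-root decay after warm-up."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```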