Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models
Authors: Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. |
| Researcher Affiliation | Collaboration | Carnegie Mellon University; Google AI |
| Pseudocode | Yes | Algorithm 1: GradVac Update Rule (a hedged sketch of this update rule is given below the table). |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology. |
| Open Datasets | Yes | We also experiment on the publicly available WMT dataset and obtain similar observations in Appendix C. The NER task is from the WikiAnn (Pan et al., 2017) dataset, which is built automatically from Wikipedia. In particular, the dataset we used is from the Universal Dependencies treebanks (Nivre et al., 2018). |
| Dataset Splits | No | The paper mentions using 'validation sets' for specific analyses (e.g., WMT dataset) but does not provide details on how the main datasets are split into training, validation, and testing sets, or explicit split percentages/counts for reproduction. |
| Hardware Specification | Yes | utilize data parallelism to train all models over 64 TPUv3 chips. |
| Software Dependencies | No | The paper mentions using an Adam optimizer and SentencePiece Model, but does not specify software versions for libraries or frameworks used in implementation. |
| Experiment Setup | Yes | We use the Transformer-Big (Vaswani et al., 2017) architecture containing 375M parameters described in (Chen et al., 2018a)... We use an effective batch size of 500k tokens... We use a single Adam optimizer (Kingma & Ba, 2014) with default decay hyper-parameters. We warm up linearly for 30K steps to a learning rate of 1e-3, which is then decayed with the inverse square root of the number of training steps after warm-up... We set T=5 for most of our experiments. (A sketch of this learning-rate schedule follows the table.) |
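
The GradVac update rule referenced in the Pseudocode row works pairwise on task gradients: for each pair, the method tracks an exponential moving average (EMA) of their cosine similarity and, whenever the observed similarity falls below that EMA target, rotates one gradient toward the other until the target similarity is reached. The NumPy sketch below illustrates this logic under stated assumptions; the function name `gradvac_update`, the flattened-gradient representation, and the default `beta` value are illustrative choices rather than the authors' released code.

```python
import numpy as np

def gradvac_update(grads, phi_ema, beta=0.01, eps=1e-8):
    """Hedged sketch of a GradVac-style update (cf. Algorithm 1).

    grads:   list of flattened per-task gradient vectors (np.ndarray)
    phi_ema: (num_tasks x num_tasks) array of EMA cosine-similarity targets,
             updated in place
    beta:    EMA decay rate for the similarity targets (hyper-parameter)
    Returns the aggregated gradient after pairwise adjustment.
    """
    num_tasks = len(grads)
    adjusted = [g.copy() for g in grads]
    for i in range(num_tasks):
        for j in np.random.permutation(num_tasks):
            if i == j:
                continue
            gi, gj = adjusted[i], grads[j]
            phi = gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj) + eps)
            target = phi_ema[i, j]
            if phi < target:
                # Rotate g_i toward g_j so that cos(g_i', g_j) matches the EMA target.
                coef = (np.linalg.norm(gi)
                        * (target * np.sqrt(1 - phi ** 2) - phi * np.sqrt(1 - target ** 2))
                        / (np.linalg.norm(gj) * np.sqrt(1 - target ** 2) + eps))
                adjusted[i] = gi + coef * gj
            # Update the exponential moving average of the observed similarity.
            phi_ema[i, j] = (1 - beta) * phi_ema[i, j] + beta * phi
    return np.sum(adjusted, axis=0)
```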
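
For the Experiment Setup row, the described optimizer schedule (linear warm-up over 30K steps to a peak learning rate of 1e-3, then decay with the inverse square root of the step count) corresponds to a schedule like the one sketched below. The function name and the exact post-warm-up scaling are assumptions based on the standard Transformer schedule, not taken verbatim from the paper.

```python
def learning_rate(step: int, warmup_steps: int = 30_000, peak_lr: float = 1e-3) -> float:
    """Linear warm-up to peak_lr, then inverse-square-root decay after warm-up."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```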