ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
Authors: Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, Donald Metzler
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training. |
| Researcher Affiliation | Industry | Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, Donald Metzler. Google Research, DeepMind. {aribandi, yitay}@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All of the modeling and training code used for ExT5 and its variants is already open-sourced as a part of the Mesh TensorFlow (Shazeer et al., 2018) and T5 (Raffel et al., 2020) libraries. |
| Open Datasets | Yes | Additionally, ExMix is composed of datasets that are already publicly available. (Referring to Table 11 and its subsequent text, which lists datasets with citations such as Wang et al. (2019b) and See et al. (2017).) |
| Dataset Splits | Yes | We report test set results on all datasets except CommonGen and ToTTo, on which we report validation scores. ... For each dataset, we select the best model checkpoint using the average of BLEU, ROUGE-1, ROUGE-2 and ROUGE-L scores on the validation set. (A sketch of this checkpoint-selection rule appears below the table.) |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, TPU version) used for running the experiments. It mentions training for '1M total steps with a batch size of 2048' but no hardware specifics. |
| Software Dependencies | No | Our models were trained using Mesh TensorFlow (Shazeer et al., 2018) using the T5 library (Raffel et al., 2020). While software names are mentioned, specific version numbers for these libraries are not provided. |
| Experiment Setup | Yes | We pre-train our models for 1M total steps with a batch size of 2048 and sequence length 512... For optimization, we use Adafactor with an inverse square root learning rate schedule that kicks in after a constant phase of 0.01 for 10k steps. ...Fine-tuning... ExT5 generally benefited from a smaller learning rate while fine-tuning (10^-4 worked well for ExT5 vs 10^-3 for T5 variants). (A minimal sketch of this learning-rate schedule appears below the table.) |
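
To make the quoted pre-training schedule concrete, here is a minimal Python sketch, assuming the standard T5-style inverse square-root schedule: the rate is held at 1/sqrt(10,000) = 0.01 for the first 10k steps and then decays as 1/sqrt(step). The function name and the assertions are illustrative and not taken from the ExT5 codebase.

```python
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square-root learning-rate schedule with a constant phase.

    For step <= warmup_steps the rate stays at 1/sqrt(warmup_steps), i.e.
    0.01 for the 10k-step phase quoted above; afterwards it decays as
    1/sqrt(step).
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))

# Constant phase at 0.01 for the first 10k steps, then inverse-sqrt decay.
assert abs(inverse_sqrt_lr(1) - 0.01) < 1e-12
assert abs(inverse_sqrt_lr(10_000) - 0.01) < 1e-12
assert abs(inverse_sqrt_lr(1_000_000) - 0.001) < 1e-12  # decayed by 1M steps
```

The fine-tuning rates quoted in the row (10^-4 for ExT5, 10^-3 for T5 variants) would be used in place of this decaying schedule during fine-tuning; the paper excerpt does not specify a fine-tuning schedule beyond those values.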
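
The Dataset Splits row describes selecting the best checkpoint by the average of BLEU, ROUGE-1, ROUGE-2 and ROUGE-L on the validation set. Below is a minimal sketch of that selection rule; the dictionary layout, metric keys, and example scores are hypothetical and only illustrate the averaging-and-argmax step.

```python
def select_best_checkpoint(val_metrics: dict[int, dict[str, float]]) -> int:
    """Return the checkpoint step with the highest mean of the four metrics.

    `val_metrics` maps a checkpoint step to its validation scores under the
    (hypothetical) keys "bleu", "rouge1", "rouge2", and "rougeL".
    """
    keys = ("bleu", "rouge1", "rouge2", "rougeL")

    def mean_score(step: int) -> float:
        scores = val_metrics[step]
        return sum(scores[k] for k in keys) / len(keys)

    return max(val_metrics, key=mean_score)

# Hypothetical validation scores for three checkpoints on one dataset.
val_metrics = {
    100_000: {"bleu": 18.2, "rouge1": 41.0, "rouge2": 19.3, "rougeL": 38.5},
    200_000: {"bleu": 19.0, "rouge1": 42.1, "rouge2": 20.0, "rougeL": 39.2},
    300_000: {"bleu": 18.7, "rouge1": 41.8, "rouge2": 19.8, "rougeL": 39.0},
}
print(select_best_checkpoint(val_metrics))  # -> 200000
```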