Scale Efficiently: Insights from Pretraining and Finetuning Transformers
Authors: Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we conduct extensive experiments involving pre-training and fine-tuning over 200 transformer configurations ranging from 5M to 30B parameters. To the best of our knowledge, this is the largest empirical study of practical scaling of transformers to date that considers both upstream and practical downstream transfer. |
| Researcher Affiliation | Industry | Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler (Google Research & DeepMind) {yitay,dehghani}@google.com |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The checkpoints and code will be released at https://github.com/google-research/google-research/tree/master/scaling_transformers. The checkpoints are now publicly available at our Google Cloud Bucket gs://scenic-bucket/scaling_explorer/scaling_explorer. More recently, these checkpoints are also now available on Huggingface: https://huggingface.co/models?other=deep-narrow. (See the hedged checkpoint-loading sketch after this table.) |
| Open Datasets | Yes | We pretrain on the Colossal Cleaned Common Crawl Corpus (C4; Raffel et al., 2019). We finetune on a mixture of GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), SQuAD (Rajpurkar et al., 2016)... We pre-train ViT on the JFT dataset (Sun et al., 2017)... We evaluate our model on ImageNet 10-shot classification. |
| Dataset Splits | Yes | As a substantiating point and additional context to Figure 1, we also show via a counter-example that pretraining perplexity is not indicative of transfer performance, i.e., we explicitly show a case (in Table 3) where a model can have outstanding pre-training perplexity as measured by validation perplexity but substantially underdeliver when it comes to downstream performance. All of the downstream results are plotted with SuperGLUE accuracy (Wang et al., 2019) as the Y-axis. |
| Hardware Specification | Yes | We pretrain all our models for 2^19 steps using 16 TPU-v3 chips. For larger models, we run our models with 64 TPU-v3 chips. Finetuning is typically performed with 16 TPU-v3 chips. All models are trained with the same batch size using 64 TPU-v3 chips. |
| Software Dependencies | No | The paper mentions 'Mesh TensorFlow' and the 'T5 library' but does not specify version numbers for these software components, which is necessary for reproducibility. |
| Experiment Setup | Yes | We pretrain all our models for 2^19 steps using 16 TPU-v3 chips. This finetuning protocol uses a constant learning rate of 10^-3 and a batch size of 128 for all tasks. (The reported hyperparameters are collected in the config sketch after this table.) |
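
The Hugging Face checkpoints referenced in the Open Source Code row can be loaded with the `transformers` library. The snippet below is a minimal sketch, not the authors' pipeline: the model id `google/t5-efficient-base` is an assumed example taken from the deep-narrow collection linked above, and any other variant from that list should load the same way.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed example id; browse https://huggingface.co/models?other=deep-narrow
# for the full list of released deep-narrow configurations.
model_id = "google/t5-efficient-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# These checkpoints are pretrained only (no downstream finetuning), so they
# are intended as starting points for finetuning rather than task-ready models.
inputs = tokenizer("summarize: the study compares many transformer shapes ...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```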
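
For the Experiment Setup row, the sketch below collects the paper-reported numbers (2^19 pretraining steps, a constant finetuning learning rate of 10^-3, and a finetuning batch size of 128) into a small configuration. Only those numbers come from the paper; the authors used the T5/Mesh TensorFlow stack, so the `Seq2SeqTrainingArguments` mapping, the per-device batch split, and the finetuning step count are illustrative assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Paper-reported values quoted in the table above; everything else is assumed.
PAPER_SETTINGS = {
    "pretrain_steps": 2**19,         # 524,288 pretraining steps on C4
    "finetune_learning_rate": 1e-3,  # constant learning rate for all tasks
    "finetune_batch_size": 128,      # global finetuning batch size
}

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-deep-narrow-finetune",  # hypothetical output path
    per_device_train_batch_size=8,         # 16 devices x 8 = 128 global batch (assumed split)
    learning_rate=PAPER_SETTINGS["finetune_learning_rate"],
    lr_scheduler_type="constant",          # mirrors the constant-LR protocol
    max_steps=20_000,                      # finetuning length is not fixed by the quote; assumption
)
```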