Finetuned Language Models are Zero-Shot Learners
Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning. |
| Researcher Affiliation | Industry | Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le (Google Research) |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Source code for loading the instruction tuning dataset used for FLAN is publicly available at https://github.com/google-research/flan. |
| Open Datasets | Yes | We aggregate 62 text datasets that are publicly available on Tensorflow Datasets, including both language understanding and language generation tasks, into a single mixture. Figure 3 shows these datasets; each dataset is categorized into one of twelve task clusters, for which datasets in a given cluster are of the same task type. Descriptions, sizes, and examples of each dataset are shown in Appendix G. (A dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | Of the training set with 9,427 examples, we use 9,227 for train and 200 for dev. We use the TFDS validation set of 3,270 examples as our test set for reporting numbers. |
| Hardware Specification | Yes | This instruction tuning takes around 60 hours on a TPUv3 with 128 cores. |
| Software Dependencies | No | The paper mentions the 'SentencePiece library' and the 'Adafactor Optimizer' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset. To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k. We finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. The input and target sequence lengths used in finetuning are 1024 and 256, respectively. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token. (A sketch of the mixing computation follows the table.) |
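
The paper's task mixture is built from datasets hosted on Tensorflow Datasets, with the official task definitions in the google-research/flan repository. The sketch below is not the authors' pipeline; it only illustrates pulling a few of the referenced datasets through the `tensorflow_datasets` API, and the dataset names and split strings are illustrative assumptions.

```python
# Minimal sketch, not the FLAN pipeline: load a few of the publicly available
# TFDS datasets referenced in the paper. Dataset names and split strings are
# illustrative assumptions; the official task definitions live at
# https://github.com/google-research/flan.
import tensorflow_datasets as tfds

# Illustrative subset of the 62 datasets in the paper's mixture.
ILLUSTRATIVE_DATASETS = ["bool_q", "openbookqa", "piqa"]

def load_tfds_datasets(names, split="train"):
    """Return a dict mapping dataset name -> tf.data.Dataset for the given split."""
    return {name: tfds.load(name, split=split) for name in names}

if __name__ == "__main__":
    datasets = load_tfds_datasets(ILLUSTRATIVE_DATASETS)
    for name, ds in datasets.items():
        print(name, ds.element_spec)
```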
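
As a reading aid for the mixing scheme quoted in the Experiment Setup row, here is a minimal sketch of examples-proportional mixing with a rate maximum, under the assumption (following Raffel et al., 2020 and the paper's description) that each dataset's sampling weight is proportional to its example count clipped at 3k, after the 30k per-dataset example cap. The helper name and dataset sizes are hypothetical.

```python
# Minimal sketch of examples-proportional mixing with a mixing rate maximum,
# per the paper's description of the Raffel et al. (2020) scheme. The helper
# name and the dataset sizes below are hypothetical, for illustration only.
EXAMPLE_CAP = 30_000      # per-dataset limit on training examples
MIXING_RATE_MAX = 3_000   # "mixing rate maximum of 3k"

def mixing_probabilities(dataset_sizes):
    """Map dataset name -> sampling probability.

    Assumption: weight_d = min(examples_d, MIXING_RATE_MAX), normalized over
    all datasets, where examples_d has already been truncated to EXAMPLE_CAP.
    """
    weights = {name: min(size, MIXING_RATE_MAX) for name, size in dataset_sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical training-set sizes for three datasets in the mixture.
sizes = {"rte": 2_490, "bool_q": 9_427, "anli_r1": 16_946}
sizes = {name: min(size, EXAMPLE_CAP) for name, size in sizes.items()}
print(mixing_probabilities(sizes))
# bool_q and anli_r1 both hit the 3k rate maximum, so they receive equal
# sampling weight; rte receives proportionally less.
```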