Finetuned Language Models are Zero-Shot Learners

Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
Researcher Affiliation | Industry | Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le (Google Research)
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Source code for loading the instruction tuning dataset used for FLAN is publicly available at https://github.com/google-research/flan.
Open Datasets | Yes | We aggregate 62 text datasets that are publicly available on TensorFlow Datasets, including both language understanding and language generation tasks, into a single mixture. Figure 3 shows these datasets; each dataset is categorized into one of twelve task clusters, for which datasets in a given cluster are of the same task type. Descriptions, sizes, and examples of each dataset are shown in Appendix G.
Dataset Splits | Yes | Of the training set with 9,427 examples, we use 9,227 for train and 200 for dev. We use the TFDS validation set of 3,270 examples as our test set for reporting numbers.
Hardware Specification | Yes | This instruction tuning takes around 60 hours on a TPUv3 with 128 cores.
Software Dependencies | No | The paper mentions the 'SentencePiece library' and the 'Adafactor Optimizer' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset. To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k. We finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. The input and target sequence lengths used in finetuning are 1024 and 256, respectively. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token.
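
As a companion to the Open Datasets row, the sketch below loads a handful of TFDS datasets grouped into task clusters. The cluster and dataset names are illustrative placeholders only; the paper's actual mixture of 62 datasets and twelve clusters is listed in Appendix G.

```python
import tensorflow_datasets as tfds

# Illustrative placeholders: a few publicly available TFDS datasets grouped
# into task clusters, standing in for the paper's 62 datasets and twelve
# clusters (see Appendix G of the paper for the real list).
TASK_CLUSTERS = {
    "natural_language_inference": ["glue/rte", "snli", "multi_nli"],
    "reading_comprehension": ["super_glue/boolq", "squad/v1.1"],
    "summarization": ["cnn_dailymail"],
}

# Load each dataset's training split once so a mixing step can sample from it.
loaded = {
    name: tfds.load(name, split="train")
    for names in TASK_CLUSTERS.values()
    for name in names
}
```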
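For the Dataset Splits row, the quoted split construction (9,427 training examples carved into 9,227 train and 200 dev, with the 3,270-example TFDS validation split used as test) can be expressed with TFDS subsplit slicing. The split sizes happen to match super_glue/boolq on TFDS; that dataset name is an assumption used only to illustrate the slicing, not a claim about which dataset the quote refers to.

```python
import tensorflow_datasets as tfds

# Assumed dataset name for illustration; the quoted sizes match super_glue/boolq.
train = tfds.load("super_glue/boolq", split="train[:9227]")  # 9,227 examples for train
dev = tfds.load("super_glue/boolq", split="train[9227:]")    # remaining 200 for dev
test = tfds.load("super_glue/boolq", split="validation")     # 3,270 examples used as test
```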
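The Experiment Setup row describes examples-proportional mixing with a 30k per-dataset example cap and a 3k mixing-rate maximum. Below is a minimal sketch of that weighting, assuming a `loaded` dict of tf.data datasets (as in the first sketch) and a `sizes` dict of per-dataset example counts; the function and variable names are hypothetical. The remaining quoted hyperparameters (30k gradient steps, batch size of 8,192 tokens, Adafactor with learning rate 3e-5, 1024/256 sequence lengths, packing with an EOS separator) are noted in comments but not implemented here.

```python
import tensorflow as tf

# Caps taken from the quoted setup; everything else is an illustrative sketch,
# not the authors' released pipeline.
MAX_EXAMPLES_PER_DATASET = 30_000   # per-dataset training-example cap
MIXING_RATE_MAX = 3_000             # examples-proportional mixing rate maximum
# Other quoted settings (not implemented below): 30k gradient steps, batch size
# of 8,192 tokens, Adafactor with learning rate 3e-5, input/target lengths of
# 1024/256, and packing with a special EOS token separating inputs from targets.

def mixing_weights(sizes):
    """Examples-proportional weights with a rate maximum (Raffel et al., 2020)."""
    rates = {
        name: min(n, MAX_EXAMPLES_PER_DATASET, MIXING_RATE_MAX)
        for name, n in sizes.items()
    }
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}

def build_mixture(loaded, sizes, seed=0):
    """Sample from the capped datasets in proportion to their mixing weights."""
    names = sorted(loaded)
    weights = mixing_weights(sizes)
    capped = [loaded[n].take(MAX_EXAMPLES_PER_DATASET).repeat() for n in names]
    return tf.data.Dataset.sample_from_datasets(
        capped, weights=[weights[n] for n in names], seed=seed)
```

Capping the mixing rate keeps the largest datasets from dominating the mixture, while the 30k example cap bounds how much of any single dataset is seen during the 30k finetuning steps.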