Finetuned Language Models are Zero-Shot Learners

Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
Researcher Affiliation | Industry | Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le (Google Research)
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Source code for loading the instruction tuning dataset used for FLAN is publicly available at https://github.com/google-research/flan.
Open Datasets | Yes | We aggregate 62 text datasets that are publicly available on TensorFlow Datasets, including both language understanding and language generation tasks, into a single mixture. Figure 3 shows these datasets; each dataset is categorized into one of twelve task clusters, for which datasets in a given cluster are of the same task type. Descriptions, sizes, and examples of each dataset are shown in Appendix G.
Dataset Splits | Yes | Of the training set with 9,427 examples, we use 9,227 for train and 200 for dev. We use the TFDS validation set of 3,270 examples as our test set for reporting numbers.
Hardware Specification | Yes | This instruction tuning takes around 60 hours on a TPUv3 with 128 cores.
Software Dependencies | No | The paper mentions the 'SentencePiece library' and the 'Adafactor Optimizer' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset. To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k. We finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. The input and target sequence lengths used in finetuning are 1024 and 256, respectively. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token.
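
As a companion to the Open Datasets row, the sketch below loads a handful of TFDS datasets grouped into task clusters. The cluster and dataset names are illustrative placeholders only; the paper's actual mixture of 62 datasets and twelve clusters is listed in Appendix G.

```python
import tensorflow_datasets as tfds

# Illustrative placeholders: a few publicly available TFDS datasets grouped
# into task clusters, standing in for the paper's 62 datasets and twelve
# clusters (see Appendix G of the paper for the real list).
TASK_CLUSTERS = {
    "natural_language_inference": ["glue/rte", "snli", "multi_nli"],
    "reading_comprehension": ["super_glue/boolq", "squad/v1.1"],
    "summarization": ["cnn_dailymail"],
}

# Load each dataset's training split once so a mixing step can sample from it.
loaded = {
    name: tfds.load(name, split="train")
    for names in TASK_CLUSTERS.values()
    for name in names
}
```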
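For the Dataset Splits row, the quoted split construction (9,427 training examples carved into 9,227 train and 200 dev, with the 3,270-example TFDS validation split used as test) can be expressed with TFDS subsplit slicing. The split sizes happen to match super_glue/boolq on TFDS; that dataset name is an assumption used only to illustrate the slicing, not a claim about which dataset the quote refers to.

```python
import tensorflow_datasets as tfds

# Assumed dataset name for illustration; the quoted sizes match super_glue/boolq.
train = tfds.load("super_glue/boolq", split="train[:9227]")  # 9,227 examples for train
dev = tfds.load("super_glue/boolq", split="train[9227:]")    # remaining 200 for dev
test = tfds.load("super_glue/boolq", split="validation")     # 3,270 examples used as test
```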
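The Experiment Setup row describes examples-proportional mixing with a 30k per-dataset example cap and a 3k mixing-rate maximum. Below is a minimal sketch of that weighting, assuming a `loaded` dict of tf.data datasets (as in the first sketch) and a `sizes` dict of per-dataset example counts; the function and variable names are hypothetical. The remaining quoted hyperparameters (30k gradient steps, batch size of 8,192 tokens, Adafactor with learning rate 3e-5, 1024/256 sequence lengths, packing with an EOS separator) are noted in comments but not implemented here.

```python
import tensorflow as tf

# Caps taken from the quoted setup; everything else is an illustrative sketch,
# not the authors' released pipeline.
MAX_EXAMPLES_PER_DATASET = 30_000   # per-dataset training-example cap
MIXING_RATE_MAX = 3_000             # examples-proportional mixing rate maximum
# Other quoted settings (not implemented below): 30k gradient steps, batch size
# of 8,192 tokens, Adafactor with learning rate 3e-5, input/target lengths of
# 1024/256, and packing with a special EOS token separating inputs from targets.

def mixing_weights(sizes):
    """Examples-proportional weights with a rate maximum (Raffel et al., 2020)."""
    rates = {
        name: min(n, MAX_EXAMPLES_PER_DATASET, MIXING_RATE_MAX)
        for name, n in sizes.items()
    }
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}

def build_mixture(loaded, sizes, seed=0):
    """Sample from the capped datasets in proportion to their mixing weights."""
    names = sorted(loaded)
    weights = mixing_weights(sizes)
    capped = [loaded[n].take(MAX_EXAMPLES_PER_DATASET).repeat() for n in names]
    return tf.data.Dataset.sample_from_datasets(
        capped, weights=[weights[n] for n in names], seed=seed)
```

Capping the mixing rate keeps the largest datasets from dominating the mixture, while the 30k example cap bounds how much of any single dataset is seen during the 30k finetuning steps.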