The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Authors: Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, Adam Roberts
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. |
| Researcher Affiliation | Collaboration | ¹Media Lab, Massachusetts Institute of Technology, Cambridge, USA; ²Google, Mountain View, USA. Correspondence to: Shayne Longpre <slongpre@media.mit.edu>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data generation code available at: https://github.com/google-research/FLAN/tree/main/flan/v2. Generation code allows users to vary mixture rates, templates, prompt types and data augmentation techniques, for faster public research. (See the mixture-rate sketch below the table.) |
| Open Datasets | Yes | To accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available. |
| Dataset Splits | Yes | We evaluate on (a) a suite of 8 Held-In tasks represented within the 1800+ training task collection (4 question answering and 4 natural language inference validation sets), (b) Chain-of-Thought (CoT) tasks (5 validation sets), and (c) the MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022) benchmarks as our set of Held-Out tasks. (See the evaluation-suite summary below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software components with their version numbers (e.g., Python, PyTorch, CUDA, or specialized solvers with versions) that would be needed to replicate the experiment environment. |
| Experiment Setup | Yes | For single-task finetuning, described in Section 4, our models are finetuned for 100,000 steps for all tasks. We use a constant learning rate of 0.001, a dropout probability of 0.1, and a batch size of 128 length-512 sequences. We save a checkpoint every 20 steps and report test performance on the model checkpoint corresponding to the highest validation performance. (See the finetuning-config sketch below the table.) |
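
The generation code referenced in the Open Source Code row lets users vary mixture rates, templates, prompt types, and data augmentation techniques. As a minimal sketch of the underlying idea, proportional sampling over templated tasks, the snippet below uses invented task names, weights, and a hypothetical `render` helper; it is not the repository's actual API.

```python
import random

# Hypothetical sketch of mixture-rate sampling over templated tasks.
# Task names, weights, and templates here are invented; the released
# Flan v2 generation code defines its own task registry and mixing logic.
TASKS = {
    "bool_q": {
        "weight": 2.0,
        "template": "Passage: {passage}\nQuestion: {question}\nAnswer yes or no.",
    },
    "anli_r1": {
        "weight": 1.0,
        "template": "Premise: {premise}\nHypothesis: {hypothesis}\nIs the hypothesis entailed?",
    },
}


def sample_task(tasks, rng=random):
    """Pick a task name with probability proportional to its mixture weight."""
    names = list(tasks)
    weights = [tasks[name]["weight"] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def render(tasks, name, example):
    """Format one raw example with the chosen task's instruction template."""
    return tasks[name]["template"].format(**example)


if __name__ == "__main__":
    example = {
        "passage": "The sky appears blue on clear days.",
        "question": "Is the sky blue?",
        "premise": "A dog runs across the park.",
        "hypothesis": "An animal is moving.",
    }
    name = sample_task(TASKS)
    print(f"[{name}]\n{render(TASKS, name, example)}")
```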
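
The Dataset Splits row enumerates three evaluation suites. The dictionary below simply restates that breakdown as a config-style summary; the suite contents come from the row, while the dictionary layout itself is an illustrative assumption.

```python
# Config-style summary of the evaluation suites described in the paper.
# Only the suite contents come from the Dataset Splits row; the structure is illustrative.
EVAL_SUITES = {
    "held_in": {  # 8 validation sets drawn from the 1800+ training tasks
        "question_answering": 4,
        "natural_language_inference": 4,
    },
    "chain_of_thought": {"validation_sets": 5},
    "held_out": ["MMLU", "BBH"],  # Hendrycks et al., 2020; Suzgun et al., 2022
}
```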
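
The Experiment Setup row fixes the single-task finetuning hyperparameters. The sketch below collects those reported values in a plain config dictionary and adds a hypothetical checkpoint-selection helper mirroring the "best validation checkpoint" protocol; it is not the authors' training script.

```python
# Single-task finetuning hyperparameters as reported in the paper.
FINETUNE_CONFIG = {
    "train_steps": 100_000,
    "learning_rate": 1e-3,   # constant schedule
    "dropout_rate": 0.1,
    "batch_size": 128,       # sequences per batch
    "sequence_length": 512,  # tokens per sequence
    "checkpoint_every": 20,  # steps between saved checkpoints
}


def best_checkpoint(validation_scores):
    """Return the step whose checkpoint scored highest on validation,
    mirroring the paper's protocol of reporting test performance at the
    best validation checkpoint (higher metric assumed better)."""
    return max(validation_scores, key=validation_scores.get)


if __name__ == "__main__":
    toy_scores = {20: 0.61, 40: 0.64, 60: 0.63}  # step -> validation metric (toy values)
    print("Report test metrics for the checkpoint at step", best_checkpoint(toy_scores))
```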