Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
Authors: Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, Colin A. Raffel
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new PEFT method called (IA)³ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model [1] called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark [2], attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available. (See the (IA)³ sketch below the table.) |
| Researcher Affiliation | Academia | Department of Computer Science, University of North Carolina at Chapel Hill; {haokunl,dtredsox,muqeeth,craffel}@cs.unc.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All of the code used in our experiments is publicly available. (https://github.com/r-three/t-few) |
| Open Datasets | Yes | To test T0's generalization, Sanh et al. [1] chose a set of tasks (and corresponding datasets) to hold out from the multitask training mixture: specifically, sentence completion (COPA [37], H-SWAG [38], and Story Cloze [39] datasets), natural language inference (ANLI [40], CB [41], and RTE [42]), coreference resolution (WSC [43] and Winogrande [44]), and word sense disambiguation (WiC [45]). We also will later test T-Few's abilities in the RAFT benchmark [2] in section 4.3, a collection of unseen real-world few-shot tasks with no validation set and a held-out test set. We prompt examples using a randomly-sampled prompt template from P3 (Bach et al. [35]) for each example at each step. (See the P3 prompt-template sketch below the table.) |
| Dataset Splits | Yes | To ease comparison, we use the same number of few-shot training examples for each dataset as Brown et al. [4], which varies from 20 to 70. Unfortunately, the few-shot dataset subsets used by Brown et al. [4] have not been publicly disclosed. To allow for a more robust comparison, we therefore constructed five few-shot datasets by sampling subsets with different seeds and report the median and interquartile range. We prompt examples using a randomly-sampled prompt template from P3 (Bach et al. [35]) for each example at each step. Unless otherwise stated, we train our model for 1K steps with a batch size of 8 and report performance at the end of training. For all datasets, we report the accuracy on the test set or validation set when the test labels are not public (e.g. SuperGLUE datasets). In the main text, we report median accuracy across the nine datasets mentioned above. Detailed results on each dataset are provided in the appendices. (See the few-shot sampling sketch below the table.) |
| Hardware Specification | Yes | We also found that fine-tuning T0 with T-Few on a single dataset only takes about a half an hour on a single NVIDIA A100 GPU. As of writing, this would cost about $2 USD using Microsoft Azure. ... While not insignificant, this is only about 20 times larger than the FLOPs required to process a single example with few-shot ICL using GPT-3 175B. In other words, training T-Few costs as much as using GPT-3 175B to process 20 examples with few-shot ICL. ... However, as mentioned above, a single 80GB A100 GPU is enough for T-Few. ... We thank Brian Lester and Noah Constant for helpful discussion on debugging prompt tuning and Rabeeh Karimi Mahabadi for help with Compacter and Intrinsic SAID. We also thank Stella Biderman and the Google TPU Research Cloud who provided valuable computational resources to support this work. |
| Software Dependencies | No | The paper mentions using 'Hugging Face Transformers [36]' but does not specify its version number, nor does it list versions for other crucial software components like Python or PyTorch, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Unless otherwise stated, we train our model for 1K steps with a batch size of 8 and report performance at the end of training. We train for 1,000 steps with a batch size of 8 sequences using the Adafactor optimizer [49] with a learning rate of 3e-3 and a linear decay schedule with a 60-step warmup. We apply prompt templates to downstream datasets during training and inference to convert each example into an instructive text-to-text format. Importantly, we apply this recipe to every downstream dataset in exactly the same way without per-dataset hyperparameter tuning or modifications. (See the training-setup sketch below the table.) |
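
The (IA)³ method cited under Research Type rescales inner activations with learned vectors while the base model stays frozen. Below is a minimal PyTorch sketch of that idea; the module name, initialization, and placement are illustrative, and the authors' actual implementation is in the t-few repository linked above.

```python
import torch
import torch.nn as nn


class IA3Scaler(nn.Module):
    """Illustrative (IA)^3-style scaling vector.

    (IA)^3 learns a vector that rescales a set of inner activations
    elementwise (in the paper: the keys, values, and intermediate
    feed-forward activations of each Transformer block). Only these
    vectors are updated; the pretrained weights stay frozen.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Initialized to ones so the frozen model's behavior is unchanged
        # until fine-tuning updates the vector.
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # Elementwise rescaling, broadcast over the feature dimension.
        return activations * self.scale


# Hypothetical placement inside a frozen attention/feed-forward block:
#   keys   = ia3_k(keys)
#   values = ia3_v(values)
#   hidden = ia3_ff(hidden)
```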
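Both the Open Datasets and Dataset Splits rows quote the use of a randomly-sampled P3 prompt template per example at each step. A sketch of what that could look like with the promptsource library (which hosts P3) follows; the dataset choice (RTE) and the helper name are assumptions, and T-Few's exact sampling logic may differ.

```python
import random

from promptsource.templates import DatasetTemplates

# P3 templates for RTE, one of the held-out SuperGLUE tasks (assumed example).
rte_templates = DatasetTemplates("super_glue", "rte")


def prompt_example(example: dict, rng: random.Random):
    """Apply a randomly chosen P3 template to a single example.

    Returns the prompted input and target text, assuming the chosen
    template renders both (true for the RTE training templates).
    """
    name = rng.choice(rte_templates.all_template_names)
    input_text, target_text = rte_templates[name].apply(example)
    return input_text, target_text
```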
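The Dataset Splits row describes constructing five few-shot subsets with different seeds and reporting the median and interquartile range. Here is a sketch of that protocol with Hugging Face Datasets; the seed values and the 32-shot size are placeholders, since the paper only states that five seeds are used and that the number of shots (20 to 70) matches Brown et al.

```python
import numpy as np
from datasets import load_dataset

# Placeholder seeds: the paper samples five subsets but does not list the seeds.
SEEDS = [0, 1, 2, 3, 4]
NUM_SHOT = 32  # placeholder; the actual count is dataset-dependent (20 to 70)


def sample_few_shot_subset(split, num_shot: int, seed: int):
    """Sample a fixed-size few-shot training subset with a given seed."""
    return split.shuffle(seed=seed).select(range(num_shot))


def median_and_iqr(accuracies):
    """Summarize per-seed accuracies as median and interquartile range."""
    q25, median, q75 = np.percentile(accuracies, [25, 50, 75])
    return median, q75 - q25


# Example: RTE, one of the held-out SuperGLUE evaluation tasks.
rte_train = load_dataset("super_glue", "rte", split="train")
subsets = [sample_few_shot_subset(rte_train, NUM_SHOT, s) for s in SEEDS]
# One T-Few model would be fine-tuned per subset; the five resulting
# accuracies are then reported via median_and_iqr(...).
```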
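The Experiment Setup row quotes 1,000 training steps at batch size 8 with Adafactor, a 3e-3 learning rate, and a 60-step warmup followed by linear decay. The sketch below wires those hyperparameters together using Hugging Face Transformers' Adafactor and linear-warmup schedule; the placeholder parameters and loss stand in for the (IA)³ vectors and the T-Few training objective, and the authors' actual training loop is in the t-few repository.

```python
import torch
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

# Hyperparameters quoted in the paper.
NUM_STEPS = 1000
WARMUP_STEPS = 60
LEARNING_RATE = 3e-3
BATCH_SIZE = 8

# Placeholder trainable parameters; in T-Few these would be the (IA)^3
# scaling vectors of a frozen T0 model.
trainable_params = [torch.nn.Parameter(torch.ones(16))]

# Fixed learning rate, so Adafactor's relative-step heuristics are disabled.
optimizer = Adafactor(
    trainable_params,
    lr=LEARNING_RATE,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=NUM_STEPS
)

for step in range(NUM_STEPS):
    # Placeholder loss; in T-Few this is the training objective computed on a
    # batch of BATCH_SIZE prompted examples.
    loss = trainable_params[0].sum()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```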