True Few-Shot Learning with Language Models

Authors: Ethan Perez, Douwe Kiela, Kyunghyun Cho

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we evaluate the few-shot ability of LMs when such held-out examples are unavailable, a setting we call true few-shot learning. We test two model selection criteria, cross-validation and minimum description length, for choosing LM prompts and hyperparameters in the true few-shot setting.
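The prompt-selection loop described in the quote above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `lm_loss(prompt, context_examples, test_example)` is a hypothetical stand-in for an LM forward pass that conditions on the prompt plus the given training examples and returns the loss on the held-out example.

```python
def cross_validation_score(prompt, examples, lm_loss, k=None):
    """Average held-out loss of `prompt` under K-fold CV over the few-shot
    examples; K = N (the default here) gives leave-one-out CV."""
    n = len(examples)
    k = k or n  # default to leave-one-out
    fold_size = n // k
    total = 0.0
    for i in range(k):
        val = examples[i * fold_size:(i + 1) * fold_size]
        train = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        total += sum(lm_loss(prompt, train, ex) for ex in val)
    return total / n

def select_prompt(prompts, examples, lm_loss):
    """True few-shot prompt selection: pick the prompt with the lowest CV
    loss, using only the N training examples (no held-out validation set)."""
    return min(prompts, key=lambda p: cross_validation_score(p, examples, lm_loss))
```

The key point of the setting is that `examples` is only the few-shot training set; no separate validation examples are consulted.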
Researcher Affiliation | Collaboration | Ethan Perez (1), Douwe Kiela (2), Kyunghyun Cho (1,3) — (1) New York University, (2) Facebook AI Research, (3) CIFAR Fellow in Learning in Machines & Brains; perez@nyu.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The supplementary material contains the code to reproduce all results and plots in our paper.
Open Datasets | Yes | In what follows, we test on LAMA [17], a benchmark for retrieving facts with LMs... We evaluate the accuracy of GPT models when using prompts chosen by CV, MDL, and test accuracy, as we did for LAMA. For each task, we evaluate held-out accuracy using the full validation set when using 5 training examples randomly sampled from the task train set... We choose two hyperparameters in a true few-shot manner: the early stopping checkpoint and the fraction of words masked for the masked LM objective. ADAPET performs T = 1000 gradient updates on batches of 16 examples and chooses the checkpoint at T ∈ {250, 500, 750, 1000} with the highest validation accuracy. ADAPET also chooses the best masking fraction M ∈ {0.075, 0.10, 0.105, 0.15}. Following ADAPET, we evaluate on SuperGLUE [62], a suite of 8 NLP tasks.
Dataset Splits | Yes | CV randomly partitions D_train into K equally-sized folds F(D_train)_1, ..., F(D_train)_K and evaluates the average loss on a validation fold F(D_train)_k when training on the remaining data F(D_train)_{-k}... We use K = N folds (where N is the number of training examples) for both MDL and CV (here, LOOCV). Here, N-fold CV requires N forward passes to evaluate the loss on each of the N examples when conditioning on the N − 1 other examples. ... We use N = 5 examples to limit N!.
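The second selection criterion, minimum description length, is what makes the N! term in the quote above relevant: the "online code" sums the loss of predicting each example conditioned on the examples that precede it, and an exact estimate averages over every ordering of the N examples. A minimal sketch, again assuming a hypothetical `lm_loss(prompt, context_examples, target)` and enumerating all N! orderings for clarity (the cost the paper caps by using N = 5):

```python
import itertools

def mdl_score(prompt, examples, lm_loss):
    """Description length of the labels under `prompt`, estimated with an
    online code: for each ordering, predict example i conditioned on the
    i-1 examples before it, sum the losses, then average over orderings."""
    n = len(examples)
    total, count = 0.0, 0
    for order in itertools.permutations(examples):
        total += sum(lm_loss(prompt, list(order[:i]), order[i]) for i in range(n))
        count += 1
    return total / count
```

In practice one would sample a subset of orderings rather than enumerate all N!, but the exhaustive form above shows why the cost grows factorially in N.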
Hardware Specification | Yes | We detail all computation used for our experiments, in terms of GPU time and number of GPUs. Table 3 lists the GPU (and number of GPUs) used for each model. For example, GPT-3 175B requires 1120 A100-80GB GPU hours. We used a cluster of NVIDIA DGX-A100 machines.
Software Dependencies | No | The paper mentions 'Hugging Face Transformers [59] via PyTorch [60]' but does not provide explicit version numbers for these software components in the text.
Experiment Setup | Yes | We test the 5-shot accuracy of 9 popular LMs... We use N = 5 examples to limit N!. ... We choose two hyperparameters in a true few-shot manner: the early stopping checkpoint and the fraction of words masked for the masked LM objective. ADAPET performs T = 1000 gradient updates on batches of 16 examples and chooses the checkpoint at T ∈ {250, 500, 750, 1000} with the highest validation accuracy. ADAPET also chooses the best masking fraction M ∈ {0.075, 0.10, 0.105, 0.15}.
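The hyperparameter side of this setup — choosing the checkpoint T and masking fraction M without a held-out validation set — amounts to a small grid search scored by a train-set-only criterion. A hedged sketch, where `cv_score(T, M, train_examples)` is a hypothetical function returning a cross-validation (or MDL) loss computed over the few-shot training examples alone:

```python
import itertools

def true_few_shot_grid_search(train_examples, cv_score,
                              checkpoints=(250, 500, 750, 1000),
                              mask_fractions=(0.075, 0.10, 0.105, 0.15)):
    """Choose ADAPET's two hyperparameters in a true few-shot manner:
    score every (T, M) pair with a criterion computed only on the training
    examples, and return the pair with the lowest loss."""
    return min(itertools.product(checkpoints, mask_fractions),
               key=lambda tm: cv_score(tm[0], tm[1], train_examples))
```

The grids above mirror the values quoted from the paper; the difference from standard ADAPET is only that `cv_score` replaces validation-set accuracy as the selection signal.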