Language Models are Few-Shot Learners

Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. (A sketch of this few-shot, prompt-only protocol follows the table.)
Researcher Affiliation | Collaboration | Equal contribution; Johns Hopkins University, OpenAI
Pseudocode | No | The paper describes methods and processes in narrative text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions 'Our release repository contains uncurated unconditional samples.' but does not provide a specific URL or an explicit statement of releasing the code for the methodology. It also mentions 'GPT-3 GitHub' in the context of training data, not open-source code for the model itself.
Open Datasets | Yes | To create our training data, we (1) downloaded and filtered a version of Common Crawl [RSR+19] [https://commoncrawl.org/the-data/]... These reference corpora include an expanded version of the WebText dataset [RWC+19]... two internet-based books corpora (Books1 and Books2) and English-language Wikipedia (details in the appendix). For evaluation: Penn Tree Bank (PTB) [MKM+94], LAMBADA dataset [PKL+16], HellaSwag dataset [ZHB+19], StoryCloze 2016 dataset [MCH+16], TriviaQA [JCWZ17], Natural Questions (NQs) [KPR+19], ARC [CCE+18], CoQA [RCM19], DROP [DWD+19], SuperGLUE [WPN+19].
Dataset Splits | Yes | For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task's training set as conditioning... For LAMBADA and StoryCloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. [...] When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. (These split conventions are restated as a small helper after the table.)
Hardware Specification | Yes | All models were trained on V100 GPUs on part of a high-bandwidth cluster.
Software Dependencies | No | The paper mentions various tools and components like 'byte-level BPE tokenizer', 'beam search', 'SentiWordNet', 'off-the-shelf POS tagger [LB02]', 'multi-bleu.perl', and 'SacreBLEU [Pos18]', but it does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9') in a way that allows for replication.
Experiment Setup | Yes | We typically set K in the range of 10 to 100, as this is how many examples can fit in the model's context window (n_ctx = 2048). [...] larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table A.1 shows the parameter settings we used. [...] On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. [...] We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top-p of 0.9 for every prompt in our dataset. (These decoding settings are collected in a configuration sketch after the table.)
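
The few-shot, prompt-only protocol quoted in the Research Type row can be summarized in a few lines of Python. This is a minimal sketch, not code from the paper: build_few_shot_prompt, the "x => y" demonstration format, and the complete() call are all assumptions. The only details taken from the quotes above are that K solved demonstrations are placed in the model's context as plain text and that no gradient updates or fine-tuning are applied.

import random

def build_few_shot_prompt(task_description, train_examples, test_input, k=32, seed=0):
    """Assemble a few-shot prompt: a task description, K solved demonstrations
    drawn at random from the task's training set, and the unsolved test input
    for the frozen model to complete. No gradient updates are involved; the
    task is specified purely via text."""
    rng = random.Random(seed)
    demos = rng.sample(train_examples, k)
    lines = [task_description]
    for source, target in demos:
        lines.append(f"{source} => {target}")  # hypothetical demonstration format
    lines.append(f"{test_input} =>")           # the model completes this final line
    return "\n".join(lines)

# Hypothetical usage; complete() stands in for a text-completion call to the
# frozen language model.
# prompt = build_few_shot_prompt(
#     "Translate English to French.",
#     train_examples=[("cheese", "fromage"), ("house", "maison"), ("dog", "chien")],
#     test_input="sea otter",
#     k=2)
# prediction = complete(prompt)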
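
The split conventions quoted in the Dataset Splits row can be stated explicitly as a small helper. choose_splits below is hypothetical (the paper publishes no such function); it only encodes the three cases in the quote.

def choose_splits(task, private_test_set=False):
    """Return which split supplies the K conditioning examples and which split
    is scored, following the conventions quoted in the Dataset Splits row."""
    # LAMBADA and StoryCloze have no supervised training set: conditioning
    # examples come from the development set and scoring is on the test set.
    if task in {"LAMBADA", "StoryCloze"}:
        return {"conditioning": "dev", "evaluation": "test"}
    # Private test set with a model too large for the test server: report
    # results on the development set instead.
    if private_test_set:
        return {"conditioning": "train", "evaluation": "dev"}
    # Default case: condition on training examples, score the test set.
    return {"conditioning": "train", "evaluation": "test"}

# choose_splits("LAMBADA")                           -> dev conditioning, test eval
# choose_splits("some_task", private_test_set=True)  -> train conditioning, dev eval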
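
Finally, the decoding settings quoted in the Experiment Setup row are gathered below as a configuration sketch. The key names are illustrative assumptions, not parameters of any particular library; the values (context window, typical K range, beam width, length penalty, temperature, top-p, sample length and count) come directly from the quoted text.

# Few-shot conditioning and decoding settings quoted above, as a config sketch.
GPT3_EVAL_SETTINGS = {
    "context_window_tokens": 2048,   # n_ctx; bounds how many demonstrations fit
    "k_shots_typical_range": (10, 100),
    "free_form_completion": {        # tasks with free-form answers
        "decoding": "beam_search",
        "beam_width": 4,
        "length_penalty_alpha": 0.6,
    },
    "sample_generation": {           # the 800-outputs-per-prompt sample set
        "decoding": "nucleus_sampling",
        "temperature": 1.0,
        "top_p": 0.9,
        "output_length_tokens": 50,
        "outputs_per_prompt": 800,
    },
}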