Language Models are Few-Shot Learners
Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous nonsparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. |
| Researcher Affiliation | Collaboration | Equal contribution. Johns Hopkins University, OpenAI |
| Pseudocode | No | The paper describes methods and processes in narrative text but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions 'Our release repository contains uncurated unconditional samples.' but does not provide a specific URL or an explicit statement of releasing the code for the methodology. It also mentions 'GPT-3 GitHub' in the context of training data, not open-source code for the model itself. |
| Open Datasets | Yes | To create our training data, we (1) downloaded and filtered a version of Common Crawl [RSR+19] [https://commoncrawl.org/the-data/]... These reference corpora include an expanded version of the WebText dataset [RWC+19]... two internet-based books corpora (Books1 and Books2) and English-language Wikipedia (details in the appendix). For evaluation: Penn Treebank (PTB) [MKM+94], LAMBADA dataset [PKL+16], HellaSwag dataset [ZHB+19], StoryCloze 2016 dataset [MCH+16], TriviaQA [JCWZ17], Natural Questions (NQs) [KPR+19], ARC [CCE+18], CoQA [RCM19], DROP [DWD+19], SuperGLUE [WPN+19]. |
| Dataset Splits | Yes | For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task's training set as conditioning... For LAMBADA and StoryCloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. [...] When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. (See the prompt-construction sketch after the table.) |
| Hardware Specification | Yes | All models were trained on V100 GPUs on part of a high-bandwidth cluster. |
| Software Dependencies | No | The paper mentions various tools and components like 'byte-level BPE tokenizer', 'beam search', 'SentiWordNet', 'off-the-shelf POS tagger [LB02]', 'multi-bleu.perl', and 'SacreBLEU [Pos18]', but it does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9') in a way that allows for replication. |
| Experiment Setup | Yes | We typically set K in the range of 10 to 100, as this is how many examples can fit in the model's context window (n_ctx = 2048). [...] larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table A.1 shows the parameter settings we used. [...] On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. [...] We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top-p of 0.9 for every prompt in our dataset. (See the decoding-configuration sketch after the table.) |
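The few-shot protocol quoted in the Dataset Splits and Experiment Setup rows amounts to drawing K demonstrations from a task's training set (or development set when no training set exists) and packing them, together with the test query, into the 2048-token context window. The snippet below is a minimal sketch of that prompt construction, not the authors' code; `format_example` and `count_tokens` are hypothetical placeholders standing in for the task-specific formatting and tokenizer actually used.

```python
import random

CONTEXT_WINDOW = 2048  # n_ctx for GPT-3, per the paper


def build_few_shot_prompt(test_example, train_examples, k,
                          format_example, count_tokens):
    """Randomly draw K conditioning examples from the task's training set
    (or dev set for LAMBADA / StoryCloze) and prepend them to the test
    query, keeping the prompt within the model's context window."""
    demos = random.sample(train_examples, k)
    prompt_parts = [format_example(d, include_answer=True) for d in demos]
    prompt_parts.append(format_example(test_example, include_answer=False))
    prompt = "\n\n".join(prompt_parts)

    # If the prompt is too long, drop demonstrations from the front until it
    # fits; the test query itself is always kept.
    while count_tokens(prompt) > CONTEXT_WINDOW and len(prompt_parts) > 1:
        prompt_parts.pop(0)
        prompt = "\n\n".join(prompt_parts)
    return prompt
```

The paper typically sets K between 10 and 100; in this sketch a smaller effective K simply results when the context window forces demonstrations to be dropped.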
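The decoding settings quoted in the Experiment Setup row (beam search with width 4 and length penalty α = 0.6 for free-form completion; 800 samples of length 50 at temperature 1 and top-p 0.9 for the model-output sample set) can be collected into a plain configuration object. This is a sketch under the assumption of a generic `generate(prompt, **params)` callable; it is not the interface the authors used.

```python
from dataclasses import dataclass


@dataclass
class DecodingConfig:
    # Free-form completion tasks: beam search, same parameters as [RSR+19].
    beam_width: int = 4
    length_penalty: float = 0.6   # alpha
    # Open-ended sampling used to build the 800-output sample set.
    temperature: float = 1.0
    top_p: float = 0.9
    max_new_tokens: int = 50
    num_samples: int = 800


def sample_outputs(generate, prompt, cfg: DecodingConfig):
    """Generate cfg.num_samples completions of cfg.max_new_tokens tokens each
    with nucleus sampling; `generate` is a hypothetical model call."""
    return [
        generate(prompt,
                 temperature=cfg.temperature,
                 top_p=cfg.top_p,
                 max_tokens=cfg.max_new_tokens)
        for _ in range(cfg.num_samples)
    ]
```

Keeping the beam-search and sampling parameters in one place mirrors how the paper reports them: the former apply only to free-form completion benchmarks, the latter only to the bias- and quality-probing sample set.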