Calibrate Before Use: Improving Few-shot Performance of Language Models
Authors: Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the effectiveness of contextual calibration on a range of tasks (Section 5). Contextual calibration consistently improves GPT-3's and GPT-2's accuracy (up to 30.0% absolute) across different choices of the prompt format and examples (e.g., Figure 1). It also makes the accuracy more stable across different prompts, thus mitigating the need for prompt engineering. Table 1 shows the results and Figure 1 in Section 1 plots the same data for a subset of the tasks. |
| Researcher Affiliation | Academia | ¹UC Berkeley, ²University of Maryland, ³UC Irvine. Correspondence to: Eric Wallace <ericwallace@berkeley.edu>. |
| Pseudocode | No | The paper describes its method in prose but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release code to replicate our experiments: https://www.github.com/tonyzhaozh/few-shot-learning |
| Open Datasets | Yes | We study text classification using six datasets: sentiment analysis using SST-2 (Socher et al., 2013), 6-way question classification using TREC (Voorhees & Tice, 2000), textual entailment using 3-way CB (de Marneffe et al., 2019) and binary RTE (Dagan et al., 2005) from SuperGLUE (Wang et al., 2019), and topic classification using the 4-way AGNews (Zhang et al., 2015) and 14-way DBPedia (Zhang et al., 2015) datasets. |
| Dataset Splits | No | The paper mentions using 'validation sets' (e.g., 'balanced validation set' in Section 4, 'on the validation set' in Section 5.2), but it does not provide specific details on the split percentages, sample counts, or the methodology for creating these splits from the main datasets. |
| Hardware Specification | No | The paper states, 'We run our experiments on three sizes of GPT-3 (2.7B, 13B, and 175B parameters) as well as GPT-2 (1.5B parameters). We access GPT-3 using the OpenAI API.' This indicates reliance on an API for GPT-3, and no specific hardware details are provided for GPT-2 or the computational environment. |
| Software Dependencies | No | The paper mentions using GPT-3 and GPT-2 models via the OpenAI API, but it does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We average the probabilities from three content-free inputs: "N/A", "[MASK]", and the empty string. We then set W = diag(p̂_cf)^(-1) and b to the all-zero vector. To make test predictions, we compute W p̂ + b and take the argmax. For classification tasks, the probability for each class is given by the probability assigned to its associated label name, e.g., the words "Negative" and "Positive" for sentiment classification. (A minimal sketch of this calibration step appears after the table.) |
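
The calibration quoted in the Experiment Setup row is simple enough to sketch. The Python snippet below is a minimal illustration, assuming the label probabilities for the three content-free inputs and for a test input have already been obtained from the model; the function name `contextual_calibration` and the example numbers are hypothetical and are not taken from the authors' released code.

```python
import numpy as np

def contextual_calibration(p_cf_list, p_test):
    """Calibrate a test-time label distribution using content-free inputs.

    p_cf_list: list of probability vectors, one per content-free input
               (e.g. "N/A", "[MASK]", and the empty string).
    p_test:    uncalibrated label probabilities for a real test input.
    """
    # Average the label probabilities over the content-free inputs
    # and renormalize so they sum to 1.
    p_cf = np.mean(p_cf_list, axis=0)
    p_cf = p_cf / p_cf.sum()

    # W = diag(p_cf)^-1 and b = 0, per the setup quoted above.
    W = np.diag(1.0 / p_cf)
    b = np.zeros_like(p_cf)

    # Calibrated scores; the prediction is the argmax.
    q = W @ p_test + b
    return int(np.argmax(q))

# Hypothetical binary-sentiment example (labels: 0 = Negative, 1 = Positive)
# where the prompt biases the model toward "Positive" even on content-free input.
p_cf_inputs = [np.array([0.30, 0.70]),   # "N/A"
               np.array([0.35, 0.65]),   # "[MASK]"
               np.array([0.25, 0.75])]   # empty string
p_test = np.array([0.45, 0.55])          # raw argmax would be Positive
print(contextual_calibration(p_cf_inputs, p_test))  # calibrated argmax: 0 (Negative)
```

Because a content-free input carries no task signal, whatever preference the model shows on it reflects bias induced by the prompt itself; dividing by p̂_cf counteracts that preference at test time, which is what contextual calibration targets.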