Calibrate Before Use: Improving Few-shot Performance of Language Models

Authors: Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test the effectiveness of contextual calibration on a range of tasks (Section 5). Contextual calibration consistently improves GPT-3 and GPT-2's accuracy (up to 30.0% absolute) across different choices of the prompt format and examples (e.g., Figure 1). It also makes the accuracy more stable across different prompts, thus mitigating the need for prompt engineering. Table 1 shows the results and Figure 1 in Section 1 plots the same data for a subset of the tasks.
Researcher Affiliation | Academia | UC Berkeley, University of Maryland, UC Irvine. Correspondence to: Eric Wallace <ericwallace@berkeley.edu>.
Pseudocode | No | The paper describes its method in prose but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release code to replicate our experiments: https://www.github.com/tonyzhaozh/few-shot-learning
Open Datasets | Yes | We study text classification using six datasets: sentiment analysis using SST-2 (Socher et al., 2013), 6-way question classification using TREC (Voorhees & Tice, 2000), textual entailment using 3-way CB (de Marneffe et al., 2019) and binary RTE (Dagan et al., 2005) from SuperGLUE (Wang et al., 2019), and topic classification using the 4-way AGNews (Zhang et al., 2015) and 14-way DBPedia (Zhang et al., 2015) datasets.
Dataset Splits | No | The paper mentions using validation sets (e.g., a 'balanced validation set' in Section 4, 'on the validation set' in Section 5.2), but it does not provide specific details on the split percentages, sample counts, or the methodology for creating these splits from the main datasets.
Hardware Specification | No | The paper states, 'We run our experiments on three sizes of GPT-3 (2.7B, 13B, and 175B parameters) as well as GPT-2 (1.5B parameters). We access GPT-3 using the OpenAI API.' GPT-3 is accessed only through an API, and no hardware details are given for the GPT-2 experiments or the rest of the computational environment.
Software Dependencies | No | The paper mentions using the GPT-3 and GPT-2 models via the OpenAI API, but it does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We average the probabilities from three content-free inputs: 'N/A', '[MASK]', and the empty string. We then set W = diag(p̂_cf)^(-1) and b to the all-zero vector. To make test predictions, we compute W p̂ + b and take the argmax. For classification tasks, the probability for each class is given by the probability assigned to its associated label name, e.g., the words 'Negative' and 'Positive' for sentiment classification.
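
The quoted setup describes the core contextual calibration procedure. Below is a minimal sketch of that step, assuming a helper get_label_probs(text) that returns the model's normalized probabilities over the label names; this helper and the function names are hypothetical placeholders for illustration, not functions from the authors' released repository.

    import numpy as np

    def contextual_calibration(get_label_probs, content_free_inputs=("N/A", "[MASK]", "")):
        # Average the model's label probabilities over the content-free inputs
        # to estimate p_cf, then set W = diag(p_cf)^(-1) and b to the zero vector.
        # get_label_probs(text) is an assumed helper returning a probability
        # vector over the label names (e.g., "Negative", "Positive").
        p_cf = np.mean([get_label_probs(x) for x in content_free_inputs], axis=0)
        W = np.diag(1.0 / p_cf)
        b = np.zeros_like(p_cf)
        return W, b

    def calibrated_predict(get_label_probs, W, b, test_input):
        # Compute W p_hat + b for the test input and take the argmax class index.
        p_hat = np.asarray(get_label_probs(test_input))
        return int(np.argmax(W @ p_hat + b))

Because W is diagonal with positive entries, this rescaling counteracts the bias toward particular label names that the prompt induces, and it requires no additional labeled data beyond the content-free inputs.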