Calibrate Before Use: Improving Few-shot Performance of Language Models
Authors: Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the effectiveness of contextual calibration on a range of tasks (Section 5). Contextual calibration consistently improves GPT-3's and GPT-2's accuracy (up to 30.0% absolute) across different choices of the prompt format and examples (e.g., Figure 1). It also makes the accuracy more stable across different prompts, thus mitigating the need for prompt engineering. Table 1 shows the results and Figure 1 in Section 1 plots the same data for a subset of the tasks. |
| Researcher Affiliation | Academia | ¹UC Berkeley, ²University of Maryland, ³UC Irvine. Correspondence to: Eric Wallace <ericwallace@berkeley.edu>. |
| Pseudocode | No | The paper describes its method in prose but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release code to replicate our experiments: https://www.github.com/tonyzhaozh/few-shot-learning |
| Open Datasets | Yes | We study text classification using six datasets: sentiment analysis using SST-2 (Socher et al., 2013), 6-way question classification using TREC (Voorhees & Tice, 2000), textual entailment using 3-way CB (de Marneffe et al., 2019) and binary RTE (Dagan et al., 2005) from SuperGLUE (Wang et al., 2019), and topic classification using the 4-way AGNews (Zhang et al., 2015) and 14-way DBPedia (Zhang et al., 2015) datasets. |
| Dataset Splits | No | The paper mentions using 'validation sets' (e.g., 'balanced validation set' in Section 4, 'on the validation set' in Section 5.2), but it does not provide specific details on the split percentages, sample counts, or the methodology for creating these splits from the main datasets. |
| Hardware Specification | No | The paper states, 'We run our experiments on three sizes of GPT-3 (2.7B, 13B, and 175B parameters) as well as GPT-2 (1.5B parameters). We access GPT-3 using the OpenAI API.' This indicates reliance on an API for GPT-3, and no specific hardware details are provided for GPT-2 or the computational environment. |
| Software Dependencies | No | The paper mentions using GPT-3 and GPT-2 models via the OpenAI API, but it does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We average the probabilities from three content-free inputs: "N/A", "[MASK]", and the empty string. We then set W = diag(p̂_cf)^(-1) and b to the all-zero vector. To make test predictions, we compute W p̂ + b and take the argmax. For classification tasks, the probability for each class is given by the probability assigned to its associated label name, e.g., the words "Negative" and "Positive" for sentiment classification. (A minimal sketch of this calibration step appears after the table.) |
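
The calibration quoted in the Experiment Setup row is simple enough to sketch. The Python snippet below is a minimal illustration, assuming the label probabilities for the three content-free inputs and for a test input have already been obtained from the model; the function name `contextual_calibration` and the example numbers are hypothetical and are not taken from the authors' released code.

```python
import numpy as np

def contextual_calibration(p_cf_list, p_test):
    """Calibrate a test-time label distribution using content-free inputs.

    p_cf_list: list of probability vectors, one per content-free input
               (e.g. "N/A", "[MASK]", and the empty string).
    p_test:    uncalibrated label probabilities for a real test input.
    """
    # Average the label probabilities over the content-free inputs
    # and renormalize so they sum to 1.
    p_cf = np.mean(p_cf_list, axis=0)
    p_cf = p_cf / p_cf.sum()

    # W = diag(p_cf)^-1 and b = 0, per the setup quoted above.
    W = np.diag(1.0 / p_cf)
    b = np.zeros_like(p_cf)

    # Calibrated scores; the prediction is the argmax.
    q = W @ p_test + b
    return int(np.argmax(q))

# Hypothetical binary-sentiment example (labels: 0 = Negative, 1 = Positive)
# where the prompt biases the model toward "Positive" even on content-free input.
p_cf_inputs = [np.array([0.30, 0.70]),   # "N/A"
               np.array([0.35, 0.65]),   # "[MASK]"
               np.array([0.25, 0.75])]   # empty string
p_test = np.array([0.45, 0.55])          # raw argmax would be Positive
print(contextual_calibration(p_cf_inputs, p_test))  # calibrated argmax: 0 (Negative)
```

Because a content-free input carries no task signal, whatever preference the model shows on it reflects bias induced by the prompt itself; dividing by p̂_cf counteracts that preference at test time, which is what contextual calibration targets.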