reproducibilityindex.ai

Selective Annotation Makes Language Models Better Few-Shot Learners

Authors: Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Tao Yu

ICLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on 10 datasets (covering classification, commonsense reasoning, dialogue, and text/code generation) demonstrate that our selective annotation method improves the task performance by a large margin.
Researcher Affiliation	Collaboration	The University of Hong Kong, University of Washington, Allen Institute for AI, Carnegie Mellon University, Penn State University, Meta AI
Pseudocode	Yes	Algorithm 1 Vote-k Selective Annotation
Open Source Code	Yes	Our code is available at https://github.com/HKUNLP/icl-selective-annotation.
Open Datasets	Yes	We use 10 diverse NLP datasets across 9 tasks that are listed in Table 1. These datasets involve different task formulations, thereby allowing for extensive evaluations in varying scenarios. Some of those are included in the widely-used GLUE benchmark (Wang et al., 2019).
Dataset Splits	Yes	For each dataset, we use the standard train/dev./test split available from the Transformers library (Wolf et al., 2020).
Hardware Specification	No	The paper mentions using specific language models like GPT-J and Codex-davinci-002 and refers to 'computational budget' but does not specify the underlying hardware (e.g., specific GPU or CPU models, memory).
Software Dependencies	No	The paper mentions software like Sentence-BERT and the Transformers library, but it does not specify exact version numbers for these or any other ancillary software components.
Experiment Setup	Yes	We tuned k and ρ in our preliminary experiments, and found that k=150 and ρ=10 perform well across many datasets. For the classification and multiple-choice tasks, we compute the average log score for each choice and choose the maximum one. For generation tasks, we simply perform beam-search decoding.
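
The scoring rule quoted in the Experiment Setup row for classification and multiple-choice tasks can be illustrated with a minimal sketch. It assumes the "average log score" is the mean of per-token log-probabilities for each candidate answer, which the quoted text does not spell out. The function names and the example numbers are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List


def average_log_score(token_logprobs: List[float]) -> float:
    """Mean of the per-token log-probabilities for one candidate answer.

    Assumption: averaging is done per token, so that longer candidates
    are not penalized simply for having more tokens.
    """
    if not token_logprobs:
        raise ValueError("candidate has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def pick_choice(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate label with the highest average log score."""
    return max(candidates, key=lambda label: average_log_score(candidates[label]))


# Hypothetical log-probabilities a language model might assign to the
# tokens of each candidate label (illustrative values only).
example = {
    "positive": [-0.2, -0.4],
    "negative": [-0.1, -1.5, -0.9],
}
prediction = pick_choice(example)
print(prediction)
```

Averaging rather than summing means the three-token candidate is compared on equal footing with the two-token one. A plain sum would favor shorter candidates regardless of how likely each token is.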