Selective Annotation Makes Language Models Better Few-Shot Learners
Authors: Hongjin SU, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Tao Yu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 10 datasets (covering classification, commonsense reasoning, dialogue, and text/code generation) demonstrate that our selective annotation method improves the task performance by a large margin. |
| Researcher Affiliation | Collaboration | The University of Hong Kong; University of Washington; Allen Institute for AI; Carnegie Mellon University; Penn State University; Meta AI |
| Pseudocode | Yes | Algorithm 1 Vote-k Selective Annotation (a hedged sketch of the voting stage follows the table) |
| Open Source Code | Yes | Our code is available at https://github.com/HKUNLP/icl-selective-annotation. |
| Open Datasets | Yes | We use 10 diverse NLP datasets across 9 tasks that are listed in Table 1. These datasets involve different task formulations, thereby allowing for extensive evaluations in varying scenarios. Some of those are included in the widely-used GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | Yes | For each dataset, we use the standard train/dev./test split available from the Transformers library (Wolf et al., 2020). |
| Hardware Specification | No | The paper mentions using specific language models like GPT-J and Codex-davinci-002 and refers to 'computational budget' but does not specify the underlying hardware (e.g., specific GPU or CPU models, memory). |
| Software Dependencies | No | The paper mentions software like Sentence-BERT and the Transformers library, but it does not specify exact version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | We tuned k and ρ in our preliminary experiments, and found that k=150 and ρ=10 perform well across many datasets. For the classification and multiple-choice tasks, we compute the average log score for each choice and choose the maximum one. For generation tasks, we simply perform beam-search decoding. (The choice-scoring step is illustrated in the second sketch below.) |
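
The Vote-k selective annotation procedure appears only as pseudocode (Algorithm 1) in the paper; the authors' actual implementation is in the repository linked above. Purely as illustration, here is a minimal NumPy sketch of the graph-based voting stage as the paper describes it: build a k-nearest-neighbor graph over sentence embeddings, let each instance vote for its neighbors, and discount votes from regions already covered by the selected set (the ρ^(-overlap) weight). The function name `vote_k_select` and all implementation details are assumptions, not the paper's code.

```python
# Hypothetical sketch of the graph-based voting stage of Vote-k.
# Assumes precomputed sentence embeddings (e.g., from Sentence-BERT);
# k and rho follow the paper's notation (k=150, rho=10 in the paper).
import numpy as np

def vote_k_select(embeddings: np.ndarray, budget: int,
                  k: int = 150, rho: float = 10.0) -> list[int]:
    """Pick `budget` diverse, representative indices from an unlabeled pool."""
    n = embeddings.shape[0]
    k = min(k, n - 1)
    # Cosine-similarity k-NN graph: N(v) = the k nearest neighbors of v.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self from neighborhoods
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # shape (n, k)

    selected: list[int] = []
    selected_mask = np.zeros(n, dtype=bool)
    for _ in range(budget):
        # Each voter v is weighted by rho^(-|N(v) ∩ selected|), so voters
        # whose neighborhoods are already covered count for less,
        # pushing selection toward uncovered regions of the graph.
        overlap = selected_mask[neighbors].sum(axis=1)
        weight = rho ** (-overlap.astype(float))
        scores = np.zeros(n)
        for v in range(n):
            scores[neighbors[v]] += weight[v]  # v votes for its neighbors
        scores[selected_mask] = -np.inf  # never re-select
        pick = int(np.argmax(scores))
        selected.append(pick)
        selected_mask[pick] = True
    return selected
```

This sketch materializes the full n×n similarity matrix for clarity; a real implementation over a large unlabeled pool would use approximate nearest-neighbor search. It also omits the paper's second, confidence-based stage, in which the language model's scores on the remaining unlabeled data guide the rest of the annotation budget.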
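
The inference procedure quoted in the Experiment Setup row (average log score per choice for classification and multiple-choice tasks) can likewise be illustrated with a short sketch. GPT-2 stands in here for the GPT-J and Codex models the paper actually uses, and the helper names (`score_choice`, `pick_answer`) are hypothetical:

```python
# Hypothetical sketch of "average log score per choice" inference.
# GPT-2 is a stand-in model; names and details are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def score_choice(prompt: str, choice: str) -> float:
    """Mean log-probability of the choice tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits  # (1, seq_len, vocab)
    # Position i predicts token i+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Simplification: assumes tokenizing prompt+choice preserves the
    # prompt's token boundary, which can fail at word junctions.
    choice_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    token_lps = [log_probs[pos, full_ids[0, pos + 1]] for pos in choice_positions]
    return torch.stack(token_lps).mean().item()

def pick_answer(prompt: str, choices: list[str]) -> str:
    """Choose the candidate with the highest average log score."""
    return max(choices, key=lambda c: score_choice(prompt, c))
```

For generation tasks, the paper instead reports plain beam-search decoding, which in the Transformers API corresponds to `model.generate(..., num_beams=...)`.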