Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Selective Annotation Makes Language Models Better Few-Shot Learners
Authors: Hongjin SU, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Tao Yu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 10 datasets (covering classification, commonsense reasoning, dialogue, and text/code generation) demonstrate that our selective annotation method improves the task performance by a large margin. |
| Researcher Affiliation | Collaboration | The University of Hong Kong University of Washington Allen Institute for AI Carnegie Mellon University Penn State University Meta AI |
| Pseudocode | Yes | Algorithm 1 Vote-k Selective Annotation |
| Open Source Code | Yes | Our code is available at https://github.com/HKUNLP/icl-selective-annotation. |
| Open Datasets | Yes | We use 10 diverse NLP datasets across 9 tasks that are listed in Table 1. These datasets involve different task formulations, thereby allowing for extensive evaluations in varying scenarios. Some of those are included in the widely-used GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | Yes | For each dataset, we use the standard train/dev./test split available from the Transformers library (Wolf et al., 2020). |
| Hardware Specification | No | The paper mentions using specific language models like GPT-J and Codex-davinci-002 and refers to 'computational budget' but does not specify the underlying hardware (e.g., specific GPU or CPU models, memory). |
| Software Dependencies | No | The paper mentions software like Sentence-BERT and the Transformers library, but it does not specify exact version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | We tuned k and ρ in our preliminary experiments, and found that k=150 and ρ=10 perform well across many datasets. For the classification and multiple-choice tasks, we compute the average log score for each choice and choose the maximum one. For generation tasks, we simply perform beam-search decoding. |
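
The scoring rule quoted in the Experiment Setup row (average the log score of each choice, then take the maximum) can be illustrated with a minimal sketch. It assumes that "average log score" means the mean of per-token log-probabilities for each candidate answer. The function name `pick_choice` and the numbers below are hypothetical and do not come from the paper or its repository.

```python
from typing import Dict, List


def pick_choice(token_logprobs: Dict[str, List[float]]) -> str:
    """Return the choice whose tokens have the highest mean log-probability.

    Assumption: each choice maps to the per-token log-probabilities that a
    language model assigned to that choice's text.
    """
    if not token_logprobs:
        raise ValueError("need at least one candidate choice")
    averages = {}
    for choice, logprobs in token_logprobs.items():
        if not logprobs:
            raise ValueError(f"choice {choice!r} has no token scores")
        # Averaging normalizes for length, so longer choices are not
        # penalized just for having more tokens.
        averages[choice] = sum(logprobs) / len(logprobs)
    return max(averages, key=averages.get)


# Hypothetical per-token log-probabilities, made up for illustration.
example = {
    "positive": [-0.2, -0.4],
    "negative": [-1.5, -0.9, -0.6],
}
print(pick_choice(example))
```

Averaging rather than summing matters when choices differ in length: a one-token choice scored -3.0 would beat a four-token choice scored -1.0 per token under a sum, but loses under the mean.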