FLEX: Unifying Evaluation for Few-Shot NLP
Authors: Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These principles include Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable. Following the principles, we release the FLEX benchmark, which includes four few-shot transfer settings, zero-shot evaluation, and a public leaderboard that covers diverse NLP tasks. In addition, we present UniFew, a prompt-based model for few-shot learning that unifies pretraining and finetuning prompt formats, eschewing complex machinery of recent prompt-based approaches in adapting downstream task formats to language model pretraining objectives. We demonstrate that despite simplicity, UniFew achieves results competitive with both popular meta-learning and prompt-based approaches. (Abstract) |
| Researcher Affiliation | Industry | Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy; Allen Institute for AI, Seattle, WA; {jbragg,armanc,kylel,beltagy}@allenai.org |
| Pseudocode | No | The paper describes its methods but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Footnote 2: Benchmark, leaderboard, and benchmark creation toolkit: https://github.com/allenai/flex (Apache License 2.0). Footnote 3: Few-shot model: https://github.com/allenai/unifew (Apache License 2.0). (Page 1 footnotes) |
| Open Datasets | Yes | Our framework makes available a wide range of community-contributed NLP datasets and utilities via Hugging Face Datasets [42]. (Section 4.4, footnote 13) See the dataset-loading sketch after this table. |
| Dataset Splits | Yes | We carefully selected and split datasets across meta-train/validation/test in order to enable testing of Class, Domain, and Task transfer with a single meta-training phase (to reduce computational burden). Datasets involved in each transfer setting (detailed split information in Table 4 in Appendix A): (Section 4.2, Meta-Evaluation Protocols) |
| Hardware Specification | Yes | Costs estimated using a Quadro RTX-8000 GPU with 48Gb memory. (Section 5, Sample Size Design) We used NVidia RTX8000 GPUs, which take about 7 GPU-hours for meta-training and 48 GPU-hours for meta-testing. (Section 6, Training details) |
| Software Dependencies | No | Our framework makes available a wide range of community-contributed NLP datasets and utilities via Hugging Face Datasets [42]. (Section 4.4) UniFew (1) converts examples into multiple-choice question-answer (QA) format, and (2) uses UnifiedQA [34], a T5 [51] model further pretrained on a large collection of QA pairs. (Section 6) The paper mentions the use of Hugging Face Datasets, UnifiedQA, and T5 models, but does not specify version numbers for these software components or other key libraries like PyTorch. A hedged prompting sketch in the UniFew style appears after this table. |
| Experiment Setup | Yes | For meta-training and meta-validation of UniFew, we sampled E_train and E_val with 5-class, 5-training-shot sampling with the same number of shots per class. We trained the model for a total of 30K steps, using a linear learning rate scheduler with peak rate of 3e-5, 200 warmup steps, and batch size of 4; we selected the best checkpoint based on E_val performance. At meta-test time, for each episode, we trained the model on the episode's training examples (if they exist) and predicted the outputs on test examples. For training at meta-test time, we used a constant learning rate of 3e-5 and batch size of 4, and trained the model for 400 steps. (Section 6, Training details) A hedged reconstruction of this schedule appears after this table. |
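
The Open Datasets row above credits Hugging Face Datasets as the distribution channel for the benchmark's constituent tasks. Below is a minimal sketch of loading a classification dataset through that library; the specific dataset (`glue`/`mrpc`) is an illustrative stand-in rather than a confirmed FLEX task, and the FLEX toolkit presumably wraps this step behind its own episode-sampling interface.

```python
# Minimal sketch: pulling a community-contributed classification dataset via
# Hugging Face Datasets. The dataset chosen here ("glue"/"mrpc") is illustrative
# only, not necessarily one of the FLEX tasks.
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")           # DatasetDict with train/validation/test splits
print(dataset["train"][0])                       # one labeled example
print(dataset["train"].features["label"].names)  # class names for the task
```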
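
The Software Dependencies row describes UniFew as recasting each example into a multiple-choice QA prompt and answering it with UnifiedQA, a T5 model further pretrained on QA pairs. The sketch below illustrates that idea using the public `allenai/unifiedqa-t5-base` checkpoint from the Hugging Face hub; the prompt template and the verbalizer mapping labels to answer strings are assumptions for illustration, not the authors' released UniFew code.

```python
# Hedged sketch of UniFew's core idea: turn a classification example into a
# multiple-choice question and let UnifiedQA (a T5 model) generate the answer.
# The prompt template and label verbalizer below are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "allenai/unifiedqa-t5-base"  # public UnifiedQA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def classify(text, choices):
    # Enumerate the label choices as multiple-choice options, then append the text.
    options = " ".join(f"({chr(ord('a') + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"which choice best describes the text?\n{options}\n{text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(classify("The movie was a delight from start to finish.",
               ["positive", "negative"]))
```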
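
The Experiment Setup row reports the meta-training schedule (30K total steps, linear decay from a 3e-5 peak, 200 warmup steps, batch size 4) and a constant 3e-5 rate for 400 steps per episode at meta-test time. A hedged reconstruction of that configuration with PyTorch and Hugging Face Transformers follows; the optimizer choice (AdamW) and model checkpoint are assumptions, since the quoted passage specifies only the schedule values.

```python
# Hedged reconstruction of the reported training schedule. Only the schedule
# values (30K steps, 3e-5 peak, 200 warmup steps, 400 meta-test steps, batch
# size 4) come from the paper; the optimizer (AdamW) and checkpoint are
# illustrative assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-base")

# Meta-training: linear warmup/decay schedule, peak learning rate 3e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,
    num_training_steps=30_000,
)
# Each optimization step (batch size 4) would then call:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()

# Meta-test: a constant 3e-5 learning rate, 400 update steps on each episode's
# training shots (no scheduler).
episode_optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```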