FLEX: Unifying Evaluation for Few-Shot NLP
Authors: Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These principles include Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable. Following the principles, we release the FLEX benchmark, which includes four few-shot transfer settings, zero-shot evaluation, and a public leaderboard that covers diverse NLP tasks. In addition, we present UniFew, a prompt-based model for few-shot learning that unifies pretraining and finetuning prompt formats, eschewing complex machinery of recent prompt-based approaches in adapting downstream task formats to language model pretraining objectives. We demonstrate that despite simplicity, UniFew achieves results competitive with both popular meta-learning and prompt-based approaches. (Abstract) |
| Researcher Affiliation | Industry | Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy; Allen Institute for AI, Seattle, WA; {jbragg,armanc,kylel,beltagy}@allenai.org |
| Pseudocode | No | The paper describes its methods but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Footnote 2: Benchmark, leaderboard, and benchmark creation toolkit: https://github.com/allenai/flex (Apache License 2.0). Footnote 3: Few-shot model: https://github.com/allenai/unifew (Apache License 2.0). (Page 1 footnotes) |
| Open Datasets | Yes | Our framework makes available a wide range of community-contributed NLP datasets and utilities via Hugging Face Datasets [42]. (Section 4.4, footnote 13) See the dataset-loading sketch after this table. |
| Dataset Splits | Yes | We carefully selected and split datasets across meta-train/validation/test in order to enable testing of Class, Domain, and Task transfer with a single meta-training phase (to reduce computational burden). Datasets involved in each transfer setting (detailed split information in Table 4 in Appendix A): (Section 4.2, Meta-Evaluation Protocols) |
| Hardware Specification | Yes | Costs estimated using a Quadro RTX-8000 GPU with 48Gb memory. (Section 5, Sample Size Design) We used NVidia RTX8000 GPUs, which take about 7 GPU-hours for meta-training and 48 GPU-hours for meta-testing. (Section 6, Training details) |
| Software Dependencies | No | Our framework makes available a wide range of community-contributed NLP datasets and utilities via Hugging Face Datasets [42]. (Section 4.4) UniFew (1) converts examples into multiple-choice question-answer (QA) format, and (2) uses UnifiedQA [34], a T5 [51] model further pretrained on a large collection of QA pairs. (Section 6) The paper mentions the use of Hugging Face Datasets, UnifiedQA, and T5 models, but does not specify version numbers for these software components or other key libraries like PyTorch. A hedged prompting sketch in the UniFew style appears after this table. |
| Experiment Setup | Yes | For meta-training and meta-validation of UniFew, we sampled E_train and E_val with 5-class, 5-training-shot sampling with the same number of shots per class. We trained the model for a total of 30K steps, using a linear learning rate scheduler with peak rate of 3e-5, 200 warmup steps, and batch size of 4; we selected the best checkpoint based on E_val performance. At meta-test time, for each episode, we trained the model on the episode's training examples (if they exist) and predicted the outputs on test examples. For training at meta-test time, we used a constant learning rate of 3e-5 and batch size of 4, and trained the model for 400 steps. (Section 6, Training details) A hedged reconstruction of this schedule appears after this table. |
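
The Open Datasets row above credits Hugging Face Datasets as the distribution channel for the benchmark's constituent tasks. Below is a minimal sketch of loading a classification dataset through that library; the specific dataset (`glue`/`mrpc`) is an illustrative stand-in rather than a confirmed FLEX task, and the FLEX toolkit presumably wraps this step behind its own episode-sampling interface.

```python
# Minimal sketch: pulling a community-contributed classification dataset via
# Hugging Face Datasets. The dataset chosen here ("glue"/"mrpc") is illustrative
# only, not necessarily one of the FLEX tasks.
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")           # DatasetDict with train/validation/test splits
print(dataset["train"][0])                       # one labeled example
print(dataset["train"].features["label"].names)  # class names for the task
```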
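
The Software Dependencies row describes UniFew as recasting each example into a multiple-choice QA prompt and answering it with UnifiedQA, a T5 model further pretrained on QA pairs. The sketch below illustrates that idea using the public `allenai/unifiedqa-t5-base` checkpoint from the Hugging Face hub; the prompt template and the verbalizer mapping labels to answer strings are assumptions for illustration, not the authors' released UniFew code.

```python
# Hedged sketch of UniFew's core idea: turn a classification example into a
# multiple-choice question and let UnifiedQA (a T5 model) generate the answer.
# The prompt template and label verbalizer below are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "allenai/unifiedqa-t5-base"  # public UnifiedQA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def classify(text, choices):
    # Enumerate the label choices as multiple-choice options, then append the text.
    options = " ".join(f"({chr(ord('a') + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"which choice best describes the text?\n{options}\n{text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(classify("The movie was a delight from start to finish.",
               ["positive", "negative"]))
```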
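
The Experiment Setup row reports the meta-training schedule (30K total steps, linear decay from a 3e-5 peak, 200 warmup steps, batch size 4) and a constant 3e-5 rate for 400 steps per episode at meta-test time. A hedged reconstruction of that configuration with PyTorch and Hugging Face Transformers follows; the optimizer choice (AdamW) and model checkpoint are assumptions, since the quoted passage specifies only the schedule values.

```python
# Hedged reconstruction of the reported training schedule. Only the schedule
# values (30K steps, 3e-5 peak, 200 warmup steps, 400 meta-test steps, batch
# size 4) come from the paper; the optimizer (AdamW) and checkpoint are
# illustrative assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-base")

# Meta-training: linear warmup/decay schedule, peak learning rate 3e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,
    num_training_steps=30_000,
)
# Each optimization step (batch size 4) would then call:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()

# Meta-test: a constant 3e-5 learning rate, 400 update steps on each episode's
# training shots (no scheduler).
episode_optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```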