Generating Training Data with Language Models: Towards Zero-Shot Language Understanding
Authors: Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across seven classification tasks of the GLUE benchmark [66], SuperGen significantly outperforms the prompt-based zero-shot method and even achieves an overall better result in both average performance and stability than strong few-shot approaches that use 32 annotated samples per class. We present the results of SuperGen, its ablations and compared methods in Table 2. |
| Researcher Affiliation | Academia | Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign {yumeng5,jiaxinh3,yuz9,hanj}@illinois.edu |
| Pseudocode | Yes | Algorithm 1: SuperGen for Zero-Shot Learning. |
| Open Source Code | Yes | Code can be found at https://github.com/yumeng5/SuperGen. |
| Open Datasets | Yes | Across seven classification tasks of the GLUE benchmark [66], SuperGen significantly outperforms the prompt-based zero-shot method and even achieves an overall better result in both average performance and stability than strong few-shot approaches that use 32 annotated samples per class. We assume the pretraining corpus D (e.g., Wikipedia) is available. |
| Dataset Splits | Yes | The original development sets of these tasks are used for testing. We follow the evaluation protocol of [13]: We use F1 score as the metric for QQP and MRPC, Matthews correlation for CoLA, and accuracy for the rest of the tasks. When the few-shot training and validation sets are rather small (32-64 samples per label in total), fine-tuning the classifier on the SuperGen generated set further (after fine-tuning on the few-shot samples) brings notable performance improvements. |
| Hardware Specification | No | The paper mentions that models are of 'moderate size to fit in typical research hardware' and refers to model sizes like 'GPT-2-sized' or 'RoBERTa-Large-sized', but it does not specify concrete hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8') needed for replication, only mentioning the names of PLMs used. |
| Experiment Setup | Yes | We keep all fine-tuning hyperparameters (e.g., learning rate, batch size, training epochs, number of generated training samples, label smoothing and temporal ensembling hyperparameters) the same across all tasks. See Appendix B Table 10 for details. |
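
The setup row above notes that label smoothing and temporal ensembling hyperparameters are kept fixed across tasks. As a rough illustration of how these two regularizers can be combined when fine-tuning a classifier on noisy generated data, here is a minimal PyTorch-style sketch. It assumes a Hugging Face-style sequence classifier exposing `.logits`; the hyperparameter names and values (`eps`, `lam`, `momentum`) are illustrative placeholders, not the paper's settings from Appendix B.

```python
import torch
import torch.nn.functional as F


def smoothed_targets(labels, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the assigned class,
    eps / (num_classes - 1) spread over the remaining classes."""
    off_value = eps / (num_classes - 1)
    targets = torch.full((labels.size(0), num_classes), off_value, device=labels.device)
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return targets


def training_step(model, input_ids, attention_mask, labels,
                  ensemble_probs, sample_idx, num_classes,
                  eps=0.1, lam=1.0, momentum=0.9):
    """One fine-tuning step combining label-smoothed cross-entropy with a
    temporal-ensembling consistency term (simplified; no bias correction).

    ensemble_probs: CPU tensor [num_train_samples, num_classes] holding the
    moving-average predictions; sample_idx: CPU LongTensor of batch indices."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    probs = F.softmax(logits, dim=-1)

    # Cross-entropy against smoothed (possibly noisy) generated labels.
    targets = smoothed_targets(labels, num_classes, eps).to(logits.device)
    ce_loss = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    # Consistency between current predictions and the ensembled predictions.
    ens = ensemble_probs[sample_idx].to(logits.device)
    consistency = F.mse_loss(probs, ens)

    # Update the stored ensemble predictions (no gradient flows through them).
    with torch.no_grad():
        ensemble_probs[sample_idx] = (
            momentum * ensemble_probs[sample_idx]
            + (1.0 - momentum) * probs.detach().cpu()
        )

    return ce_loss + lam * consistency
```

This sketch only shows the loss composition; the paper additionally reports quality-based selection of generated samples and standard fine-tuning details (learning rate, batch size, epochs) listed in its Appendix B Table 10.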