Generating Training Data with Language Models: Towards Zero-Shot Language Understanding
Authors: Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across seven classification tasks of the GLUE benchmark [66], SuperGen significantly outperforms the prompt-based zero-shot method and even achieves an overall better result in both average performance and stability than strong few-shot approaches that use 32 annotated samples per class. We present the results of SuperGen, its ablations and compared methods in Table 2. |
| Researcher Affiliation | Academia | Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han Department of Computer Science, University of Illinois at Urbana-Champaign {yumeng5,jiaxinh3,yuz9,hanj}@illinois.edu |
| Pseudocode | Yes | Algorithm 1: SuperGen for Zero-Shot Learning. |
| Open Source Code | Yes | Code can be found at https://github.com/yumeng5/SuperGen. |
| Open Datasets | Yes | Across seven classification tasks of the GLUE benchmark [66], SuperGen significantly outperforms the prompt-based zero-shot method and even achieves an overall better result in both average performance and stability than strong few-shot approaches that use 32 annotated samples per class. We assume the pretraining corpus D (e.g., Wikipedia) is available. |
| Dataset Splits | Yes | The original development sets of these tasks are used for testing. We follow the evaluation protocol of [13]: We use F1 score as the metric for QQP and MRPC, Matthews correlation for CoLA, and accuracy for the rest of the tasks. When the few-shot training and validation sets are rather small (32-64 samples per label in total), fine-tuning the classifier on the SuperGen generated set further (after fine-tuning on the few-shot samples) brings notable performance improvements. |
| Hardware Specification | No | The paper mentions that models are of 'moderate size to fit in typical research hardware' and refers to model sizes like 'GPT-2-sized' or 'RoBERTa-Large-sized', but it does not specify concrete hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8') needed for replication, only mentioning the names of PLMs used. |
| Experiment Setup | Yes | We keep all fine-tuning hyperparameters (e.g., learning rate, batch size, training epochs, number of generated training samples, label smoothing and temporal ensembling hyperparameters) the same across all tasks. See Appendix B Table 10 for details. |
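
The setup row above notes that label smoothing and temporal ensembling hyperparameters are kept fixed across tasks. As a rough illustration of how these two regularizers can be combined when fine-tuning a classifier on noisy generated data, here is a minimal PyTorch-style sketch. It assumes a Hugging Face-style sequence classifier exposing `.logits`; the hyperparameter names and values (`eps`, `lam`, `momentum`) are illustrative placeholders, not the paper's settings from Appendix B.

```python
import torch
import torch.nn.functional as F


def smoothed_targets(labels, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the assigned class,
    eps / (num_classes - 1) spread over the remaining classes."""
    off_value = eps / (num_classes - 1)
    targets = torch.full((labels.size(0), num_classes), off_value, device=labels.device)
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return targets


def training_step(model, input_ids, attention_mask, labels,
                  ensemble_probs, sample_idx, num_classes,
                  eps=0.1, lam=1.0, momentum=0.9):
    """One fine-tuning step combining label-smoothed cross-entropy with a
    temporal-ensembling consistency term (simplified; no bias correction).

    ensemble_probs: CPU tensor [num_train_samples, num_classes] holding the
    moving-average predictions; sample_idx: CPU LongTensor of batch indices."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    probs = F.softmax(logits, dim=-1)

    # Cross-entropy against smoothed (possibly noisy) generated labels.
    targets = smoothed_targets(labels, num_classes, eps).to(logits.device)
    ce_loss = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    # Consistency between current predictions and the ensembled predictions.
    ens = ensemble_probs[sample_idx].to(logits.device)
    consistency = F.mse_loss(probs, ens)

    # Update the stored ensemble predictions (no gradient flows through them).
    with torch.no_grad():
        ensemble_probs[sample_idx] = (
            momentum * ensemble_probs[sample_idx]
            + (1.0 - momentum) * probs.detach().cpu()
        )

    return ce_loss + lam * consistency
```

This sketch only shows the loss composition; the paper additionally reports quality-based selection of generated samples and standard fine-tuning details (learning rate, batch size, epochs) listed in its Appendix B Table 10.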