Uncertainty-aware Self-training for Few-shot Text Classification

Authors: Subhabrata Mukherjee, Ahmed Hassan Awadallah

NeurIPS 2020

Reproducibility assessment: each variable below is listed with its result and the supporting LLM response.
Research Type: Experimental
LLM Response: "We show our methods, leveraging only 20-30 labeled samples per class for each task for training and validation, perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labels, with an aggregate accuracy of 91% and improvements of up to 12% over baselines. As an application, we focus on text classification with five benchmark datasets. We perform large-scale experiments with data from five domains for different tasks, as summarized in Table 1."
Researcher Affiliation: Industry
LLM Response: "Subhabrata Mukherjee, Microsoft Research, Redmond, WA (submukhe@microsoft.com); Ahmed Hassan Awadallah, Microsoft Research, Redmond, WA (hassanam@microsoft.com)"
Pseudocode: Yes
LLM Response: "Algorithm 1: Uncertainty-aware self-training (UST)."
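
The paper's Algorithm 1 is not reproduced in this report. Below is a minimal runnable sketch of an uncertainty-aware self-training loop in the spirit the paper describes: a teacher trained on the small labeled set, MC-dropout passes over the unlabeled pool, BALD-based sampling that favors low-uncertainty examples, and inverse-variance sample weights for confident learning. The toy linear model, data, and all constants are illustrative stand-ins, not the authors' implementation (see http://aka.ms/UST for that).

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES = 2

def stochastic_predict(weights, X, T=10):
    """T stochastic forward passes (MC dropout kept on at inference),
    simulated here by perturbing a toy linear model's weights.
    Returns class probabilities of shape (T, n, N_CLASSES)."""
    out = []
    for _ in range(T):
        w = weights + rng.normal(scale=0.1, size=weights.shape)
        logits = X @ w
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        out.append(e / e.sum(axis=1, keepdims=True))
    return np.stack(out)

def bald(probs):
    """BALD information gain H(E[p]) - E[H(p)]; low = confidently easy."""
    mean_p = probs.mean(axis=0)
    h_mean = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)
    mean_h = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean(axis=0)
    return h_mean - mean_h

def train(X, y, sample_w=None):
    """Stand-in for fine-tuning BERT: weighted least squares on one-hot y."""
    Y = np.eye(N_CLASSES)[y]
    if sample_w is not None:
        sw = np.sqrt(sample_w)[:, None]
        X, Y = X * sw, Y * sw
    return np.linalg.lstsq(X, Y, rcond=None)[0]

X_l = rng.normal(size=(30, 5))                    # small labeled set D_l
y_l = rng.integers(0, N_CLASSES, size=30)
X_u = rng.normal(size=(500, 5))                   # large unlabeled pool D_u

teacher = train(X_l, y_l)
for _ in range(3):                                # self-training rounds
    probs = stochastic_predict(teacher, X_u)      # T MC-dropout passes
    pseudo = probs.mean(axis=0).argmax(axis=1)    # pseudo-labels
    b = bald(probs)
    easiness = 1.0 - (b - b.min()) / (np.ptp(b) + 1e-12)
    pick = rng.choice(len(X_u), size=100, replace=False,
                      p=easiness / easiness.sum())    # favor easy examples
    var = probs[:, pick, :].var(axis=0).max(axis=-1)  # predictive variance
    conf_w = 1.0 / (var + 1e-6)
    conf_w = conf_w / conf_w.max()                # confident-learning weights
    X_t = np.vstack([X_l, X_u[pick]])
    y_t = np.concatenate([y_l, pseudo[pick]])
    w_t = np.concatenate([np.ones(len(y_l)), conf_w])
    teacher = train(X_t, y_t, w_t)                # student becomes new teacher
```
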
Open Source Code: Yes
LLM Response: "Code is available at http://aka.ms/UST"
Open Datasets: Yes
LLM Response: "SST-2 [Socher et al., 2013], IMDB [Maas et al., 2011], and Elec [McAuley and Leskovec, 2013] are used for sentiment classification of movie reviews and Amazon electronics product reviews, respectively. The other two datasets, DBpedia [Zhang et al., 2015] and AG News [Zhang et al., 2015], are used for topic classification of Wikipedia and news articles, respectively."
Dataset Splits: Yes
LLM Response: "Specifically, we consider K = 30 instances for each class for training and similarly for validation, randomly sampled from the corresponding Train data in Table 1. We repeat each experiment five times with different random seeds and data splits, use the validation split to select the best model, and report the mean accuracy on the blind test data."
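
A minimal sketch of this split protocol, assuming a straightforward stratified implementation: K = 30 examples per class for training, another disjoint K per class for validation, repeated for five seeds. The function name `few_shot_split` and the toy labels are hypothetical.

```python
import numpy as np

def few_shot_split(labels, k=30, seed=0):
    """Return disjoint (train_idx, val_idx), each with k examples per class."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:k])          # K instances for training...
        val_idx.extend(idx[k:2 * k])       # ...and K more for validation
    return np.asarray(train_idx), np.asarray(val_idx)

labels = np.random.default_rng(1).integers(0, 4, size=10_000)  # toy Train labels
for seed in range(5):                      # five runs, different seeds/splits
    tr, va = few_shot_split(labels, k=30, seed=seed)
    assert len(tr) == len(va) == 30 * 4 and not set(tr) & set(va)
```
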
Hardware Specification: Yes
LLM Response: "We implement our framework in Tensorflow and use four Tesla V100 GPUs for experimentation."
Software Dependencies: No
LLM Response: The paper mentions using "Tensorflow" but does not specify a version number or other software dependencies with their versions.
Experiment Setup: Yes
LLM Response: "Specifically, we consider K = 30 instances for each class for training and similarly for validation, randomly sampled from the corresponding Train data in Table 1. We use Adam [Kingma and Ba, 2015] as the optimizer with early stopping, and use the best model found so far by validation loss for all models. Hyper-parameter configurations and detailed model settings are presented in the Appendix. Our first baseline is BERT-Base with 110 million parameters, fine-tuned on K labeled samples D_l for the downstream tasks with a small batch size of 4 samples, with the remaining hyper-parameters retained from its original implementation."
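
A hedged sketch of this baseline using Hugging Face Transformers with TensorFlow (the paper implements its framework in TensorFlow directly). The learning rate, epoch cap, patience, and toy data are assumptions; the paper's exact hyper-parameters are in its Appendix.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFBertForSequenceClassification

# Toy few-shot data standing in for the K = 30 labeled samples per class.
train_texts = ["a great movie"] * 30 + ["a terrible movie"] * 30
train_labels = [1] * 30 + [0] * 30
val_texts, val_labels = train_texts, train_labels   # placeholder validation set

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # ~110M parameters

def encode(texts):
    return dict(tok(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="tf"))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),   # assumed LR
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Adam with early stopping; keep the best model by validation loss.
model.fit(encode(train_texts), tf.constant(train_labels),
          validation_data=(encode(val_texts), tf.constant(val_labels)),
          batch_size=4,                             # small batch size, as quoted
          epochs=50,                                # assumed cap
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor="val_loss", patience=5, restore_best_weights=True)])
```
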