Uncertainty-aware Self-training for Few-shot Text Classification

Authors: Subhabrata Mukherjee, Ahmed Hassan Awadallah

NeurIPS 2020

Reproducibility assessment: each variable below is listed with its result and the supporting LLM response.
Research Type: Experimental
LLM Response: "We show our methods, leveraging only 20-30 labeled samples per class for each task for training and validation, perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labels, with an aggregate accuracy of 91% and improvements of up to 12% over baselines. As an application, we focus on text classification with five benchmark datasets. We perform large-scale experiments with data from five domains for different tasks, as summarized in Table 1."
Researcher Affiliation: Industry
LLM Response: "Subhabrata Mukherjee, Microsoft Research, Redmond, WA (submukhe@microsoft.com); Ahmed Hassan Awadallah, Microsoft Research, Redmond, WA (hassanam@microsoft.com)"
Pseudocode: Yes
LLM Response: "Algorithm 1: Uncertainty-aware self-training (UST)."
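
The paper's Algorithm 1 is not reproduced in this report. Below is a minimal runnable sketch of an uncertainty-aware self-training loop in the spirit the paper describes: a teacher trained on the small labeled set, MC-dropout passes over the unlabeled pool, BALD-based sampling that favors low-uncertainty examples, and inverse-variance sample weights for confident learning. The toy linear model, data, and all constants are illustrative stand-ins, not the authors' implementation (see http://aka.ms/UST for that).

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES = 2

def stochastic_predict(weights, X, T=10):
    """T stochastic forward passes (MC dropout kept on at inference),
    simulated here by perturbing a toy linear model's weights.
    Returns class probabilities of shape (T, n, N_CLASSES)."""
    out = []
    for _ in range(T):
        w = weights + rng.normal(scale=0.1, size=weights.shape)
        logits = X @ w
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        out.append(e / e.sum(axis=1, keepdims=True))
    return np.stack(out)

def bald(probs):
    """BALD information gain H(E[p]) - E[H(p)]; low = confidently easy."""
    mean_p = probs.mean(axis=0)
    h_mean = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)
    mean_h = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean(axis=0)
    return h_mean - mean_h

def train(X, y, sample_w=None):
    """Stand-in for fine-tuning BERT: weighted least squares on one-hot y."""
    Y = np.eye(N_CLASSES)[y]
    if sample_w is not None:
        sw = np.sqrt(sample_w)[:, None]
        X, Y = X * sw, Y * sw
    return np.linalg.lstsq(X, Y, rcond=None)[0]

X_l = rng.normal(size=(30, 5))                    # small labeled set D_l
y_l = rng.integers(0, N_CLASSES, size=30)
X_u = rng.normal(size=(500, 5))                   # large unlabeled pool D_u

teacher = train(X_l, y_l)
for _ in range(3):                                # self-training rounds
    probs = stochastic_predict(teacher, X_u)      # T MC-dropout passes
    pseudo = probs.mean(axis=0).argmax(axis=1)    # pseudo-labels
    b = bald(probs)
    easiness = 1.0 - (b - b.min()) / (np.ptp(b) + 1e-12)
    pick = rng.choice(len(X_u), size=100, replace=False,
                      p=easiness / easiness.sum())    # favor easy examples
    var = probs[:, pick, :].var(axis=0).max(axis=-1)  # predictive variance
    conf_w = 1.0 / (var + 1e-6)
    conf_w = conf_w / conf_w.max()                # confident-learning weights
    X_t = np.vstack([X_l, X_u[pick]])
    y_t = np.concatenate([y_l, pseudo[pick]])
    w_t = np.concatenate([np.ones(len(y_l)), conf_w])
    teacher = train(X_t, y_t, w_t)                # student becomes new teacher
```
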
Open Source Code: Yes
LLM Response: "Code is available at http://aka.ms/UST"
Open Datasets: Yes
LLM Response: "SST-2 [Socher et al., 2013], IMDB [Maas et al., 2011], and Elec [McAuley and Leskovec, 2013] are used for sentiment classification of movie reviews and Amazon electronics product reviews, respectively. The other two datasets, DBpedia [Zhang et al., 2015] and AG News [Zhang et al., 2015], are used for topic classification of Wikipedia and news articles, respectively."
Dataset Splits: Yes
LLM Response: "Specifically, we consider K = 30 instances for each class for training and similarly for validation, randomly sampled from the corresponding Train data in Table 1. We repeat each experiment five times with different random seeds and data splits, use the validation split to select the best model, and report the mean accuracy on the blind test data."
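
A minimal sketch of this split protocol, assuming a straightforward stratified implementation: K = 30 examples per class for training, another disjoint K per class for validation, repeated for five seeds. The function name `few_shot_split` and the toy labels are hypothetical.

```python
import numpy as np

def few_shot_split(labels, k=30, seed=0):
    """Return disjoint (train_idx, val_idx), each with k examples per class."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:k])          # K instances for training...
        val_idx.extend(idx[k:2 * k])       # ...and K more for validation
    return np.asarray(train_idx), np.asarray(val_idx)

labels = np.random.default_rng(1).integers(0, 4, size=10_000)  # toy Train labels
for seed in range(5):                      # five runs, different seeds/splits
    tr, va = few_shot_split(labels, k=30, seed=seed)
    assert len(tr) == len(va) == 30 * 4 and not set(tr) & set(va)
```
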
Hardware Specification: Yes
LLM Response: "We implement our framework in Tensorflow and use four Tesla V100 GPUs for experimentation."
Software Dependencies: No
LLM Response: The paper mentions using "Tensorflow" but does not specify a version number or other software dependencies with their versions.
Experiment Setup: Yes
LLM Response: "Specifically, we consider K = 30 instances for each class for training and similarly for validation, randomly sampled from the corresponding Train data in Table 1. We use Adam [Kingma and Ba, 2015] as the optimizer with early stopping, and use the best model found so far by validation loss for all models. Hyper-parameter configurations and detailed model settings are presented in the Appendix. Our first baseline is BERT-Base with 110 million parameters, fine-tuned on K labeled samples D_l for the downstream tasks with a small batch size of 4 samples, with the remaining hyper-parameters retained from its original implementation."
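
A hedged sketch of this baseline using Hugging Face Transformers with TensorFlow (the paper implements its framework in TensorFlow directly). The learning rate, epoch cap, patience, and toy data are assumptions; the paper's exact hyper-parameters are in its Appendix.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFBertForSequenceClassification

# Toy few-shot data standing in for the K = 30 labeled samples per class.
train_texts = ["a great movie"] * 30 + ["a terrible movie"] * 30
train_labels = [1] * 30 + [0] * 30
val_texts, val_labels = train_texts, train_labels   # placeholder validation set

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # ~110M parameters

def encode(texts):
    return dict(tok(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="tf"))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),   # assumed LR
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Adam with early stopping; keep the best model by validation loss.
model.fit(encode(train_texts), tf.constant(train_labels),
          validation_data=(encode(val_texts), tf.constant(val_labels)),
          batch_size=4,                             # small batch size, as quoted
          epochs=50,                                # assumed cap
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor="val_loss", patience=5, restore_best_weights=True)])
```
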