Uncertainty-aware Self-training for Few-shot Text Classification
Authors: Subhabrata Mukherjee, Ahmed Awadallah
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labels with an aggregate accuracy of 91% and improvement of up to 12% over baselines. As an application, we focus on text classification with five benchmark datasets. We perform large-scale experiments with data from five domains for different tasks as summarized in Table 1. |
| Researcher Affiliation | Industry | Subhabrata Mukherjee Microsoft Research Redmond, WA submukhe@microsoft.com Ahmed Hassan Awadallah Microsoft Research Redmond, WA hassanam@microsoft.com |
| Pseudocode | Yes | Algorithm 1: Uncertainty-aware self-training (UST). |
| Open Source Code | Yes | 1Code is available at http://aka.ms/UST |
| Open Datasets | Yes | SST-2 [Socher et al., 2013] and IMDB [Maas et al., 2011] are used for sentiment classification of movie reviews, and Elec [McAuley and Leskovec, 2013] for Amazon electronics product reviews. The other two datasets, Dbpedia [Zhang et al., 2015] and Ag News [Zhang et al., 2015], are used for topic classification of Wikipedia and news articles respectively. |
| Dataset Splits | Yes | Specifically, we consider K = 30 instances for each class for training and similar for validation, that are randomly sampled from the corresponding Train data in Table 1. We repeat each experiment five times with different random seeds and data splits, use the validation split to select the best model, and report the mean accuracy on the blind test data. |
| Hardware Specification | Yes | We implement our framework in Tensorflow and use four Tesla V100 GPUs for experimentation. |
| Software Dependencies | No | The paper mentions using "Tensorflow" but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | Specifically, we consider K = 30 instances for each class for training and similar for validation, that are randomly sampled from the corresponding Train data in Table 1. We use Adam [Kingma and Ba, 2015] as the optimizer with early stopping and use the best model found so far from the validation loss for all the models. Hyper-parameter configurations with detailed model settings presented in Appendix. Our first baseline is BERT-Base with 110 MM parameters fine-tuned on K labeled samples Dl for downstream tasks with a small batch-size of 4 samples, and remaining hyper-parameters retained from its original implementation. |
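The Pseudocode row above points to Algorithm 1 (UST): fine-tune a teacher on the few labeled samples, run multiple stochastic forward passes with Monte Carlo dropout over the unlabeled pool to estimate predictive uncertainty, then preferentially pseudo-label low-uncertainty samples for training the student. A minimal numpy sketch of that selection step is below; note this is an illustrative simplification, not the paper's exact procedure (the paper uses BALD-style acquisition and confidence-based sample weighting, whereas this sketch simply ranks samples by the variance of the winning-class probability, and the function names are hypothetical):

```python
import numpy as np

def mc_dropout_probs(stochastic_fn, X, T=10):
    """Stack class probabilities from T stochastic forward passes (MC dropout).
    stochastic_fn is any callable that returns (n, C) probabilities and keeps
    dropout active at inference time."""
    return np.stack([stochastic_fn(X) for _ in range(T)])  # shape (T, n, C)

def select_confident(probs, n_select):
    """Rank unlabeled samples by the variance of their winning-class
    probability across the T passes; return the n_select most stable
    sample indices plus hard pseudo-labels for every sample."""
    mean = probs.mean(axis=0)                  # (n, C) averaged prediction
    pseudo = mean.argmax(axis=1)               # hard pseudo-labels
    var = probs.var(axis=0)[np.arange(len(pseudo)), pseudo]
    return np.argsort(var)[:n_select], pseudo  # low variance = low uncertainty

# Toy example: 2 passes over 2 unlabeled samples with 2 classes.
# Sample 0 is stable across passes; sample 1 flips, so with n_select=1
# only sample 0 should be chosen for pseudo-labeling.
probs = np.array([
    [[0.90, 0.10], [0.60, 0.40]],
    [[0.88, 0.12], [0.30, 0.70]],
])
selected, labels = select_confident(probs, n_select=1)
```

In the full UST loop this selection is repeated each round: the student trained on the selected pseudo-labeled samples becomes the next teacher, which matches the iterative teacher–student structure of Algorithm 1.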