Improved Text Classification via Contrastive Adversarial Training

Authors: Lin Pan, Chung-Wei Hang, Avirup Sil, Saloni Potdar

AAAI 2022, pp. 11130-11138

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On several GLUE benchmark tasks, our fine-tuned BERT-Large model outperforms BERT-Large baseline by 1.7% on average, and our fine-tuned RoBERTa-Large improves over RoBERTa-Large baseline by 1.3%. We additionally validate our method in different domains using three intent classification datasets, where our fine-tuned RoBERTa-Large outperforms RoBERTa-Large baseline by 1-2% on average.
Researcher Affiliation | Industry | IBM Watson, IBM Research AI; {panl, hangc, avi, potdars}@us.ibm.com
Pseudocode | No | The paper describes the method using text and mathematical formulas, but it does not include any structured pseudocode or algorithm blocks. (A hedged sketch of what a CAT training step could look like is given after this table.)
Open Source Code | No | The paper does not provide any explicit statement about releasing the source code for the methodology described, nor does it include a link to a code repository.
Open Datasets | Yes | We conduct experiments on seven tasks of the GLUE benchmark, including textual entailment (MNLI, RTE), question answering/entailment (QNLI), question paraphrase (QQP), paraphrase (MRPC), grammatical correctness (CoLA), and sentiment analysis (SST-2). Table 1 summarizes the statistics of the GLUE tasks. We additionally experiment on three commonly used intent classification datasets CLINC (Larson et al. 2019), BANKING (Casanueva et al. 2020) and HWU (Liu et al. 2019a). (A data-loading sketch for the GLUE tasks follows this table.)
Dataset Splits | Yes | Table 1 of the paper reports, per dataset, the task, number of labels, training-set size, evaluation metric, and average train/dev sequence lengths:
  Dataset | Task | Labels | Train | Metric | Train avg length | Dev avg length
  MNLI | Textual entailment | 3 | 393k | Accuracy | 29 | 28
  QQP | Question paraphrase | 2 | 364k | Accuracy | 21 | 21
  QNLI | Question answering/Textual entailment | 2 | 105k | Accuracy | 35 | 37
  MRPC | Paraphrase | 2 | 3.7k | F1 | 38 | 39
  RTE | Textual entailment | 2 | 2.5k | Accuracy | 51 | 50
  CoLA | Grammatical correctness | 2 | 8.5k | MCC | 8 | 8
  SST-2 | Sentiment analysis | 2 | 67k | Accuracy | 9 | 17
Hardware Specification | Yes | All our experiments were run on a single 32 GB V100 GPU.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer and BERT/RoBERTa, but it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | For all experiments, we use the AdamW optimizer with 0.01 weight decay and a linear learning rate scheduler. We set max sequence length to 128 and warm up the learning rate for the first 10% of the total iterations. For BERT-Large, we set batch size to 32 and fine-tune for 3 epochs. Grid search is performed over lr ∈ {0.00001, 0.00002, 0.00003}. For RoBERTa-Large, we sweep over the same learning rates as BERT-Large and batch size ∈ {16, 32}. For fine-tuning with CAT, we use the exact same hyperparameter settings as the baseline, and further perform grid search over ϵ ∈ {0.0001, 0.001, 0.005, 0.02}, τ ∈ {0.05, 0.06, 0.07, 0.08, 0.09, 0.1}, and λ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. (A configuration sketch follows this table.)
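
Since the Pseudocode row notes that the paper describes contrastive adversarial training (CAT) only in prose and formulas, the following is a minimal, hedged sketch of what a single CAT training step could look like in PyTorch with a Hugging Face sequence-classification model. Only the hyperparameter names (perturbation size ϵ, temperature τ, weight λ) come from the Experiment Setup row; the FGSM-style embedding perturbation, the in-batch contrastive loss, the use of the first-token representation, and the helper name cat_step are assumptions for illustration, not the authors' released method.

import torch
import torch.nn.functional as F


def cat_step(model, batch, eps=0.001, tau=0.07, lam=0.3):
    """Cross-entropy on the clean input plus an InfoNCE-style contrastive
    term between the clean view and an adversarially perturbed view."""
    # Clean forward pass over the word embeddings of a BERT/RoBERTa
    # sequence-classification model (Hugging Face transformers assumed).
    embeds = model.get_input_embeddings()(batch["input_ids"])
    clean = model(inputs_embeds=embeds,
                  attention_mask=batch["attention_mask"],
                  labels=batch["labels"],
                  output_hidden_states=True)
    ce_loss = clean.loss

    # Adversarial view: perturb the embeddings along the gradient of the
    # task loss, scaled to step size eps (FGSM-style; an assumption).
    grad = torch.autograd.grad(ce_loss, embeds, retain_graph=True)[0]
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv = model(inputs_embeds=embeds + delta.detach(),
                attention_mask=batch["attention_mask"],
                output_hidden_states=True)

    # Sentence representations: first-token hidden state of the last layer.
    z1 = F.normalize(clean.hidden_states[-1][:, 0], dim=-1)
    z2 = F.normalize(adv.hidden_states[-1][:, 0], dim=-1)

    # In-batch contrastive loss with temperature tau: each clean example
    # should match its own adversarial view against the rest of the batch.
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(logits, targets)

    # lam trades off the task loss and the contrastive term.
    return ce_loss + lam * contrastive

In a training loop, the returned loss would be backpropagated and the optimizer and scheduler from the configuration sketch further below stepped once per batch.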
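
As a companion to the Open Datasets and Dataset Splits rows, the seven GLUE tasks are publicly available and can be fetched, for example, with the Hugging Face datasets library; this tooling choice is an assumption, since the paper does not say how the authors obtained the data (the intent classification sets CLINC, BANKING and HWU are likewise public via the cited releases).

from datasets import load_dataset

# The seven GLUE tasks quoted in the Open Datasets row.
GLUE_TASKS = ["mnli", "qqp", "qnli", "mrpc", "rte", "cola", "sst2"]

for task in GLUE_TASKS:
    splits = load_dataset("glue", task)
    # Print the split sizes, which should roughly match the Train column above.
    print(task, {name: len(split) for name, split in splits.items()})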
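
Finally, a hedged configuration sketch for the Experiment Setup row. The concrete values (AdamW with 0.01 weight decay, a linear schedule with 10% warmup, max sequence length 128, batch size 32 and 3 epochs for BERT-Large, and the grids for lr, ϵ, τ and λ) are taken from that row; the checkpoint name, the transformers helpers, and the grid-search loop structure are assumptions.

import itertools

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

MAX_LEN, BATCH_SIZE, EPOCHS = 128, 32, 3            # BERT-Large settings
WEIGHT_DECAY, WARMUP_FRACTION = 0.01, 0.10
LR_GRID = [1e-5, 2e-5, 3e-5]                        # learning rates
EPSILON_GRID = [1e-4, 1e-3, 5e-3, 2e-2]             # perturbation sizes
TAU_GRID = [0.05, 0.06, 0.07, 0.08, 0.09, 0.1]      # temperatures
LAMBDA_GRID = [0.1, 0.2, 0.3, 0.4, 0.5]             # contrastive weights


def build_run(lr, num_train_examples, checkpoint="bert-large-uncased"):
    """AdamW + linear schedule, warming up over the first 10% of steps."""
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                              model_max_length=MAX_LEN)
    total_steps = (num_train_examples // BATCH_SIZE) * EPOCHS
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=WEIGHT_DECAY)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(WARMUP_FRACTION * total_steps),
        num_training_steps=total_steps)
    return model, tokenizer, optimizer, scheduler


# CAT keeps the baseline settings and adds a grid over (epsilon, tau, lambda);
# each combination would be trained and scored on the dev set.
cat_grid = list(itertools.product(LR_GRID, EPSILON_GRID, TAU_GRID, LAMBDA_GRID))
print(f"{len(cat_grid)} CAT configurations to evaluate")

Per the quoted setup, a RoBERTa-Large run would additionally sweep batch size over {16, 32}.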