Improved Text Classification via Contrastive Adversarial Training

Authors: Lin Pan, Chung-Wei Hang, Avirup Sil, Saloni Potdar

AAAI 2022, pp. 11130-11138

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On several GLUE benchmark tasks, our fine-tuned BERT-Large model outperforms BERT-Large baseline by 1.7% on average, and our fine-tuned RoBERTa-Large improves over RoBERTa-Large baseline by 1.3%. We additionally validate our method in different domains using three intent classification datasets, where our fine-tuned RoBERTa-Large outperforms RoBERTa-Large baseline by 1-2% on average.
Researcher Affiliation | Industry | IBM Watson, IBM Research AI; {panl, hangc, avi, potdars}@us.ibm.com
Pseudocode | No | The paper describes the method using text and mathematical formulas, but it does not include any structured pseudocode or algorithm blocks. (A hedged sketch of what a CAT training step could look like is given after this table.)
Open Source Code | No | The paper does not provide any explicit statement about releasing the source code for the methodology described, nor does it include a link to a code repository.
Open Datasets | Yes | We conduct experiments on seven tasks of the GLUE benchmark, including textual entailment (MNLI, RTE), question answering/entailment (QNLI), question paraphrase (QQP), paraphrase (MRPC), grammatical correctness (CoLA), and sentiment analysis (SST-2). Table 1 summarizes the statistics of the GLUE tasks. We additionally experiment on three commonly used intent classification datasets CLINC (Larson et al. 2019), BANKING (Casanueva et al. 2020) and HWU (Liu et al. 2019a). (A data-loading sketch for the GLUE tasks follows this table.)
Dataset Splits | Yes | Table 1 of the paper reports, per dataset, the task, number of labels, training-set size, evaluation metric, and average train/dev sequence lengths:
  Dataset | Task | Labels | Train | Metric | Train avg length | Dev avg length
  MNLI | Textual entailment | 3 | 393k | Accuracy | 29 | 28
  QQP | Question paraphrase | 2 | 364k | Accuracy | 21 | 21
  QNLI | Question answering/Textual entailment | 2 | 105k | Accuracy | 35 | 37
  MRPC | Paraphrase | 2 | 3.7k | F1 | 38 | 39
  RTE | Textual entailment | 2 | 2.5k | Accuracy | 51 | 50
  CoLA | Grammatical correctness | 2 | 8.5k | MCC | 8 | 8
  SST-2 | Sentiment analysis | 2 | 67k | Accuracy | 9 | 17
Hardware Specification | Yes | All our experiments were run on a single 32 GB V100 GPU.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer and BERT/RoBERTa, but it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | For all experiments, we use the AdamW optimizer with 0.01 weight decay and a linear learning rate scheduler. We set max sequence length to 128 and warm up the learning rate for the first 10% of the total iterations. For BERT-Large, we set batch size to 32 and fine-tune for 3 epochs. Grid search is performed over lr ∈ {0.00001, 0.00002, 0.00003}. For RoBERTa-Large, we sweep over the same learning rates as BERT-Large and batch size ∈ {16, 32}. For fine-tuning with CAT, we use the exact same hyperparameter settings as the baseline, and further perform grid search over ϵ ∈ {0.0001, 0.001, 0.005, 0.02}, τ ∈ {0.05, 0.06, 0.07, 0.08, 0.09, 0.1}, and λ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. (A configuration sketch follows this table.)
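
Since the Pseudocode row notes that the paper describes contrastive adversarial training (CAT) only in prose and formulas, the following is a minimal, hedged sketch of what a single CAT training step could look like in PyTorch with a Hugging Face sequence-classification model. Only the hyperparameter names (perturbation size ϵ, temperature τ, weight λ) come from the Experiment Setup row; the FGSM-style embedding perturbation, the in-batch contrastive loss, the use of the first-token representation, and the helper name cat_step are assumptions for illustration, not the authors' released method.

import torch
import torch.nn.functional as F


def cat_step(model, batch, eps=0.001, tau=0.07, lam=0.3):
    """Cross-entropy on the clean input plus an InfoNCE-style contrastive
    term between the clean view and an adversarially perturbed view."""
    # Clean forward pass over the word embeddings of a BERT/RoBERTa
    # sequence-classification model (Hugging Face transformers assumed).
    embeds = model.get_input_embeddings()(batch["input_ids"])
    clean = model(inputs_embeds=embeds,
                  attention_mask=batch["attention_mask"],
                  labels=batch["labels"],
                  output_hidden_states=True)
    ce_loss = clean.loss

    # Adversarial view: perturb the embeddings along the gradient of the
    # task loss, scaled to step size eps (FGSM-style; an assumption).
    grad = torch.autograd.grad(ce_loss, embeds, retain_graph=True)[0]
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv = model(inputs_embeds=embeds + delta.detach(),
                attention_mask=batch["attention_mask"],
                output_hidden_states=True)

    # Sentence representations: first-token hidden state of the last layer.
    z1 = F.normalize(clean.hidden_states[-1][:, 0], dim=-1)
    z2 = F.normalize(adv.hidden_states[-1][:, 0], dim=-1)

    # In-batch contrastive loss with temperature tau: each clean example
    # should match its own adversarial view against the rest of the batch.
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(logits, targets)

    # lam trades off the task loss and the contrastive term.
    return ce_loss + lam * contrastive

In a training loop, the returned loss would be backpropagated and the optimizer and scheduler from the configuration sketch further below stepped once per batch.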
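
As a companion to the Open Datasets and Dataset Splits rows, the seven GLUE tasks are publicly available and can be fetched, for example, with the Hugging Face datasets library; this tooling choice is an assumption, since the paper does not say how the authors obtained the data (the intent classification sets CLINC, BANKING and HWU are likewise public via the cited releases).

from datasets import load_dataset

# The seven GLUE tasks quoted in the Open Datasets row.
GLUE_TASKS = ["mnli", "qqp", "qnli", "mrpc", "rte", "cola", "sst2"]

for task in GLUE_TASKS:
    splits = load_dataset("glue", task)
    # Print the split sizes, which should roughly match the Train column above.
    print(task, {name: len(split) for name, split in splits.items()})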
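
Finally, a hedged configuration sketch for the Experiment Setup row. The concrete values (AdamW with 0.01 weight decay, a linear schedule with 10% warmup, max sequence length 128, batch size 32 and 3 epochs for BERT-Large, and the grids for lr, ϵ, τ and λ) are taken from that row; the checkpoint name, the transformers helpers, and the grid-search loop structure are assumptions.

import itertools

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

MAX_LEN, BATCH_SIZE, EPOCHS = 128, 32, 3            # BERT-Large settings
WEIGHT_DECAY, WARMUP_FRACTION = 0.01, 0.10
LR_GRID = [1e-5, 2e-5, 3e-5]                        # learning rates
EPSILON_GRID = [1e-4, 1e-3, 5e-3, 2e-2]             # perturbation sizes
TAU_GRID = [0.05, 0.06, 0.07, 0.08, 0.09, 0.1]      # temperatures
LAMBDA_GRID = [0.1, 0.2, 0.3, 0.4, 0.5]             # contrastive weights


def build_run(lr, num_train_examples, checkpoint="bert-large-uncased"):
    """AdamW + linear schedule, warming up over the first 10% of steps."""
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                              model_max_length=MAX_LEN)
    total_steps = (num_train_examples // BATCH_SIZE) * EPOCHS
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=WEIGHT_DECAY)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(WARMUP_FRACTION * total_steps),
        num_training_steps=total_steps)
    return model, tokenizer, optimizer, scheduler


# CAT keeps the baseline settings and adds a grid over (epsilon, tau, lambda);
# each combination would be trained and scored on the dev set.
cat_grid = list(itertools.product(LR_GRID, EPSILON_GRID, TAU_GRID, LAMBDA_GRID))
print(f"{len(cat_grid)} CAT configurations to evaluate")

Per the quoted setup, a RoBERTa-Large run would additionally sweep batch size over {16, 32}.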