FreeLB: Enhanced Adversarial Training for Natural Language Understanding
Authors: Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, Jingjing Liu
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of BERT-base model from 78.3 to 79.4, and RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art single-model test accuracies of 85.44% and 67.75% on ARC-Easy and ARC-Challenge. Experiments on CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of RoBERTa-large model on other tasks as well. |
| Researcher Affiliation | Collaboration | Chen Zhu1, Yu Cheng2, Zhe Gan2, Siqi Sun2, Tom Goldstein1, Jingjing Liu2 1University of Maryland, College Park 2Microsoft Dynamics 365 AI Research |
| Pseudocode | Yes | The overall procedure is shown in Algorithm 1, in which X + δ_t is an approximation to the local maximum within the intersection of two balls I_t = B_{X+δ_0}(αt) ∩ B_X(ϵ). (A minimal illustrative sketch of this update step appears after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/zhuchen03/FreeLB. |
| Open Datasets | Yes | GLUE Benchmark. The GLUE benchmark is a collection of 9 natural language understanding tasks, namely Corpus of Linguistic Acceptability (CoLA; Warstadt et al. (2018)), Stanford Sentiment Treebank (SST; Socher et al. (2013)), Microsoft Research Paraphrase Corpus (MRPC; Dolan & Brockett (2005)), Semantic Textual Similarity Benchmark (STS; Agirre et al. (2007)), Quora Question Pairs (QQP; Iyer et al. (2017)), Multi-Genre NLI (MNLI; Williams et al. (2018)), Question NLI (QNLI; Rajpurkar et al. (2016)), Recognizing Textual Entailment (RTE; Dagan et al. (2006); Bar-Haim et al. (2006); Giampiccolo et al. (2007); Bentivogli et al. (2009)) and Winograd NLI (WNLI; Levesque et al. (2011)). |
| Dataset Splits | Yes | We summarize results on the dev sets of GLUE in Table 1, comparing the proposed FreeLB against other adversarial training algorithms (PGD (Madry et al., 2018) and FreeAT (Shafahi et al., 2019)). |
| Hardware Specification | No | The paper does not specify any hardware details like GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions using 'Hugging Face implementation' for BERT-base and 'fairseq implementation' for RoBERTa, but it does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or specific libraries. |
| Experiment Setup | Yes | Like other adversarial training methods, FreeLB introduces three additional hyper-parameters: step size α, maximum perturbation ϵ, and number of steps m. For all other hyper-parameters such as learning rate and number of iterations, we either search in the same interval as RoBERTa (on CommonsenseQA, ARC, and WNLI), or use exactly the same setting as RoBERTa (except for MRPC, where we find using a learning rate of 5e-6 gives better results). We list the best combinations for α, ϵ and m for each of the GLUE tasks in Table 6. For WSC/WNLI, the best combination is ϵ = 1e-2, α = 5e-3, m = 2. |
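
For concreteness, below is a minimal PyTorch sketch of the FreeLB update step described in the Pseudocode and Experiment Setup rows. It is an illustrative reconstruction, not the authors' released code: the function name `freelb_step`, the Hugging Face-style `inputs_embeds`/`labels` interface, and the simplified perturbation initialization are assumptions made for this sketch.

```python
# Minimal FreeLB-style training step (sketch, for illustration only).
# Assumes a Hugging Face-style classification model that accepts `inputs_embeds`
# and returns a loss when `labels` are given. adv_lr ~ step size alpha,
# adv_max_norm ~ maximum perturbation epsilon, adv_steps ~ number of steps m.
import torch


def freelb_step(model, optimizer, input_ids, attention_mask, labels,
                adv_lr=1e-1, adv_max_norm=3e-1, adv_steps=3):
    """Run m ascent steps on the embedding perturbation delta, accumulating
    parameter gradients at each step, then take one descent step on theta."""
    with torch.no_grad():
        embeds = model.get_input_embeddings()(input_ids)

    # delta_0: uniform noise inside the epsilon-ball (simplified initialization).
    delta = torch.zeros_like(embeds).uniform_(-adv_max_norm, adv_max_norm)
    delta = delta / (embeds.size(-1) ** 0.5)
    delta.requires_grad_()

    optimizer.zero_grad()
    for _ in range(adv_steps):
        # Recompute embeddings so each backward pass has a fresh graph.
        embeds = model.get_input_embeddings()(input_ids)
        outputs = model(inputs_embeds=embeds + delta,
                        attention_mask=attention_mask, labels=labels)

        # Accumulate parameter gradients from every ascent step (the "free" part);
        # dividing by adv_steps averages them, matching the 1/K factor in Algorithm 1.
        loss = outputs.loss / adv_steps
        loss.backward()

        # Gradient ascent on delta with a normalized gradient, then project each
        # example's perturbation back onto the Frobenius-norm ball of radius epsilon.
        grad = delta.grad.detach()
        grad_norm = grad.reshape(grad.size(0), -1).norm(p=2, dim=1).clamp_min(1e-8)
        delta = (delta + adv_lr * grad / grad_norm.view(-1, 1, 1)).detach()
        delta = delta.renorm(p=2, dim=0, maxnorm=adv_max_norm)
        delta.requires_grad_()

    optimizer.step()  # update theta with the accumulated (averaged) gradient
```

With the WSC/WNLI setting quoted in the Experiment Setup row, a call under these assumptions would look like `freelb_step(model, optimizer, input_ids, attention_mask, labels, adv_lr=5e-3, adv_max_norm=1e-2, adv_steps=2)`.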