MAT: Mixed-Strategy Game of Adversarial Training in Fine-tuning

Authors: Zhehua Zhong, Tianyi Chen, Zhen Wang

IJCAI 2023

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "To verify the effectiveness of MAT, we conducted extensive benchmark experiments on large-scale pre-trained models such as BERT and RoBERTa. MAT significantly outperforms the state-of-the-art methods on both the GLUE and ANLI benchmarks in terms of generalization and robustness. ... we conducted a comprehensive evaluation of MAT using two widely recognized natural language understanding benchmarks: GLUE [Wang et al., 2019] and ANLI [Nie et al., 2020]."

Researcher Affiliation: Collaboration
LLM Response: Zhehua Zhong (1), Tianyi Chen (2), and Zhen Wang (1); (1) School of Cyberspace, Hangzhou Dianzi University; (2) Microsoft.

Pseudocode: Yes
LLM Response: Algorithm 1 (EMD Adversarial Training) and Algorithm 2 (MAT: Mixed-Strategy Adversarial Training) are given in the paper; a sketch of the overall training step follows below.

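The paper's pseudocode is not reproduced in this report. As a rough illustration only, the following is a minimal PyTorch sketch of a sampling-based adversarial training step in the spirit of Algorithm 2, assuming an embedding-level perturbation updated by SGLD and an adversarial loss averaged over the K sampled perturbations. The function name `mat_style_step`, the model interface (`inputs_embeds` in, logits out), and all defaults are illustrative assumptions, not the authors' implementation.

```python
import torch

def mat_style_step(model, embeds, labels, loss_fn,
                   K=10, gamma=1e-5, eps=1e-5, lam=1.0):
    """Hedged sketch of one mixed-strategy adversarial training step.

    Assumes `model` maps `inputs_embeds` to logits. K, gamma, eps, and
    lam mirror the paper's K, sampling step size, max perturbation
    norm, and lambda hyperparameters, respectively.
    """
    # Loss on the clean embeddings.
    clean_loss = loss_fn(model(inputs_embeds=embeds), labels)

    delta = torch.zeros_like(embeds, requires_grad=True)
    adv_losses = []
    for _ in range(K):
        adv_loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
        adv_losses.append(adv_loss)
        # SGLD ascent on the perturbation: gradient step plus Gaussian noise.
        grad, = torch.autograd.grad(adv_loss, delta, retain_graph=True)
        with torch.no_grad():
            noise = torch.randn_like(delta)
            delta = delta + gamma * grad + (2 * gamma) ** 0.5 * noise
            # Clip to the maximum perturbation norm (1e-5 in the paper).
            norm = delta.norm()
            if norm > eps:
                delta = delta * (eps / norm)
        delta = delta.detach().requires_grad_(True)

    # Mixed strategy: average the adversarial losses over the K samples.
    return clean_loss + lam * torch.stack(adv_losses).mean()
```

A training loop would call `loss = mat_style_step(...)`, then `loss.backward()` and an optimizer step, exactly as with an ordinary loss.
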
Open Source Code: Yes
LLM Response: "Code is available at https://github.com/Charles-Zhong/MAT."

Open Datasets: Yes
LLM Response: "GLUE (General Language Understanding Evaluation) [Wang et al., 2019] is a benchmark collection of nine tasks for training and evaluating natural language understanding models. ANLI (Adversarial Natural Language Inference) [Nie et al., 2020] is an adversarial benchmark compiled through an iterative, adversarial human-and-model-in-the-loop process." Table 6 of the paper summarizes the datasets in both benchmarks.

Dataset Splits: Yes
LLM Response: Table 6 of the paper summarizes the datasets in the GLUE and ANLI benchmarks, listing train, dev, and test set sizes for each.

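Both benchmarks are publicly available. As a convenience, the sketch below shows one way to load them and inspect their splits with the Huggingface datasets library; the Hub identifiers "glue"/"mnli" and "anli" are the library's names, and MNLI is used only as an example GLUE task.

```python
from datasets import load_dataset

# GLUE is distributed per task on the Hugging Face Hub; MNLI is one of
# the nine tasks and is used here only as an example configuration.
glue_mnli = load_dataset("glue", "mnli")
anli = load_dataset("anli")  # rounds R1-R3, each with train/dev/test

# Print split names and sizes, mirroring Table 6's train/dev/test summary.
print({name: len(split) for name, split in glue_mnli.items()})
print({name: len(split) for name, split in anli.items()})
```
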
Hardware Specification: Yes
LLM Response: "Our experiments were conducted on multiple servers that are equipped with NVIDIA V100 and NVIDIA 3090 GPUs."

Software Dependencies: No
LLM Response: The paper mentions software such as PyTorch and the Huggingface transformers and datasets libraries, but does not specify their version numbers.

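Since the versions are unspecified, anyone reproducing the results should record the versions they actually used. A minimal way to do so:

```python
import datasets
import torch
import transformers

# Log the library versions in use, since the paper does not pin them.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
```
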
Experiment Setup: Yes
LLM Response: "The text data in all the datasets are tokenized, and we limit the maximum length to 512 tokens. ... we utilize a range of sampling techniques, including Stochastic Gradient Langevin Dynamics (SGLD) sampling, along with its preconditioned variants such as the RMSProp-preconditioned and Adam-preconditioned versions of SGLD." The paper further reports learning rates in {1e-5, 3e-5, 5e-5}, batch sizes in {8, 16, 32, 64}, and a maximum adversarial perturbation norm of 1e-5, with any perturbation exceeding that threshold clipped.

Table 7: Hyperparameter Search Range
Sampling Times K: {5, 10, 15, 20, 30}
Sampling Step Size γ: {1e-5, 3e-5, 5e-5}
Beta β: {0.1, 0.3, 0.5, 0.7, 0.9}
Lambda λ: {0.1, 1, 2, 3, 4, 5}
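For reference, the search ranges above translate directly into a grid sweep. The sketch below enumerates the full grid; the trainer entry point `run_trial` is hypothetical.

```python
from itertools import product

# Search ranges from Table 7, plus the learning rates and batch sizes
# quoted in the experiment setup above.
grid = {
    "sampling_times_K": [5, 10, 15, 20, 30],
    "sampling_step_gamma": [1e-5, 3e-5, 5e-5],
    "beta": [0.1, 0.3, 0.5, 0.7, 0.9],
    "lambda_": [0.1, 1, 2, 3, 4, 5],
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32, 64],
}

# Enumerate every configuration in the grid.
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    # run_trial(config)  # hypothetical trainer entry point
```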