MAT: Mixed-Strategy Game of Adversarial Training in Fine-tuning
Authors: Zhehua Zhong, Tianyi Chen, Zhen Wang
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness of MAT, we conducted extensive benchmark experiments on large-scale pre-trained models such as BERT and RoBERTa. MAT significantly outperforms state-of-the-art methods on both the GLUE and ANLI benchmarks in terms of generalization and robustness. In this section, we conduct a comprehensive evaluation of MAT on two widely recognized natural language understanding benchmarks: GLUE [Wang et al., 2019] and ANLI [Nie et al., 2020]. |
| Researcher Affiliation | Collaboration | Zhehua Zhong¹, Tianyi Chen², and Zhen Wang¹ (¹School of Cyberspace, Hangzhou Dianzi University; ²Microsoft) |
| Pseudocode | Yes | Algorithm 1 EMD Adversarial Training; Algorithm 2 MAT: Mixed-strategy Adversarial Training |
| Open Source Code | Yes | Code is available at https://github.com/Charles-Zhong/MAT. |
| Open Datasets | Yes | GLUE (General Language Understanding Evaluation) [Wang et al., 2019] is a benchmark collection of nine tasks for natural language understanding model training and evaluation. ANLI (Adversarial Natural Language Inference) [Nie et al., 2020] is an adversarial benchmark compiled through an iterative, adversarial human-and-model-in-the-loop process. Table 6 summarizes the datasets in the GLUE and ANLI benchmarks. Both are publicly available; a loading sketch appears below the table. |
| Dataset Splits | Yes | Table 6 summarizes the datasets in the GLUE and ANLI benchmarks, listing the Train-set, Dev-set, and Test-set sizes for each dataset. |
| Hardware Specification | Yes | Our experiments were conducted on multiple servers that are equipped with NVIDIA V100 and NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions software such as PyTorch and the Hugging Face transformers and datasets libraries, but does not specify their version numbers. |
| Experiment Setup | Yes | The text data in all datasets are tokenized, and the maximum length is limited to 512 tokens. ... we utilize a range of sampling techniques, including Stochastic Gradient Langevin Dynamics (SGLD) sampling, along with its preconditioned variants such as the RMSProp-preconditioned and Adam-preconditioned versions of SGLD (see the SGLD sketch below the table). We experiment with learning rates in {1e-5, 3e-5, 5e-5} and batch sizes of 8, 16, 32, and 64. To ensure robustness, we set the maximum norm of the adversarial perturbation to 1e-5 and clip any perturbation that exceeds this threshold. Table 7 (Hyperparameter Search Range): Sampling Times K ∈ {5, 10, 15, 20, 30}; Sampling Step Size γ ∈ {1e-5, 3e-5, 5e-5}; Beta β ∈ {0.1, 0.3, 0.5, 0.7, 0.9}; Lambda λ ∈ {0.1, 1, 2, 3, 4, 5}. A grid-search sketch over these ranges also follows the table. |
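
Since the paper reports using the Hugging Face datasets library and a 512-token limit, here is a minimal loading-and-tokenization sketch. The `bert-base-uncased` checkpoint and the choice of MNLI as the GLUE subtask are assumptions for illustration; the paper evaluates BERT and RoBERTa without naming exact checkpoints here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# GLUE is loaded per subtask (MNLI shown as an assumed example);
# ANLI ships all three rounds with train_r1 ... test_r3 splits.
mnli = load_dataset("glue", "mnli")   # train / validation_matched / ...
anli = load_dataset("anli")           # train_r1, dev_r1, test_r1, ...

# Tokenize premise/hypothesis pairs, capped at 512 tokens as in the setup.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, max_length=512)

mnli_enc = mnli.map(encode, batched=True)
```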
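The setup row names SGLD sampling of adversarial perturbations with a maximum perturbation norm of 1e-5. Below is a minimal sketch of such a sampler on the input embeddings, assuming `model` maps perturbed embeddings to logits. It illustrates the generic SGLD-with-clipping idea only, not the authors' exact MAT update, which is given in their Algorithms 1 and 2 and the released code at https://github.com/Charles-Zhong/MAT.

```python
import torch

def sgld_perturbation(model, embeds, labels, loss_fn,
                      steps=10, step_size=3e-5, max_norm=1e-5):
    """Sample an adversarial perturbation delta on input embeddings via
    SGLD: a gradient-ascent step on the loss plus Gaussian noise, then
    norm clipping. A sketch, not the authors' MAT implementation."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):                 # "Sampling Times K"
        loss = loss_fn(model(embeds + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            noise = torch.randn_like(delta)
            # Langevin step: ascend the loss, inject sqrt(2*gamma) noise.
            # The RMSProp-/Adam-preconditioned variants additionally
            # rescale `grad` by a running second-moment estimate.
            delta += step_size * grad + (2 * step_size) ** 0.5 * noise
            # Clip the perturbation to the maximum norm (1e-5 in the paper).
            norm = delta.norm(p=2, dim=-1, keepdim=True).clamp_min(1e-12)
            delta *= (max_norm / norm).clamp(max=1.0)
    return delta.detach()
```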
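The Table 7 ranges imply a hyperparameter grid search. A sketch enumerating exactly the quoted ranges follows; `train_and_eval` is a hypothetical stand-in for one fine-tuning run, not a function from the paper or its codebase.

```python
from itertools import product

def train_and_eval(K, gamma, beta, lam):
    """Hypothetical placeholder for one MAT fine-tuning run returning a
    dev-set score; replace with a real training loop."""
    return 0.0

# Search ranges quoted from Table 7.
search_space = {
    "K":     [5, 10, 15, 20, 30],        # sampling times K
    "gamma": [1e-5, 3e-5, 5e-5],         # sampling step size
    "beta":  [0.1, 0.3, 0.5, 0.7, 0.9],  # beta
    "lam":   [0.1, 1, 2, 3, 4, 5],       # lambda
}

best_score, best_cfg = float("-inf"), None
for K, gamma, beta, lam in product(*search_space.values()):
    score = train_and_eval(K, gamma, beta, lam)
    if score > best_score:
        best_score = score
        best_cfg = dict(K=K, gamma=gamma, beta=beta, lam=lam)
```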