MAT: Mixed-Strategy Game of Adversarial Training in Fine-tuning

Authors: Zhehua Zhong, Tianyi Chen, Zhen Wang

IJCAI 2023

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "To verify the effectiveness of MAT, we conducted extensive benchmark experiments on large-scale pre-trained models such as BERT and RoBERTa. MAT significantly outperforms the state-of-the-art methods on both the GLUE and ANLI benchmarks in terms of generalization and robustness. ... we conducted a comprehensive evaluation of MAT using two widely recognized natural language understanding benchmarks: GLUE [Wang et al., 2019] and ANLI [Nie et al., 2020]."

Researcher Affiliation: Collaboration
LLM Response: Zhehua Zhong (1), Tianyi Chen (2), and Zhen Wang (1); (1) School of Cyberspace, Hangzhou Dianzi University; (2) Microsoft.

Pseudocode: Yes
LLM Response: Algorithm 1 (EMD Adversarial Training) and Algorithm 2 (MAT: Mixed-Strategy Adversarial Training) are given in the paper; a sketch of the overall training step follows below.

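The paper's pseudocode is not reproduced in this report. As a rough illustration only, the following is a minimal PyTorch sketch of a sampling-based adversarial training step in the spirit of Algorithm 2, assuming an embedding-level perturbation updated by SGLD and an adversarial loss averaged over the K sampled perturbations. The function name `mat_style_step`, the model interface (`inputs_embeds` in, logits out), and all defaults are illustrative assumptions, not the authors' implementation.

```python
import torch

def mat_style_step(model, embeds, labels, loss_fn,
                   K=10, gamma=1e-5, eps=1e-5, lam=1.0):
    """Hedged sketch of one mixed-strategy adversarial training step.

    Assumes `model` maps `inputs_embeds` to logits. K, gamma, eps, and
    lam mirror the paper's K, sampling step size, max perturbation
    norm, and lambda hyperparameters, respectively.
    """
    # Loss on the clean embeddings.
    clean_loss = loss_fn(model(inputs_embeds=embeds), labels)

    delta = torch.zeros_like(embeds, requires_grad=True)
    adv_losses = []
    for _ in range(K):
        adv_loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
        adv_losses.append(adv_loss)
        # SGLD ascent on the perturbation: gradient step plus Gaussian noise.
        grad, = torch.autograd.grad(adv_loss, delta, retain_graph=True)
        with torch.no_grad():
            noise = torch.randn_like(delta)
            delta = delta + gamma * grad + (2 * gamma) ** 0.5 * noise
            # Clip to the maximum perturbation norm (1e-5 in the paper).
            norm = delta.norm()
            if norm > eps:
                delta = delta * (eps / norm)
        delta = delta.detach().requires_grad_(True)

    # Mixed strategy: average the adversarial losses over the K samples.
    return clean_loss + lam * torch.stack(adv_losses).mean()
```

A training loop would call `loss = mat_style_step(...)`, then `loss.backward()` and an optimizer step, exactly as with an ordinary loss.
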
Open Source Code: Yes
LLM Response: "Code is available at https://github.com/Charles-Zhong/MAT."

Open Datasets: Yes
LLM Response: "GLUE (General Language Understanding Evaluation) [Wang et al., 2019] is a benchmark collection of nine tasks for training and evaluating natural language understanding models. ANLI (Adversarial Natural Language Inference) [Nie et al., 2020] is an adversarial benchmark compiled through an iterative, adversarial human-and-model-in-the-loop process." Table 6 of the paper summarizes the datasets in both benchmarks.

Dataset Splits: Yes
LLM Response: Table 6 of the paper summarizes the datasets in the GLUE and ANLI benchmarks, listing train, dev, and test set sizes for each.

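Both benchmarks are publicly available. As a convenience, the sketch below shows one way to load them and inspect their splits with the Huggingface datasets library; the Hub identifiers "glue"/"mnli" and "anli" are the library's names, and MNLI is used only as an example GLUE task.

```python
from datasets import load_dataset

# GLUE is distributed per task on the Hugging Face Hub; MNLI is one of
# the nine tasks and is used here only as an example configuration.
glue_mnli = load_dataset("glue", "mnli")
anli = load_dataset("anli")  # rounds R1-R3, each with train/dev/test

# Print split names and sizes, mirroring Table 6's train/dev/test summary.
print({name: len(split) for name, split in glue_mnli.items()})
print({name: len(split) for name, split in anli.items()})
```
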
Hardware Specification: Yes
LLM Response: "Our experiments were conducted on multiple servers that are equipped with NVIDIA V100 and NVIDIA 3090 GPUs."

Software Dependencies: No
LLM Response: The paper mentions software such as PyTorch and the Huggingface transformers and datasets libraries, but does not specify their version numbers.

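Since the versions are unspecified, anyone reproducing the results should record the versions they actually used. A minimal way to do so:

```python
import datasets
import torch
import transformers

# Log the library versions in use, since the paper does not pin them.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
```
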
Experiment Setup: Yes
LLM Response: "The text data in all the datasets are tokenized, and we limit the maximum length to 512 tokens. ... we utilize a range of sampling techniques, including Stochastic Gradient Langevin Dynamics (SGLD) sampling, along with its preconditioned variants such as the RMSProp-preconditioned and Adam-preconditioned versions of SGLD." The paper further reports learning rates in {1e-5, 3e-5, 5e-5}, batch sizes in {8, 16, 32, 64}, and a maximum adversarial perturbation norm of 1e-5, with any perturbation exceeding that threshold clipped.

Table 7: Hyperparameter Search Range
Sampling Times K: {5, 10, 15, 20, 30}
Sampling Step Size γ: {1e-5, 3e-5, 5e-5}
Beta β: {0.1, 0.3, 0.5, 0.7, 0.9}
Lambda λ: {0.1, 1, 2, 3, 4, 5}
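For reference, the search ranges above translate directly into a grid sweep. The sketch below enumerates the full grid; the trainer entry point `run_trial` is hypothetical.

```python
from itertools import product

# Search ranges from Table 7, plus the learning rates and batch sizes
# quoted in the experiment setup above.
grid = {
    "sampling_times_K": [5, 10, 15, 20, 30],
    "sampling_step_gamma": [1e-5, 3e-5, 5e-5],
    "beta": [0.1, 0.3, 0.5, 0.7, 0.9],
    "lambda_": [0.1, 1, 2, 3, 4, 5],
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32, 64],
}

# Enumerate every configuration in the grid.
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    # run_trial(config)  # hypothetical trainer entry point
```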