Token-Aware Virtual Adversarial Training in Natural Language Understanding

Authors: Linyang Li, Xipeng Qiu (pp. 8410-8418)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that the method improves the performance of pre-trained models such as BERT and ALBERT on various tasks by a considerable margin. The proposed method improves the GLUE benchmark score from 78.3 to 80.9 using the BERT model, and it also enhances performance on sequence labeling and text classification tasks. The authors construct extensive experiments to evaluate the effectiveness of these fine-grained token-aware virtual adversarial samples.
Researcher Affiliation | Academia | Linyang Li, Xipeng Qiu; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University; School of Computer Science, Fudan University; {linyangli19, xpqiu}@fudan.edu.cn
Pseudocode | Yes | The paper includes Algorithm 1, "Token-Aware Virtual Adversarial Training" (a hedged sketch of the general training loop it follows appears after this table).
Open Source Code | No | The paper refers to third-party open-source libraries and models such as Huggingface Transformers, BERT, and ALBERT (e.g., "We implement our TA-VAT method with PyTorch based on Huggingface Transformers."). It also states, "We re-implement results of BERT, FreeAT, and FreeLB methods based on their open-released codes." However, it does not state that the source code for *their* TA-VAT method is publicly available, nor does it provide a link to it.
Open Datasets | Yes | To evaluate the proposed TA-VAT, we construct extensive experiments over common NLP tasks: text classification, natural language inference, and named entity recognition. We test on widely-used datasets: the GLUE benchmark (Wang et al. 2019), the CoNLL2003 NER dataset (Tjong Kim Sang and De Meulder 2003), the Ontonotes 5.0 NER dataset (Weischedel et al. 2011), the IMDB dataset, and the AG's News dataset. GLUE is a collection of natural language understanding tasks, namely Multi-Genre Natural Language Inference (MNLI (Williams, Nangia, and Bowman 2018)); Quora Question Pairs (QQP); Recognizing Textual Entailment (RTE (Dagan, Glickman, and Magnini 2005)); Question Natural Language Inference (QNLI (Rajpurkar et al. 2016)); Microsoft Research Paraphrase Corpus (MRPC (Dolan and Brockett 2005)); Corpus of Linguistic Acceptability (CoLA (Warstadt, Singh, and Bowman 2018)); Stanford Sentiment Treebank (SST-2 (Socher et al. 2013)); and Semantic Textual Similarity Benchmark (STS-B (Agirre, Màrquez, and Wicentowski 2007)). Footnotes provide direct links to QQP (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) and IMDB (https://datasets.imdbws.com/).
Dataset Splits | No | The paper mentions using a development set (e.g., Table 1: "Evaluation results on the development set of GLUE benchmark"), which serves as the validation set. However, it does not provide per-dataset details on split percentages, sample counts, or splitting methodology (e.g., random seed, stratification) needed to fully reproduce the data partitioning across all experiments. GLUE has predefined splits (see the loading sketch after this table), but these are not stated explicitly in the paper's text.
Hardware Specification | Yes | All models are trained using NVIDIA Titan XP GPUs.
Software Dependencies | No | The paper states, "We implement our TA-VAT method with PyTorch based on Huggingface Transformers." and refers to specific GitHub repositories for BERT and ALBERT. However, it does not provide version numbers for PyTorch, Huggingface Transformers, or any other dependency, which a reproducible description would require.
Experiment Setup | No | The paper states: "Parameters such as the running epoch, learning rate, batch size and warmup step settings are the same as used in the standard fine-tuning process of BERT and ALBERT." and "As for hyper-parameters such as the adversarial training step K, the constrain bound of the perturbation ϵ, the initialization bound σ and the adversarial step size α, we adopt parameters the same as used in FreeLB for a fair comparison." The paper names these hyper-parameters but never lists their concrete values or system-level training settings, so reproduction requires an external lookup of the FreeLB configuration (see the illustrative configuration sketch below).
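
For readers checking the pseudocode claim above: Algorithm 1 follows the general pattern of FreeLB-style virtual adversarial fine-tuning, extended with token-level perturbations. The sketch below is a simplified reconstruction under that assumption, not the authors' released implementation; the function signature, default hyper-parameter values, and the single instance-level `delta` are illustrative (TA-VAT additionally maintains a global token-level perturbation vocabulary, which is omitted here).

```python
import torch

def adversarial_step(model, input_ids, attention_mask, labels,
                     adv_steps=3, adv_lr=0.1, adv_init_mag=0.05, adv_max_norm=0.1):
    """One FreeLB-style virtual adversarial training step: run K ascent
    steps on an embedding-space perturbation `delta`, accumulating model
    gradients along the way. (Hypothetical sketch, not the paper's code.)"""
    embeds = model.get_input_embeddings()(input_ids)             # (B, L, H)
    delta = torch.zeros_like(embeds).uniform_(-adv_init_mag, adv_init_mag)
    delta.requires_grad_()

    model.zero_grad()
    for _ in range(adv_steps):
        loss = model(inputs_embeds=embeds + delta,
                     attention_mask=attention_mask,
                     labels=labels).loss / adv_steps
        loss.backward()                                          # grads for model and delta
        grad = delta.grad.detach()
        # gradient ascent on delta, normalized per token, then project
        # back into the epsilon-ball around the clean embeddings
        delta = delta.detach() + adv_lr * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
        norm = delta.norm(dim=-1, keepdim=True)
        delta = (delta * (adv_max_norm / norm).clamp(max=1.0)).requires_grad_()
        embeds = model.get_input_embeddings()(input_ids)         # fresh graph for next backward
    return loss.item() * adv_steps                               # last-step loss, unscaled
```

The key design point in this family of methods is that model gradients accumulate across all K ascent steps, so the model is trained on every intermediate adversarial example rather than only the final one; the caller then applies a single optimizer step per batch.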
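
On the dataset-splits question: GLUE tasks ship with predefined train/validation/test partitions, so the missing detail is largely a documentation gap rather than a methodological one. A minimal check, assuming the Huggingface `datasets` package (which the paper does not itself mention):

```python
from datasets import load_dataset

# GLUE tasks come with predefined splits; no custom partitioning is needed.
sst2 = load_dataset("glue", "sst2")
print({split: len(sst2[split]) for split in sst2})
# {'train': 67349, 'validation': 872, 'test': 1821}
```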
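
On the experiment-setup question: the row above names the adversarial hyper-parameters (K, ϵ, σ, α) that the paper defers to FreeLB. A reproduction would need to pull concrete, per-task values from the FreeLB repository; the sketch below only illustrates what such a configuration looks like, using FreeLB's flag names with placeholder values, not the paper's actual settings.

```python
# Placeholder values only -- the paper adopts FreeLB's released,
# per-task configuration rather than listing these itself.
adv_config = {
    "adv_steps": 3,        # K: number of inner ascent steps per batch
    "adv_max_norm": 0.1,   # epsilon: bound on the accumulated perturbation norm
    "adv_init_mag": 0.05,  # sigma: magnitude of the random initialization
    "adv_lr": 0.1,         # alpha: adversarial step size
}
```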