Improving Neural Language Modeling via Adversarial Training

Authors: Dilin Wang, Chengyue Gong, Qiang Liu

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that our method improves on the single model state-of-the-art results for language modeling on Penn Treebank (PTB) and Wikitext-2, achieving test perplexity scores of 46.01 and 38.65, respectively. ... We demonstrate the effectiveness of our method in two applications: neural language modeling and neural machine translation, and compare them with state-of-the-art architectures and learning methods.
Researcher Affiliation | Academia | Department of Computer Science, UT Austin. Correspondence to: Dilin Wang <dilin@cs.utexas.edu>, Chengyue Gong <cygong@cs.utexas.edu>.
Pseudocode | Yes | Algorithm 1 Adversarial MLE Training
Open Source Code | Yes | Our code is available at: https://github.com/ChengyueGongR/advsoft.
Open Datasets | Yes | We test our method on three benchmark datasets: Penn Treebank (PTB), Wikitext-2 (WT2) and Wikitext-103 (WT103). ... The PTB corpus (Marcus et al., 1993) has been a standard dataset used for benchmarking language models.
Dataset Splits | Yes | The PTB corpus (Marcus et al., 1993) has been a standard dataset used for benchmarking language models. It consists of 923k training, 73k validation and 82k test words.
Hardware Specification | No | The paper mentions 'GPUs' in a general sense but does not provide specific hardware details such as exact GPU or CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions using 'Tensor2Tensor (Vaswani et al., 2018)' for implementation but does not specify version numbers for this or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We set α = 0.005 for the rest of experiments unless otherwise specified. ... For Transformer-Small, we stack a 4-layer encoder and a 4-layer decoder with 256-dimensional hidden units per layer. For Transformer-Base, we set the batch size to 6400 and the dropout rate to 0.4 following Wang et al. (2019).
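The Pseudocode row above cites Algorithm 1 ("Adversarial MLE Training"). As a rough illustration for readers of this summary, the sketch below is not the authors' released implementation (see the repository in the Open Source Code row); it assumes the method's closed-form worst-case L2 perturbation of the target word's output embedding, δ* = -ε·h/‖h‖, which amounts to subtracting ε‖h‖ from the gold logit before the cross-entropy loss. The function name `adversarial_mle_loss` and the `epsilon` argument are illustrative stand-ins; the paper controls the noise scale through the α = 0.005 quoted in the Experiment Setup row.

```python
import torch
import torch.nn.functional as F

def adversarial_mle_loss(hidden, targets, weight, epsilon=0.005):
    """Hedged sketch of adversarial MLE training for the softmax layer.

    hidden:  (batch, d) context vectors produced by the language model
    targets: (batch,)   gold next-word indices
    weight:  (vocab, d) output (softmax) word embeddings
    """
    logits = hidden @ weight.t()                          # (batch, vocab)
    # Closed-form worst-case L2 perturbation of the gold word's embedding:
    # delta* = -epsilon * h / ||h||, which lowers the gold logit by epsilon * ||h||.
    gold_penalty = epsilon * hidden.norm(dim=-1)          # (batch,)
    logits = logits.scatter_add(1, targets.unsqueeze(1), -gold_penalty.unsqueeze(1))
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors:
vocab, d, batch = 1000, 256, 8
W = torch.randn(vocab, d, requires_grad=True)
h = torch.randn(batch, d, requires_grad=True)
y = torch.randint(0, vocab, (batch,))
adversarial_mle_loss(h, y, W).backward()
```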
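Similarly, the Transformer-Small dimensions quoted in the Experiment Setup row can be written down with `torch.nn.Transformer`. Only the layer counts and the 256-dimensional hidden size come from the excerpt; the head count, feed-forward width, and dropout below are assumed placeholders, not values from the paper.

```python
import torch.nn as nn

transformer_small = nn.Transformer(
    d_model=256,           # 256-dimensional hidden units per layer (quoted)
    nhead=4,               # assumed head count, not stated in the excerpt
    num_encoder_layers=4,  # 4-layer encoder (quoted)
    num_decoder_layers=4,  # 4-layer decoder (quoted)
    dim_feedforward=1024,  # assumed feed-forward width, not stated in the excerpt
    dropout=0.1,           # assumed; the quoted 0.4 applies to Transformer-Base
)
```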