Improving Neural Language Modeling via Adversarial Training
Authors: Dilin Wang, Chengyue Gong, Qiang Liu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that our method improves on the single model state-of-the-art results for language modeling on Penn Treebank (PTB) and Wikitext-2, achieving test perplexity scores of 46.01 and 38.65, respectively. ... We demonstrate the effectiveness of our method in two applications: neural language modeling and neural machine translation, and compare them with state-of-the-art architectures and learning methods. |
| Researcher Affiliation | Academia | 1Department of Computer Science, UT Austin. Correspondence to: Dilin Wang <dilin@cs.utexas.edu>, Chengyue Gong <cygong@cs.utexas.edu>. |
| Pseudocode | Yes | Algorithm 1 Adversarial MLE Training |
| Open Source Code | Yes | Our code is available at: https://github.com/ChengyueGongR/advsoft |
| Open Datasets | Yes | We test our method on three benchmark datasets: Penn Treebank (PTB), Wikitext-2 (WT2) and Wikitext-103 (WT103). ... The PTB corpus (Marcus et al., 1993) has been a standard dataset used for benchmarking language models. |
| Dataset Splits | Yes | PTB The PTB corpus (Marcus et al., 1993) has been a standard dataset used for benchmarking language models. It consists of 923k training, 73k validation and 82k test words. |
| Hardware Specification | No | The paper mentions 'GPUs' in a general sense but does not provide specific hardware details such as exact GPU or CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions using 'Tensor2Tensor (Vaswani et al., 2018)' for implementation but does not specify version numbers for this or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | We set α = 0.005 for the rest of experiments unless otherwise specified. ... For Transformer-Small, we stack a 4-layer encoder and a 4-layer decoder with 256-dimensional hidden units per layer. For Transformer-Base, we set the batch size to 6400 and the dropout rate to 0.4 following Wang et al. (2019). |
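
The rows quoting "Algorithm 1 Adversarial MLE Training" and the α = 0.005 setting refer to the paper's core training change: an adversarial perturbation of bounded norm is added to the output word embeddings during maximum-likelihood training. Below is a minimal Python (PyTorch-style) sketch of how such a perturbed loss could be computed for one batch; it is not the authors' released code. The function name, tensor shapes, and the closed-form perturbation direction −α·h/‖h‖ applied only to the target word's embedding are assumptions made for illustration.

```python
# Minimal sketch (not the released advsoft code) of adversarial MLE training
# for a softmax output layer, assuming a perturbation of norm `alpha` is added
# to the output embedding of the gold next word only.
import torch
import torch.nn.functional as F


def adversarial_softmax_nll(hidden, output_emb, targets, alpha=0.005):
    """Cross-entropy loss with an adversarial perturbation on the target embedding.

    hidden:     (batch, d)   hidden states from the language model
    output_emb: (vocab, d)   softmax output embedding matrix
    targets:    (batch,)     gold next-word indices
    alpha:      perturbation scale (the paper reports alpha = 0.005)
    """
    # Assumed closed-form worst-case direction: the negative normalized hidden
    # state, which maximally lowers the gold word's logit <h, w_y + delta>.
    delta = -alpha * hidden / (hidden.norm(dim=-1, keepdim=True) + 1e-12)

    logits = hidden @ output_emb.t()                 # (batch, vocab)
    # Perturb only the gold logit: <h, w_y + delta> = <h, w_y> + <h, delta>.
    gold_shift = (hidden * delta).sum(-1)            # (batch,)
    logits = logits.scatter_add(
        1, targets.unsqueeze(1), gold_shift.unsqueeze(1))
    return F.cross_entropy(logits, targets)
```

A training loop would use this loss in place of the standard cross-entropy; everything else (optimizer, model architecture such as the Transformer-Small configuration quoted above) stays unchanged.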