Representation Degeneration Problem in Training Natural Language Generation Models
Authors: Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on language modeling and machine translation show that our method can largely mitigate the representation degeneration problem and achieve better performance than baseline algorithms. We conduct experiments on two basic natural language generation tasks: language modeling and machine translation, and report the results in this section. |
| Researcher Affiliation | Collaboration | Jun Gao (1,2), Di He (3), Xu Tan (4), Tao Qin (4), Liwei Wang (3,5) & Tie-Yan Liu (4). 1: Department of Computer Science, University of Toronto (jungao@cs.toronto.edu); 2: Vector Institute, Canada; 3: Key Laboratory of Machine Perception, MOE, School of EECS, Peking University (dihe@pku.edu.cn, wanglw@cis.pku.edu.cn); 4: Microsoft Research ({xuta,taoqin,tyliu}@microsoft.com); 5: Center for Data Science, Peking University, Beijing Institute of Big Data Research |
| Pseudocode | No | The paper presents its method through mathematical formulations and theorems but does not include structured pseudocode or algorithm blocks. (A hedged sketch of the proposed cosine regularizer is given after this table.) |
| Open Source Code | No | Our implementation was based on open-sourced code by Merity et al. (2018). (Footnote 3: https://github.com/salesforce/awd-lstm-lm) This refers to the code of a baseline model, not the authors' own implementation of their proposed method (MLE-CosReg). |
| Open Datasets | Yes | We used WikiText-2 (WT2) corpus, which is popularly used in many previous works (Merity et al., 2017; Inan et al., 2017; Grave et al., 2017). We used the dataset from standard WMT 2014, which consists of 4.5 million English-German sentence pairs and has been widely used as the benchmark for neural machine translation (Vaswani et al., 2017; Gehring et al., 2017). |
| Dataset Splits | Yes | We used WikiText-2 (WT2) corpus, which is popularly used in many previous works (Merity et al., 2017; Inan et al., 2017; Grave et al., 2017). (Table 1 includes 'Validation' and 'Test' columns for language modeling results, implying standard splits.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions basing implementations on 'open-sourced code by Merity et al. (2018)' and 'official code from Transformer (Vaswani et al., 2018)' but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We used a three-layer LSTM with 1150 units in the hidden layer and set the size of embedding to be 400. We trained the model with Averaged Stochastic Gradient Descent. For our proposed MLE-CosReg loss, we found the hyperparameter γ is not very sensitive and we set it to 1 in the experiments. We followed the setting in Vaswani et al. (2017)... used the base version of Transformer... which has a 6-layer encoder and 6-layer decoder; the sizes of the hidden nodes and embeddings are set to 512. All the models were trained with the Adam optimizer, and all the hyperparameters were set to their defaults as in Vaswani et al. (2017). γ is set to 1 as in the language modeling experiments. (A hedged configuration summary of these settings is given below the table.) |
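
Since the paper provides no pseudocode, the snippet below is a minimal PyTorch sketch of an MLE loss combined with a cosine regularizer on the output word embeddings, in the spirit of the paper's MLE-CosReg objective. It assumes the regularizer is the mean cosine similarity over all distinct pairs of embedding rows, added to the cross-entropy loss with weight gamma (set to 1 in the paper's experiments); the function names and exact normalization are illustrative, not taken from the authors' code.

```python
# Minimal sketch of an MLE + cosine-similarity regularizer ("MLE-CosReg").
# Assumptions (not from the paper verbatim): the regularizer is the mean
# pairwise cosine similarity over rows of the output embedding matrix.
import torch
import torch.nn.functional as F


def cosine_regularizer(embedding_weight: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity over all distinct pairs of embedding rows."""
    w = F.normalize(embedding_weight, dim=-1)    # unit-norm embedding rows
    sim = w @ w.t()                              # |V| x |V| cosine matrix
    n = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarity terms
    return off_diag / (n * (n - 1))


def mle_cosreg_loss(logits, targets, embedding_weight, gamma=1.0):
    """Cross-entropy (MLE) loss plus gamma-weighted cosine regularization."""
    mle = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return mle + gamma * cosine_regularizer(embedding_weight)
```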
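
For quick reference, the reported hyperparameters can also be collected into a plain configuration summary. The numeric values are those quoted in the table above; the key names and structure are purely illustrative and do not reflect the authors' actual configuration files.

```python
# Hedged summary of the reported experimental settings as plain Python dicts.
LANGUAGE_MODELING_CONFIG = {
    "architecture": "LSTM",
    "num_layers": 3,
    "hidden_size": 1150,
    "embedding_size": 400,
    "optimizer": "Averaged SGD (ASGD)",
    "cosreg_gamma": 1.0,
    "dataset": "WikiText-2",
}

TRANSLATION_CONFIG = {
    "architecture": "Transformer (base)",
    "encoder_layers": 6,
    "decoder_layers": 6,
    "hidden_size": 512,
    "embedding_size": 512,
    "optimizer": "Adam (defaults as in Vaswani et al., 2017)",
    "cosreg_gamma": 1.0,
    "dataset": "WMT 2014 English-German",
}
```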