Model-Level Dual Learning

Authors: Yingce Xia, Xu Tan, Fei Tian, Tao Qin, Nenghai Yu, Tie-Yan Liu

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our algorithms achieve significant improvements on neural machine translation and sentiment analysis. Our model is verified on two different tasks, neural machine translation and sentiment analysis. We achieve promising results: (1) On IWSLT14 German-to-English translation, we improve the BLEU score from 32.85 to 35.19, obtaining a new record (see Table 5); (2) On WMT14 English-German translation, we improve the BLEU score from 28.4 to 28.9 (see Table 3); (3) A series of state-of-the-art results on NIST Chinese-to-English are obtained (see Table 2). (4) With supervised data only, on IMDB sentiment classification dataset, we lower the error rate from 9.20% to 6.96% with our proposed framework (see Table 6).
Researcher Affiliation | Collaboration | (1) School of Information Science and Technology, University of Science and Technology of China, Hefei, China; (2) Microsoft Research, Beijing, China.
Pseudocode | No | The paper describes the model in detail and provides mathematical formulations, but no structured pseudocode or algorithm blocks are presented.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for its methodology, nor a link to a repository containing it.
Open Datasets | Yes | We choose three widely used neural machine translation tasks: IWSLT 2014 German-English (briefly, IWSLT De-En), LDC Chinese-English (briefly, Zh-En) and WMT14 English-German (briefly, WMT En-De) as our testbeds. (1) For IWSLT De-En, we use the data extracted from the IWSLT 2014 evaluation campaign (Cettolo et al., 2014)... (2) For Zh-En... (Xia et al., 2017d)... (3) For WMT En-De... (Jean et al., 2015; Vaswani et al., 2017; Gehring et al., 2017), we use the training set consisting of roughly 4.5M sentence pairs. We use the benchmark movie review dataset IMDB (Maas et al., 2011) for sentiment analysis.
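For readers who want to pull the sentiment-analysis corpus quickly, the sketch below loads the Maas et al. (2011) IMDB reviews through the public tensorflow_datasets catalog. This loader is an assumption for convenience; the paper does not say how the authors obtained or preprocessed the data.

```python
# Convenience sketch (assumption, not from the paper): loading the IMDB movie
# review dataset (Maas et al., 2011) via the tensorflow_datasets catalog.
import tensorflow_datasets as tfds

# 25k labeled training reviews and 25k labeled test reviews.
train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)
test_ds = tfds.load("imdb_reviews", split="test", as_supervised=True)

for text, label in train_ds.take(1):
    # Peek at one (review, sentiment) pair; label is 0 (negative) or 1 (positive).
    print(label.numpy(), text.numpy()[:80])
```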
Dataset Splits | Yes | For IWSLT De-En, we use the data extracted from the IWSLT 2014 evaluation campaign (Cettolo et al., 2014), which consists of 153k/7k/7k sentence pairs as training/validation/test sets. (2) For Zh-En... NIST 2003 acts as the validation set... (3) For WMT En-De... We use newstest13 and newstest14 as the validation and test sets respectively. For validation purpose, we randomly split 3750 samples from the training set as the validation set.
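The IMDB validation split quoted above is a simple random hold-out of 3750 training samples; the sketch below shows one way such a split could be reproduced. It is illustrative only: the variable names, the fixed seed, and the list-of-samples input format are my assumptions, not details from the paper.

```python
# Illustrative sketch (not from the paper): hold out 3750 randomly chosen
# training samples as a validation set, as described for the IMDB experiments.
# `imdb_train` is a hypothetical list of (text, label) pairs loaded elsewhere.
import random

def split_train_valid(samples, valid_size=3750, seed=0):
    """Randomly hold out `valid_size` samples for validation; return (train, valid)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    valid_idx = set(indices[:valid_size])
    train = [s for i, s in enumerate(samples) if i not in valid_idx]
    valid = [s for i, s in enumerate(samples) if i in valid_idx]
    return train, valid

# Example usage: train_set, valid_set = split_train_valid(imdb_train)
```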
Hardware Specification | Yes | All the models are trained using NVIDIA Tesla M40 GPU. The whole training process takes three days on a single Titan XP GPU.
Software Dependencies | No | The paper mentions using the Adam and Adadelta optimizers and basing the model on the Transformer via TensorFlow's tensor2tensor project, but it does not specify versions for any software libraries or frameworks.
Experiment Setup | Yes | For all experiments, both the encoder and decoder contain six blocks. We use the transformer small setting for IWSLT De-En, whose word embedding dimension, hidden size and feed-forward filter size are 256, 256 and 1024 respectively, and the transformer big setting for Zh-En and WMT En-De, where the three corresponding dimensions are 1024, 1024 and 4096. The residual dropouts of the three tasks are 0.1, 0.3 and 0.3 respectively. We use weight tying (Press & Wolf, 2016) for the IWSLT De-En and WMT En-De translation... we use Adam (Kingma & Ba, 2014) as the optimizer, with initial learning rates 0.0002, β1 = 0.9 and β2 = 0.98. Each mini-batch in all tasks contains around 4096 tokens. For sentiment analysis: we set the embedding dimension (both word embedding and sentiment embedding) and the LSTM hidden layer size as 500 and 1024 respectively. The dropout rate is fixed as 0.5 for both embeddings and softmax.
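To make the quoted setup easier to scan, the hyperparameters are collected below into plain Python dictionaries. The key names and grouping are my own, not the authors' tensor2tensor configuration, and anything the quote does not state (e.g., weight tying for Zh-En) is marked as unspecified.

```python
# Hyperparameters from the Experiment Setup quote, gathered into config dicts
# for readability. Key names are assumptions; values come from the quote.

TRANSFORMER_SMALL = {          # IWSLT De-En
    "encoder_blocks": 6,
    "decoder_blocks": 6,
    "embedding_dim": 256,
    "hidden_size": 256,
    "filter_size": 1024,
    "residual_dropout": 0.1,
    "weight_tying": True,      # Press & Wolf (2016), stated for IWSLT De-En
}

TRANSFORMER_BIG = {            # Zh-En and WMT En-De
    "encoder_blocks": 6,
    "decoder_blocks": 6,
    "embedding_dim": 1024,
    "hidden_size": 1024,
    "filter_size": 4096,
    "residual_dropout": 0.3,
    "weight_tying": None,      # stated for WMT En-De only; unspecified for Zh-En
}

OPTIMIZER = {                  # Adam (Kingma & Ba, 2014)
    "initial_lr": 2e-4,
    "beta1": 0.9,
    "beta2": 0.98,
}
TOKENS_PER_BATCH = 4096        # approximate tokens per mini-batch, all tasks

SENTIMENT_ANALYSIS = {
    "embedding_dim": 500,      # word and sentiment embeddings
    "lstm_hidden_size": 1024,
    "dropout": 0.5,            # applied to embeddings and softmax
}
```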