Multi-task Sequence to Sequence Learning

Authors: Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser

ICLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.
Researcher Affiliation | Collaboration | Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser (Google Brain); lmthang@stanford.edu, {qvl,ilyasu,vinyals,lukaszkaiser}@google.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described, nor does it mention a specific repository link or explicit code release statement.
Open Datasets | Yes | We use the WMT 15 data (Bojar et al., 2015) for the English-German translation problem. ... the Penn Tree Bank (PTB) dataset (Marcus et al., 1993) and ... the high-confidence (HC) parse trees provided by Vinyals et al. (2015a). Lastly, for image caption generation, we use a dataset of image and caption pairs provided by Vinyals et al. (2015b).
Dataset Splits | Yes | We use newstest2013 (3000 sentences) as a validation set to select our hyperparameters... For testing, to be comparable with existing results in (Luong et al., 2015a), we use the filtered newstest2014 (2737 sentences) for the English-German translation task and newstest2015 (2169 sentences) for the German-English task. ... The two parsing tasks, however, are evaluated on the same validation (section 22) and test (section 23) sets from the PTB data.
Hardware Specification | No | The paper describes model architecture and training parameters (e.g., '4 LSTM layers each of which has 1000-dimensional cells and embeddings', 'mini-batch size of 128') but does not specify any hardware details like GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions 'Moses' for tokenization and 'SGD' for optimization, but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | In all experiments, following Sutskever et al. (2014) and Luong et al. (2015b), we train deep LSTM models as follows: (a) we use 4 LSTM layers each of which has 1000-dimensional cells and embeddings, (b) parameters are uniformly initialized in [-0.06, 0.06], (c) we use a mini-batch size of 128, (d) dropout is applied with probability of 0.2 over vertical connections (Pham et al., 2014), (e) we use SGD with a fixed learning rate of 0.7, (f) input sequences are reversed, and lastly, (g) we use a simple finetuning schedule: after x epochs, we halve the learning rate every y epochs. The values x and y are referred to as finetune start and finetune cycle in Table 1, together with the number of training epochs per task.
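
The quoted setup amounts to a small hyperparameter configuration plus a fixed-then-halved SGD learning-rate schedule. Below is a minimal Python sketch of one reading of that schedule; it is not the authors' code (none is released, per the Open Source Code row), and the names `HPARAMS`, `learning_rate`, and the concrete `finetune_start`/`finetune_cycle` values are illustrative placeholders rather than values from the paper, which reports the per-task x and y in its Table 1.

```python
# Sketch of the training hyperparameters and LR schedule quoted above.
# All numbers below marked "illustrative" are assumptions, not paper values.

HPARAMS = {
    "lstm_layers": 4,        # (a) 4 LSTM layers
    "hidden_size": 1000,     # (a) 1000-dimensional cells and embeddings
    "init_range": 0.06,      # (b) uniform init in [-0.06, 0.06]
    "batch_size": 128,       # (c) mini-batch size of 128
    "dropout": 0.2,          # (d) dropout on vertical connections
    "base_lr": 0.7,          # (e) fixed SGD learning rate
    "reverse_source": True,  # (f) input sequences are reversed
}


def learning_rate(epoch: int, base_lr: float,
                  finetune_start: int, finetune_cycle: int) -> float:
    """Return the SGD learning rate for a 1-indexed training epoch.

    One reading of schedule (g): keep base_lr for the first
    `finetune_start` (x) epochs, then halve it once every
    `finetune_cycle` (y) epochs after that.
    """
    if epoch <= finetune_start:
        return base_lr
    num_halvings = (epoch - finetune_start + finetune_cycle - 1) // finetune_cycle
    return base_lr * (0.5 ** num_halvings)


if __name__ == "__main__":
    # Illustrative values only: start halving after epoch 6, then every epoch.
    for epoch in range(1, 13):
        lr = learning_rate(epoch, HPARAMS["base_lr"],
                           finetune_start=6, finetune_cycle=1)
        print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

Running the sketch prints a learning rate of 0.7 for epochs 1-6 and then 0.35, 0.175, and so on, which is one way to interpret the "finetune start / finetune cycle" schedule described in the row above.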