Convolutional Sequence to Sequence Learning

Authors: Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin

ICML 2017

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on several large datasets for machine translation as well as summarization and compare to the current best architectures reported in the literature. On WMT'16 English-Romanian translation we achieve a new state of the art, outperforming the previous best result by 1.9 BLEU. |
| Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Jonas Gehring <jgehring@fb.com>, Michael Auli <michaelauli@fb.com>. |
| Pseudocode | No | The paper describes its methods using text and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and models are available at https://github.com/facebookresearch/fairseq. |
| Open Datasets | Yes | WMT'16 English-Romanian: we use the same data and pre-processing as Sennrich et al. (2016b)... WMT'14 English-German: we use the same setup as Luong et al. (2015)... Abstractive summarization: we train on the Gigaword corpus (Graff et al., 2003) and ... We evaluate on the DUC-2004 test data comprising 500 article-title pairs (Over et al., 2007). |
| Dataset Splits | Yes | In all setups a small subset of the training data serves as validation set (about 0.5-1%) for early stopping and learning rate annealing. (See the validation-split sketch below the table.) |
| Hardware Specification | Yes | All models are implemented in Torch (Collobert et al., 2011) and trained on a single Nvidia M40 GPU... we measure GPU speed on three generations of Nvidia cards: a GTX-1080ti, an M40 as well as an older K40 card. CPU timings are measured on one host with 48 hyper-threaded cores (Intel Xeon E5-2680 @ 2.50GHz) with 40 workers. |
| Software Dependencies | No | The paper states 'All models are implemented in Torch (Collobert et al., 2011)' but does not provide specific version numbers for Torch or any other software dependencies needed for replication. |
| Experiment Setup | Yes | We use 512 hidden units for both encoders and decoders... We train our convolutional models with Nesterov's accelerated gradient method... using a momentum value of 0.99 and renormalize gradients if their norm exceeds 0.1... We use a learning rate of 0.25... we use mini-batches of 64 sentences... we also apply dropout to the input of the convolutional blocks. (See the configuration sketch below the table.) |
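
To make the hyperparameters quoted in the Experiment Setup row concrete, here is a minimal training-step sketch. It assumes PyTorch rather than the authors' original Lua Torch/fairseq implementation; the placeholder model, the dropout rate, the batch keys, and the loss_fn argument are illustrative assumptions, while the 512 hidden units, Nesterov momentum of 0.99, learning rate of 0.25, gradient-norm threshold of 0.1, and mini-batch size of 64 come directly from the quote.

```python
# Minimal sketch (assumption: PyTorch, not the authors' Lua Torch/fairseq code)
# of the training configuration quoted in the Experiment Setup row.
import torch
import torch.nn as nn

# Hypothetical stand-in for one convolutional block; the real ConvS2S encoder and
# decoder (512 hidden units, gated convolutions, attention) live in fairseq.
model = nn.Sequential(
    nn.Dropout(p=0.2),                                 # dropout on the block input; the rate is an assumption
    nn.Conv1d(512, 2 * 512, kernel_size=3, padding=1),
    nn.GLU(dim=1),                                     # gated linear unit, output back to 512 channels
)

# Nesterov's accelerated gradient with momentum 0.99 and learning rate 0.25.
optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.99, nesterov=True)

def train_step(batch, loss_fn):
    """One update on a mini-batch of 64 sentences (data pipeline not shown)."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    # Renormalize gradients whenever their norm exceeds 0.1, as quoted above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.item()
```

The combination the quote describes pairs a high momentum and a fairly large learning rate with renormalization to a small gradient norm; the dropout rate itself is not given in the quoted text.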
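The Dataset Splits row describes holding out roughly 0.5-1% of the training data for early stopping and learning-rate annealing. Below is a minimal sketch of that protocol; the function names, the annealing factor, and the stopping floor are assumptions for illustration, not values taken from the quoted text.

```python
# Hedged sketch of the validation protocol from the Dataset Splits row:
# hold out ~0.5-1% of the training pairs, then use validation loss for
# learning-rate annealing and early stopping. Thresholds are assumptions.
import random

def hold_out_validation(pairs, fraction=0.01, seed=0):
    """Split (source, target) pairs, keeping ~1% as a validation set."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

def anneal_or_stop(valid_losses, lr, factor=0.1, lr_floor=1e-4):
    """Reduce the learning rate when validation loss stops improving;
    stop training once it falls below a small floor (values assumed)."""
    if len(valid_losses) >= 2 and valid_losses[-1] >= min(valid_losses[:-1]):
        lr *= factor
    return lr, lr < lr_floor
```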