Convolutional Sequence to Sequence Learning
Authors: Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on several large datasets for machine translation as well as summarization and compare to the current best architectures reported in the literature. On WMT'16 English-Romanian translation we achieve a new state of the art, outperforming the previous best result by 1.9 BLEU. |
| Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Jonas Gehring <jgehring@fb.com>, Michael Auli <michaelauli@fb.com>. |
| Pseudocode | No | The paper describes its methods using text and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and models are available at https://github.com/facebookresearch/fairseq. |
| Open Datasets | Yes | WMT'16 English-Romanian. We use the same data and pre-processing as Sennrich et al. (2016b)..., WMT'14 English-German. We use the same setup as Luong et al. (2015)..., Abstractive summarization. We train on the Gigaword corpus (Graff et al., 2003) and ... We evaluate on the DUC-2004 test data comprising 500 article-title pairs (Over et al., 2007) |
| Dataset Splits | Yes | In all setups a small subset of the training data serves as validation set (about 0.5-1%) for early stopping and learning rate annealing. |
| Hardware Specification | Yes | All models are implemented in Torch (Collobert et al., 2011) and trained on a single Nvidia M40 GPU... we measure GPU speed on three generations of Nvidia cards: a GTX-1080ti, an M40 as well as an older K40 card. CPU timings are measured on one host with 48 hyper-threaded cores (Intel Xeon E5-2680 @ 2.50GHz) with 40 workers. |
| Software Dependencies | No | The paper states 'All models are implemented in Torch (Collobert et al., 2011)' but does not provide specific version numbers for Torch or any other software dependencies needed for replication. |
| Experiment Setup | Yes | We use 512 hidden units for both encoders and decoders... We train our convolutional models with Nesterov's accelerated gradient method... using a momentum value of 0.99 and renormalize gradients if their norm exceeds 0.1... We use a learning rate of 0.25... we use mini-batches of 64 sentences... we also apply dropout to the input of the convolutional blocks. |
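
To make the quoted Experiment Setup row concrete, here is a minimal sketch of the optimizer and gradient handling it describes. This is an assumption-laden illustration, not the authors' code: the paper's models were implemented in Lua Torch, the `model` below is a stand-in for the ConvS2S encoder/decoder, the loss is a placeholder, and the dropout probability is invented since the quoted text does not give it.

```python
# Minimal PyTorch sketch of the quoted training configuration (assumption:
# the original models were implemented in Lua Torch, not PyTorch).
import torch
import torch.nn as nn

# Stand-in for the ConvS2S encoder/decoder; the paper uses 512 hidden units.
# Dropout on block inputs is mentioned in the quote, but its probability is
# not given there, so 0.1 is a placeholder value.
model = nn.Sequential(
    nn.Dropout(p=0.1),
    nn.Linear(512, 512),
)

criterion = nn.MSELoss()  # placeholder loss for illustration

# Nesterov's accelerated gradient with momentum 0.99 and learning rate 0.25.
optimizer = torch.optim.SGD(model.parameters(), lr=0.25,
                            momentum=0.99, nesterov=True)

def train_step(inputs, targets):
    """One update on a mini-batch (the paper uses 64 sentences per batch)."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Renormalize gradients whenever their norm exceeds 0.1.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.item()

# Illustrative mini-batch of 64 "sentences" represented as 512-d vectors.
loss = train_step(torch.randn(64, 512), torch.randn(64, 512))
```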