Pay Less Attention with Lightweight and Dynamic Convolutions

Authors: Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, Michael Auli

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU. We evaluate on three different tasks: machine translation, language modeling and abstractive summarization.
Researcher Affiliation | Collaboration | Felix Wu (Cornell University); Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli (Facebook AI Research)
Pseudocode | No | The paper describes its methods in prose and mathematical formulas but does not include structured pseudocode or algorithm blocks. A hedged sketch of the dynamic convolution appears after this table.
Open Source Code | Yes | Code and pre-trained models available at http://github.com/pytorch/fairseq (an illustrative loading example appears after this table).
Open Datasets | Yes | For WMT English to German (En-De) we replicate the setup of Vaswani et al. (2017), based on WMT'16 training data with 4.5M sentence pairs... For WMT English to French (En-Fr), we borrow the setup of Gehring et al. (2017) with 36M training sentence pairs from WMT'14... We evaluate on the large-scale Billion word dataset (Chelba et al., 2013)... CNN-DailyMail summarization task (Hermann et al., 2015; Nallapati et al., 2016).
Dataset Splits | Yes | Validate on newstest2013 and test on newstest2014... validate on newstest2012+2013 and test on newstest2014... Models are evaluated in terms of perplexity on the valid and test portions. Ablations are conducted on the validation set and we report the mean BLEU and standard deviation on this set.
Hardware Specification | Yes | We train the WMT models on 8 NVIDIA V100 GPUs... Speed results based on beam size 4, batch size 256 on an NVIDIA P100 GPU.
Software Dependencies | No | The paper mentions the 'fairseq re-implementation of the Transformer Big architecture (Ott et al., 2018)' and 'PyTorch', but gives no version numbers for these software components.
Experiment Setup | Yes | We use a dropout rate of 0.3 for WMT En-De and IWSLT De-En, 0.1 for WMT En-Fr, and 0.25 for WMT Zh-En. WMT models are optimized with Adam and a cosine learning rate schedule... learning rate is first linearly warmed up for 10K steps from 10^-7 to 10^-3... We use label smoothing with 0.1 weight for the uniform prior distribution over the vocabulary. A sketch of this optimization recipe appears after this table.
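Since the paper gives no pseudocode, the following is a minimal PyTorch sketch of the dynamic convolution it describes in formulas: a depthwise convolution whose kernel of width k is predicted from the current timestep by a linear projection, softmax-normalized over the kernel width, and shared across channel groups ("heads"). The class and argument names (DynamicConv1d, num_heads, kernel_size) are illustrative, not the paper's; the reference implementation lives in fairseq, and details such as causal versus centered padding and the GLU input projection are omitted here.

```python
# Hypothetical sketch of a dynamic convolution layer, assuming causal
# (decoder-style) padding. Names and shapes are illustrative only; see the
# fairseq repository for the reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, num_heads=8):
        super().__init__()
        assert channels % num_heads == 0
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # Predict one kernel of width kernel_size per head from each timestep.
        self.weight_proj = nn.Linear(channels, num_heads * kernel_size)

    def forward(self, x):
        # x: (batch, time, channels)
        B, T, C = x.size()
        H, K = self.num_heads, self.kernel_size
        # Per-timestep, per-head kernels, softmax-normalized over the width K.
        weights = F.softmax(self.weight_proj(x).view(B, T, H, K), dim=-1)
        # Gather a window of K current/past timesteps per position (causal pad).
        x_pad = F.pad(x, (0, 0, K - 1, 0))             # (B, T + K - 1, C)
        windows = x_pad.unfold(1, K, 1)                # (B, T, C, K)
        windows = windows.reshape(B, T, H, C // H, K)  # split channels into heads
        # Weighted sum over the window with a distinct kernel per timestep/head.
        out = torch.einsum('bthck,bthk->bthc', windows, weights)
        return out.reshape(B, T, C)

# Toy usage: batch of 2 sequences, length 5, 16 channels.
layer = DynamicConv1d(channels=16, kernel_size=3, num_heads=4)
y = layer(torch.randn(2, 5, 16))
print(y.shape)  # torch.Size([2, 5, 16])
```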
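As the released code is part of fairseq, pre-trained translation models can typically be loaded through torch.hub. The checkpoint identifier below is a placeholder, not a confirmed name from the paper or this report; torch.hub.list() enumerates what is actually published, and the repository's examples documentation describes the LightConv/DynamicConv checkpoints.

```python
# Illustrative only: loading a released pre-trained model via torch.hub.
# The model identifier is a placeholder -- check the names returned by
# torch.hub.list() and the fairseq examples documentation.
import torch

print(torch.hub.list('pytorch/fairseq'))  # enumerate published model names

# Placeholder identifier; replace with a DynamicConv checkpoint listed above.
model = torch.hub.load('pytorch/fairseq', 'dynamicconv.glu.wmt16.en-de',
                       tokenizer='moses', bpe='subword_nmt')
print(model.translate('Machine learning is great!'))
```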
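The quoted optimization recipe (Adam, linear warmup over 10K steps from 10^-7 to 10^-3, cosine decay, label smoothing of 0.1) can be summarized in a minimal sketch. The total step budget, the single-cycle cosine decay, and the Adam betas below are assumptions not quoted in this report; the paper's fairseq configuration is authoritative.

```python
# Minimal sketch of the quoted optimization recipe, under assumed settings
# (30K total steps, single cosine cycle, betas=(0.9, 0.98)).
import math
import torch

def lr_at(step, warmup=10_000, total=30_000, lr_min=1e-7, lr_max=1e-3):
    """Learning rate at a given update step: linear warmup, then cosine decay."""
    if step < warmup:
        return lr_min + (lr_max - lr_min) * step / warmup
    t = (step - warmup) / max(1, total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(512, 32_000)  # stand-in for the translation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98))
# Label smoothing with 0.1 weight on the uniform prior (requires PyTorch >= 1.10).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

for step in range(1, 101):  # toy training loop on random data
    for group in optimizer.param_groups:
        group['lr'] = lr_at(step)
    logits = model(torch.randn(8, 512))
    loss = criterion(logits, torch.randint(0, 32_000, (8,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```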