Pay Less Attention with Lightweight and Dynamic Convolutions

Authors: Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, Michael Auli

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU. We evaluate on three different tasks: machine translation, language modeling and abstractive summarization.
Researcher Affiliation | Collaboration | Felix Wu (Cornell University); Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli (Facebook AI Research)
Pseudocode | No | The paper describes its methods in prose and mathematical formulas but does not include structured pseudocode or algorithm blocks. A hedged sketch of the dynamic convolution appears after this table.
Open Source Code | Yes | Code and pre-trained models available at http://github.com/pytorch/fairseq (an illustrative loading example appears after this table).
Open Datasets | Yes | For WMT English to German (En-De) we replicate the setup of Vaswani et al. (2017), based on WMT'16 training data with 4.5M sentence pairs... For WMT English to French (En-Fr), we borrow the setup of Gehring et al. (2017) with 36M training sentence pairs from WMT'14... We evaluate on the large-scale Billion word dataset (Chelba et al., 2013)... CNN-DailyMail summarization task (Hermann et al., 2015; Nallapati et al., 2016).
Dataset Splits | Yes | Validate on newstest2013 and test on newstest2014... validate on newstest2012+2013 and test on newstest2014... Models are evaluated in terms of perplexity on the valid and test portions. Ablations are conducted on the validation set and we report the mean BLEU and standard deviation on this set.
Hardware Specification | Yes | We train the WMT models on 8 NVIDIA V100 GPUs... Speed results based on beam size 4, batch size 256 on an NVIDIA P100 GPU.
Software Dependencies | No | The paper mentions the 'fairseq re-implementation of the Transformer Big architecture (Ott et al., 2018)' and 'PyTorch', but gives no version numbers for these software components.
Experiment Setup | Yes | We use a dropout rate of 0.3 for WMT En-De and IWSLT De-En, 0.1 for WMT En-Fr, and 0.25 for WMT Zh-En. WMT models are optimized with Adam and a cosine learning rate schedule... learning rate is first linearly warmed up for 10K steps from 10^-7 to 10^-3... We use label smoothing with 0.1 weight for the uniform prior distribution over the vocabulary. A sketch of this optimization recipe appears after this table.
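Since the paper gives no pseudocode, the following is a minimal PyTorch sketch of the dynamic convolution it describes in formulas: a depthwise convolution whose kernel of width k is predicted from the current timestep by a linear projection, softmax-normalized over the kernel width, and shared across channel groups ("heads"). The class and argument names (DynamicConv1d, num_heads, kernel_size) are illustrative, not the paper's; the reference implementation lives in fairseq, and details such as causal versus centered padding and the GLU input projection are omitted here.

```python
# Hypothetical sketch of a dynamic convolution layer, assuming causal
# (decoder-style) padding. Names and shapes are illustrative only; see the
# fairseq repository for the reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, num_heads=8):
        super().__init__()
        assert channels % num_heads == 0
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # Predict one kernel of width kernel_size per head from each timestep.
        self.weight_proj = nn.Linear(channels, num_heads * kernel_size)

    def forward(self, x):
        # x: (batch, time, channels)
        B, T, C = x.size()
        H, K = self.num_heads, self.kernel_size
        # Per-timestep, per-head kernels, softmax-normalized over the width K.
        weights = F.softmax(self.weight_proj(x).view(B, T, H, K), dim=-1)
        # Gather a window of K current/past timesteps per position (causal pad).
        x_pad = F.pad(x, (0, 0, K - 1, 0))             # (B, T + K - 1, C)
        windows = x_pad.unfold(1, K, 1)                # (B, T, C, K)
        windows = windows.reshape(B, T, H, C // H, K)  # split channels into heads
        # Weighted sum over the window with a distinct kernel per timestep/head.
        out = torch.einsum('bthck,bthk->bthc', windows, weights)
        return out.reshape(B, T, C)

# Toy usage: batch of 2 sequences, length 5, 16 channels.
layer = DynamicConv1d(channels=16, kernel_size=3, num_heads=4)
y = layer(torch.randn(2, 5, 16))
print(y.shape)  # torch.Size([2, 5, 16])
```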
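As the released code is part of fairseq, pre-trained translation models can typically be loaded through torch.hub. The checkpoint identifier below is a placeholder, not a confirmed name from the paper or this report; torch.hub.list() enumerates what is actually published, and the repository's examples documentation describes the LightConv/DynamicConv checkpoints.

```python
# Illustrative only: loading a released pre-trained model via torch.hub.
# The model identifier is a placeholder -- check the names returned by
# torch.hub.list() and the fairseq examples documentation.
import torch

print(torch.hub.list('pytorch/fairseq'))  # enumerate published model names

# Placeholder identifier; replace with a DynamicConv checkpoint listed above.
model = torch.hub.load('pytorch/fairseq', 'dynamicconv.glu.wmt16.en-de',
                       tokenizer='moses', bpe='subword_nmt')
print(model.translate('Machine learning is great!'))
```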
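The quoted optimization recipe (Adam, linear warmup over 10K steps from 10^-7 to 10^-3, cosine decay, label smoothing of 0.1) can be summarized in a minimal sketch. The total step budget, the single-cycle cosine decay, and the Adam betas below are assumptions not quoted in this report; the paper's fairseq configuration is authoritative.

```python
# Minimal sketch of the quoted optimization recipe, under assumed settings
# (30K total steps, single cosine cycle, betas=(0.9, 0.98)).
import math
import torch

def lr_at(step, warmup=10_000, total=30_000, lr_min=1e-7, lr_max=1e-3):
    """Learning rate at a given update step: linear warmup, then cosine decay."""
    if step < warmup:
        return lr_min + (lr_max - lr_min) * step / warmup
    t = (step - warmup) / max(1, total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(512, 32_000)  # stand-in for the translation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98))
# Label smoothing with 0.1 weight on the uniform prior (requires PyTorch >= 1.10).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

for step in range(1, 101):  # toy training loop on random data
    for group in optimizer.param_groups:
        group['lr'] = lr_at(step)
    logits = model(torch.randn(8, 512))
    loss = criterion(logits, torch.randint(0, 32_000, (8,)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```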