Lite Transformer with Long-Short Range Attention

Authors: Zhanghao Wu*, Zhijian Liu*, Ji Lin, Yujun Lin, Song Han

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our Lite Transformer model offers significant improvements over the transformer on three language tasks: machine translation, abstractive summarization, and language modeling.
Researcher Affiliation | Academia | Zhanghao Wu^{1,2}, Zhijian Liu^1, Ji Lin^1, Yujun Lin^1, Song Han^1 (1: Massachusetts Institute of Technology; 2: Shanghai Jiao Tong University); {zhwu, zhijian, songhan}@mit.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code has been made available at https://github.com/mit-han-lab/lite-transformer.
Open Datasets | Yes | For IWSLT 14 German-English (De-En), we follow the settings in Grave et al. (2017) with 160K training sentence pairs and a 10K joint byte pair encoding (BPE) vocabulary (Sennrich et al., 2016) in lower case. For WMT English to German (En-De), we train the model on WMT 16 training data with 4.5M sentence pairs, validate on newstest2013, and test on newstest2014, the same as Wu et al. (2019b); the vocabulary is a 32K joint source and target BPE. For WMT English to French (En-Fr), we replicate the setup in Gehring et al. (2017) with 36M training sentence pairs from WMT 14, validate on newstest2012 and newstest2013, and test on newstest2014.
Dataset Splits | Yes | For WMT English to German (En-De), we train the model on WMT 16 training data with 4.5M sentence pairs, validate on newstest2013, and test on newstest2014, the same as Wu et al. (2019b). For WMT English to French (En-Fr), we replicate the setup in Gehring et al. (2017) with 36M training sentence pairs from WMT 14, validate on newstest2012 and newstest2013, and test on newstest2014.
Hardware Specification | Yes | We train WMT and summarization models on 16 NVIDIA RTX 2080Ti GPUs and IWSLT De-En on a single GPU for 50K steps.
Software Dependencies | No | The paper mentions software such as "fairseq's reimplementation (Ott et al., 2019)" and the "Adam optimizer", but it does not provide specific version numbers for these or other critical software dependencies (e.g., PyTorch, TensorFlow, Python version) needed for replication.
Experiment Setup | Yes | We use a dropout of 0.3 for both the WMT and IWSLT datasets and linearly scale down the dropout ratio when shrinking the dimension of the embeddings for the WMT datasets. Same as Wu et al. (2019b), we apply the Adam optimizer and a cosine learning rate schedule (Kingma & Ba, 2015; Loshchilov & Hutter, 2017) for the WMT models, where the learning rate is first warmed up linearly from 10^-7 to 10^-3 and then annealed with a single cosine cycle. For IWSLT De-En, we use inverse square root learning rate scheduling (Vaswani et al., 2017) with linear warm-up. We use the same training settings for summarization. For the language modeling task, the training settings are in line with Baevski & Auli (2019). We decrease the dropout ratio for the FFN layer by half in our Lite Transformer due to the flattened layer. ... Label smoothing of 0.1 is applied to the prior distribution over the vocabulary (Szegedy et al., 2016; Pereyra et al., 2017).
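
The Open Datasets and Dataset Splits rows above amount to a data configuration: corpus sizes, joint BPE vocabularies, and the newstest validation/test sets. The snippet below is a minimal sketch that collects those reported values into a Python dictionary for quick reference; the dictionary name, field names, and the None placeholders for details not stated in the quoted text are illustrative assumptions, not part of the paper or the official repository.

```python
# Hypothetical summary of the machine-translation data setup quoted above.
# Field names are illustrative; None marks details the quoted text omits.
MT_DATA_CONFIG = {
    "iwslt14_de_en": {
        "train_pairs": 160_000,
        "joint_bpe_vocab": 10_000,   # lower-cased, joint BPE
        "valid": None,               # not stated in the quoted text
        "test": None,
    },
    "wmt16_en_de": {
        "train_pairs": 4_500_000,
        "joint_bpe_vocab": 32_000,   # joint source/target BPE
        "valid": "newstest2013",
        "test": "newstest2014",
    },
    "wmt14_en_fr": {
        "train_pairs": 36_000_000,
        "joint_bpe_vocab": None,     # not stated in the quoted text
        "valid": ["newstest2012", "newstest2013"],
        "test": "newstest2014",
    },
}

if __name__ == "__main__":
    for name, cfg in MT_DATA_CONFIG.items():
        print(f"{name}: {cfg['train_pairs']:,} training pairs, "
              f"valid={cfg['valid']}, test={cfg['test']}")
```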
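
The Experiment Setup row above specifies three optimization ingredients: a linear warm-up from 10^-7 to 10^-3 followed by a single cosine annealing cycle for the WMT models, an inverse square-root schedule with linear warm-up for IWSLT De-En, and label smoothing of 0.1. The PyTorch sketch below renders those ingredients in stand-alone form; it is not the fairseq implementation the authors used, and the warm-up lengths, peak IWSLT learning rate, total step count, and final learning rate are assumed placeholders rather than values quoted from the paper.

```python
import math

import torch
import torch.nn.functional as F


def cosine_lr_with_warmup(step, warmup_steps, total_steps,
                          init_lr=1e-7, peak_lr=1e-3, final_lr=1e-7):
    """WMT recipe: linear warm-up from init_lr to peak_lr, then one cosine
    annealing cycle down to final_lr (final_lr is an assumed placeholder)."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def inverse_sqrt_lr_with_warmup(step, warmup_steps=4000, peak_lr=5e-4, init_lr=1e-7):
    """IWSLT De-En recipe (Vaswani et al., 2017): linear warm-up, then decay
    proportional to 1/sqrt(step); warmup_steps and peak_lr are assumed values."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


def label_smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy whose target distribution mixes the one-hot label with a
    uniform prior over the vocabulary, i.e. label smoothing of eps = 0.1."""
    lprobs = F.log_softmax(logits, dim=-1)
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    uniform = -lprobs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * uniform).mean()


# Usage sketch: drive Adam with the IWSLT schedule on a toy stand-in model.
vocab_size = 100
model = torch.nn.Linear(8, vocab_size)                     # stand-in for the NMT model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-7)  # Adam, as named in the paper
inputs = torch.randn(4, 8)
targets = torch.randint(0, vocab_size, (4,))
for step in range(1, 101):                                 # 100 toy steps, not 50K
    for group in optimizer.param_groups:
        group["lr"] = inverse_sqrt_lr_with_warmup(step)
    loss = label_smoothed_cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In fairseq terms these roughly correspond to the cosine and inverse_sqrt learning-rate schedulers and the label-smoothed cross-entropy criterion; the exact command-line configuration should be taken from the released repository rather than from this sketch.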