EL-Attention: Memory Efficient Lossless Attention for Generation

Authors: Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan, Ruofei Zhang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
Researcher Affiliation | Collaboration | 1. Microsoft, Redmond, WA, USA; 2. University of Science and Technology of China; 3. Microsoft Research Asia; 4. Microsoft, Sunnyvale, CA, USA.
Pseudocode | Yes | We present pseudocode to show the key difference between EL-attention and multi-head attention in Appendix A.
Open Source Code | Yes | Our code is open-sourced at https://github.com/microsoft/fastseq.
Open Datasets | Yes | SQuAD 1.1 (Rajpurkar et al., 2016) contains over 100K questions in 536 Wikipedia articles. XSum (Narayan et al., 2018) consists of online articles from the BBC. CNN/Daily Mail (Hermann et al., 2015) contains articles from CNN and Daily Mail newspapers.
Dataset Splits | Yes | SQuAD 1.1 (Rajpurkar et al., 2016) ... has 75722/10570/11877 samples in training/validation/test set. XSum (Narayan et al., 2018) ... There are 204017/11327/11333 samples in training/validation/test set. CNN/Daily Mail (Hermann et al., 2015) ... There are 287113/13368/11490 samples in training/validation/test set.
Hardware Specification | Yes | We conduct experiments on an NVIDIA Tesla V100 PCIe 16GB.
Software Dependencies | Yes | For the Transformer model and BART model, we use the implementations in fairseq (Ott et al., 2019) v0.9.0; for the GPT-2 model, we use Huggingface Transformers (Wolf et al., 2020) v3.0.2.
Experiment Setup | Yes | In the SQuAD 1.1 task, we set length penalty to 1.0, max input length to 512, and beam size to 4. In the XSum task, we use the parameters listed for BART, with length penalty 1.0, max input length 1024, max output length 60, min output length 10, and beam size 6. In the CNN/Daily Mail task, we conduct experiments for both BART and GPT-2. For the BART model, we follow their parameters, with length penalty 2.0, max input length 1024, max output length 140, min output length 55, and beam size 4. For the GPT-2 model, following their paper, we directly use the pretrained model checkpoint to generate summaries; max input length is 512 and max output length is set to 200.
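
The Pseudocode row above points to the paper's core trick: attention for generation can be computed without materializing per-head key/value caches, losslessly. Below is a minimal PyTorch sketch of that equivalence, written from the paper's description rather than taken from the authors' fastseq code; the weight names, shapes, and random inputs are illustrative assumptions.

```python
# Sketch (not the authors' implementation): standard multi-head attention vs.
# EL-attention for a single decoding step, showing they produce the same output
# while EL-attention only needs the raw hidden states x as its cache.
import torch

torch.manual_seed(0)
d_model, n_heads = 512, 8
d_head = d_model // n_heads
src_len = 1024                      # length of the cached source / past states

# Hypothetical per-head projection weights, stacked along the head dimension.
W_q = torch.randn(n_heads, d_model, d_head) / d_model ** 0.5
W_k = torch.randn(n_heads, d_model, d_head) / d_model ** 0.5
W_v = torch.randn(n_heads, d_model, d_head) / d_model ** 0.5
W_o = torch.randn(n_heads, d_head, d_model) / d_head ** 0.5  # per-head slice of the output projection

x = torch.randn(src_len, d_model)   # encoder output / past hidden states
q = torch.randn(1, d_model)         # current decoder step's query input

def mha_step(q, x):
    # Standard multi-head attention: the incremental cache holds per-head K and V,
    # i.e. 2 * n_heads * src_len * d_head floats per layer.
    K = torch.einsum('sd,hde->hse', x, W_k)            # cached
    V = torch.einsum('sd,hde->hse', x, W_v)            # cached
    Q = torch.einsum('td,hde->hte', q, W_q)
    attn = torch.softmax(Q @ K.transpose(1, 2) / d_head ** 0.5, dim=-1)
    ctx = attn @ V                                     # (n_heads, 1, d_head)
    return sum(ctx[h] @ W_o[h] for h in range(n_heads))

def el_attention_step(q, x):
    # EL-attention as described in the paper: cache only x itself
    # (src_len * d_model floats, shared by all heads).  W_k is folded into an
    # expanded per-head query, and W_v / W_o are applied after attending over raw x.
    Q = torch.einsum('td,hde->hte', q, W_q)
    Q_el = torch.einsum('hte,hde->htd', Q, W_k)        # expanded query, per head
    attn = torch.softmax(Q_el @ x.t() / d_head ** 0.5, dim=-1)
    ctx = attn @ x                                     # (n_heads, 1, d_model)
    out = torch.einsum('htd,hde->hte', ctx, W_v)
    return sum(out[h] @ W_o[h] for h in range(n_heads))

# "Lossless": both paths agree up to floating-point error.
print(torch.allclose(mha_step(q, x), el_attention_step(q, x), rtol=1e-3, atol=1e-4))
```

The memory saving comes from the cache shapes: the multi-head path stores two tensors of shape (n_heads, src_len, d_head) per layer, while the EL path reuses the single (src_len, d_model) hidden-state tensor that already exists.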
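For the CNN/Daily Mail BART settings quoted in the Experiment Setup row, one plausible way to reproduce the generation parameters with fairseq's BART hub interface is sketched below. This is not taken from the authors' scripts; the checkpoint name and the example input are assumptions, while beam size, length penalty, and min/max output lengths follow the row above.

```python
# Hedged sketch: decoding with fairseq's pretrained CNN/DailyMail BART checkpoint
# using the generation parameters reported in the Experiment Setup row.
import torch

bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.eval()
if torch.cuda.is_available():
    bart.cuda()

source = ["(CNN) -- An example news article to be summarized ..."]  # placeholder input
with torch.no_grad():
    summaries = bart.sample(
        source,
        beam=4,         # beam size 4
        lenpen=2.0,     # length penalty 2.0
        max_len_b=140,  # max output length 140
        min_len=55,     # min output length 55
    )
print(summaries[0])
```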