EL-Attention: Memory Efficient Lossless Attention for Generation
Authors: Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan, Ruofei Zhang
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss. |
| Researcher Affiliation | Collaboration | 1Microsoft, Redmond, WA, USA 2University of Science and Technology of China 3Microsoft Research Asia 4Microsoft, Sunnyvale, CA, USA. |
| Pseudocode | Yes | We present pseudocode to show the key difference between EL-attention and multi-head attention in Appendix A. |
| Open Source Code | Yes | Our code is open sourced in https://github.com/microsoft/fastseq. |
| Open Datasets | Yes | SQuAD 1.1 (Rajpurkar et al., 2016) contains over 100K questions in 536 Wikipedia articles. XSum (Narayan et al., 2018) consists of online articles from BBC. CNN/Daily Mail (Hermann et al., 2015) contains articles from CNN and Daily Mail newspapers. |
| Dataset Splits | Yes | SQuAD 1.1 (Rajpurkar et al., 2016) ... has 75722/10570/11877 samples in training/validation/test set. XSum (Narayan et al., 2018) ... There are 204017/11327/11333 samples in training/validation/test set. CNN/Daily Mail (Hermann et al., 2015) ... There are 287113/13368/11490 samples in training/validation/test set. |
| Hardware Specification | Yes | We conduct experiments on a NVIDIA Tesla V100 PCIe 16GB. |
| Software Dependencies | Yes | For Transformer model and BART model, we use implementations in Fairseq (Ott et al., 2019) v0.9.0; for GPT-2 model, we use Huggingface Transformers (Wolf et al., 2020) v3.0.2. |
| Experiment Setup | Yes | In the SQuAD 1.1 task, we set length penalty to 1.0, max input length to 512, and beam size to 4. In the XSum task, we use the parameters listed in BART2, with length penalty 1.0, max input length 1024, max output length 60, min output length 10, and beam size 6. In the CNN/Daily Mail task, we conduct experiments for both BART and GPT-2. For the BART model, we follow their parameters, with length penalty 2.0, max input length 1024, max output length 140, min output length 55, and beam size 4. For the GPT-2 model, following their paper, we directly use the pretrained model checkpoint to generate summaries, with max input length 512 and max output length 200. |
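
To make the decoding configuration above concrete, here is a minimal sketch of the CNN/Daily Mail BART setting (beam size 4, length penalty 2.0, output length between 55 and 140, max input length 1024) expressed as a Huggingface Transformers `generate` call. This is an illustrative stand-in only: the paper runs BART through Fairseq v0.9.0 with the EL-attention implementation from https://github.com/microsoft/fastseq, and the `facebook/bart-large-cnn` checkpoint used here is an assumption, not something stated in the paper.

```python
# Illustrative sketch of the CNN/Daily Mail decoding parameters reported above.
# Assumption: Huggingface Transformers with the facebook/bart-large-cnn checkpoint;
# the paper itself runs BART via Fairseq v0.9.0 plus the fastseq/EL-attention code.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "CNN/Daily Mail style news article text goes here ..."

# Max input length 1024, as in the paper's BART CNN/Daily Mail setting.
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")

# Beam size 4, length penalty 2.0, output length constrained to [55, 140] tokens.
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    length_penalty=2.0,
    min_length=55,
    max_length=140,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

The other settings in the table map the same way: beam size 6, length penalty 1.0, and output length 10 to 60 for XSum, and beam size 4, length penalty 1.0, and max input length 512 for SQuAD 1.1 question generation.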