Poolingformer: Long Document Modeling with Pooling Attention
Authors: Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, Weizhu Chen
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate Poolingformer on two long sequence QA tasks: the monolingual NQ and the multilingual TyDi QA. Experimental results show that Poolingformer sits atop three official leaderboards measured by F1, outperforming previous state-of-the-art models by 1.9 points (79.8 vs. 77.9) on NQ long answer, 1.9 points (79.5 vs. 77.6) on TyDi QA passage answer, and 1.6 points (67.6 vs. 66.0) on TyDi QA minimal answer. We further evaluate Poolingformer on a long sequence summarization task. Experimental results on the arXiv benchmark continue to demonstrate its superior performance. |
| Researcher Affiliation | Collaboration | 1. College of Computer Science, Sichuan University; 2. During internship at MSRA; 3. Microsoft Research Asia; 4. Microsoft Azure AI; 5. University of Science and Technology of China. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its source code. |
| Open Datasets | Yes | For QA, we report the results on the monolingual Natural Questions (NQ) and the multilingual TyDi QA. For long document summarization, we report the results on the arXiv dataset (Cohan et al., 2018). Natural Questions: This dataset collected real questions in Google's search engine. Each question is paired with a Wikipedia page. [...] https://ai.google.com/research/NaturalQuestions/dataset. TyDi QA: TyDi QA is a multilingual question answering dataset [...] https://ai.google.com/research/tydiqa. arXiv: arXiv (Cohan et al., 2018) is a long document summarization dataset collected from the scientific repository arxiv.org. |
| Dataset Splits | Yes | For NQ and TyDi QA, we split documents into multiple spans with a sliding window approach (Alberti et al., 2019). The size and stride of the sliding window are set to 4,096 and 1,568, respectively. Each instance is formed by a start placeholder, a question, and a document span. The question and the document span are separated by a special placeholder. Since many instances contain no answer, the numbers of negative and positive instances are imbalanced. We follow Liu et al. (2020) to sub-sample negative instances during training; the sub-sampling ratio is set to 0.5. (A preprocessing sketch based on this setup appears after the table.) |
| Hardware Specification | Yes | For all experiments, we use 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions 'Huggingface Transformers (Wolf et al., 2020) and Fairseq (Ott et al., 2019)' and 'Apex' but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | The window sizes of the first-level and second-level attention are set to 128 and 512, respectively. The pooling kernel size and stride are set to 5 and 4, respectively. We use the Adam optimizer (Kingma & Ba, 2015) with linear learning rate decay. The batch size, the number of training epochs, the learning rate, and the learning rate warmup proportion are set to 64, 2, 2×10⁻⁵, and 0.1, respectively. (A configuration sketch based on these hyperparameters appears after the table.) |
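
The Dataset Splits row describes a sliding-window preprocessing step (window 4,096, stride 1,568, negative sub-sampling ratio 0.5). Below is a minimal Python sketch of that step under stated assumptions: the token ids, the placeholder ids, and the `answer_range` argument are hypothetical stand-ins for illustration, not the paper's actual preprocessing code.

```python
import random


def split_into_spans(question_ids, doc_ids, answer_range=None,
                     window=4096, stride=1568, neg_keep_ratio=0.5,
                     cls_id=0, sep_id=1):
    """Sliding-window splitting, following the setup quoted above.

    Each instance is [start placeholder] + question + [separator] + doc span;
    spans that do not cover the answer are kept with probability
    `neg_keep_ratio` (0.5 in the paper).
    """
    instances = []
    span_budget = window - len(question_ids) - 2  # reserve the two placeholders
    for start in range(0, max(len(doc_ids) - span_budget, 0) + 1, stride):
        end = start + span_budget
        span = doc_ids[start:end]
        is_positive = (answer_range is not None
                       and start <= answer_range[0] and answer_range[1] <= end)
        if not is_positive and random.random() > neg_keep_ratio:
            continue  # sub-sample negative (no-answer) spans
        instances.append([cls_id] + question_ids + [sep_id] + span)
    return instances


# Hypothetical usage: a 20-token question, a 10,000-token document,
# and an answer located at token positions 5,000-5,010.
spans = split_into_spans(list(range(20)), list(range(10000)),
                         answer_range=(5000, 5010))
```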
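The Experiment Setup row lists the two-level window sizes, the pooling kernel/stride, and the fine-tuning hyperparameters. The sketch below illustrates them with PyTorch and Huggingface Transformers; the mean-pooling choice, the head dimension, the stand-in model, and `steps_per_epoch` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from transformers import get_linear_schedule_with_warmup

# Two-level window and pooling settings from the Experiment Setup row.
FIRST_LEVEL_WINDOW = 128      # first-level sliding-window attention
SECOND_LEVEL_WINDOW = 512     # second-level window whose keys/values are pooled
POOL_KERNEL, POOL_STRIDE = 5, 4

# Second-level compression: pool the keys of a 512-token window down to
# roughly 128 positions. Mean pooling is used here only as an example of a
# pooling function; the head dimension (64) is likewise an assumption.
keys = torch.randn(1, 64, SECOND_LEVEL_WINDOW)   # (batch, head_dim, window)
pooled_keys = F.avg_pool1d(keys, kernel_size=POOL_KERNEL, stride=POOL_STRIDE)
print(pooled_keys.shape)                          # torch.Size([1, 64, 127])

# Optimizer and linear-decay schedule with the quoted fine-tuning settings.
BATCH_SIZE, EPOCHS, LR, WARMUP_PROPORTION = 64, 2, 2e-5, 0.1
model = torch.nn.Linear(8, 2)                     # stand-in for the fine-tuned model
steps_per_epoch = 1000                            # assumed; depends on dataset size
total_steps = EPOCHS * steps_per_epoch
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(WARMUP_PROPORTION * total_steps),
    num_training_steps=total_steps,
)
```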