PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
Authors: Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Zhen-Hua Ling
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy, while being only slightly slower than the fastest model, FNet, across all sequence lengths measured on GPUs. We also conduct systematic studies on the transfer learning capability of PoNet and observe that PoNet achieves 95.7% of the accuracy of BERT on the GLUE benchmark, outperforming FNet by 4.5% relative. Comprehensive ablation analysis demonstrates effectiveness of the designed multi-granularity pooling and pooling fusion for token mixing in long sequences and efficacy of the designed pre-training tasks for PoNet to learn transferable contextualized language representations. |
| Researcher Affiliation | Collaboration | Chao-Hong Tan1, Qian Chen2, Wen Wang2, Qinglin Zhang2, Siqi Zheng2, Zhen-Hua Ling1; 1National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China; 2Speech Lab, Alibaba Group; 1chtan@mail.ustc.edu.cn, zhling@ustc.edu.cn; 2{tanqing.cq, w.wang, qinglin.zql, zsq174630}@alibaba-inc.com |
| Pseudocode | No | The paper describes its methods using mathematical equations and prose, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is available at https://github.com/lxchtan/PoNet. |
| Open Datasets | Yes | All data used in the experiments in this paper are open source. Readers can refer to the original papers for details of the datasets, which are cited in our paper. |
| Dataset Splits | Yes | Table 3 shows the results for the best base learning rate (no early stopping) on the GLUE Validation split (see Appendix A.3 for more details), providing a fair comparison since all three models are pre-trained and fine-tuned with the same pre-training data (5GB data)/tasks/hyperparameters with 340K steps. |
| Hardware Specification | Yes | Table 2 compares the GPU training speed and peak memory consumption of PoNet to Transformer, Performer, Nyströmformer, and FNet on a single NVIDIA V100 chip, on input sequence lengths from 512 up to 16384. |
| Software Dependencies | No | The paper mentions using a PyTorch codebase and cites Wolf et al. (2020) for it, but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | All baseline models and PoNet use the same Base model configuration as BERT-Base (Devlin et al., 2019). PoNet-Base has 124M parameters (see the first paragraph in Appendix A for more details). Table 6: Detailed hyperparameter settings for the pre-training and fine-tuning experiments. |
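
Note: the Research Type and Experiment Setup rows above reference PoNet's multi-granularity pooling and pooling fusion for token mixing. The snippet below is a minimal PyTorch sketch of pooling-based token mixing at global, segment, and local granularities; the module name, the per-granularity linear projections, and the plain sum fusion are illustrative assumptions, not the authors' implementation (see https://github.com/lxchtan/PoNet for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityPoolingSketch(nn.Module):
    """Illustrative sketch of pooling-based token mixing (global / segment /
    local granularities) in the spirit of PoNet. NOT the authors' code."""

    def __init__(self, hidden_size: int, local_window: int = 3):
        super().__init__()
        assert local_window % 2 == 1, "odd window keeps the sequence length"
        self.local_window = local_window
        # Separate projections per granularity (an assumption for illustration).
        self.proj_global = nn.Linear(hidden_size, hidden_size)
        self.proj_segment = nn.Linear(hidden_size, hidden_size)
        self.proj_local = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden); segment_ids: (batch, seq_len), dtype long.
        b, n, d = x.shape
        idx = segment_ids.unsqueeze(-1).expand(-1, -1, d)

        # Global granularity: average-pool over the sequence, broadcast to every token.
        g = self.proj_global(x).mean(dim=1, keepdim=True).expand(b, n, d)

        # Segment granularity: max-pool within each segment, then gather the
        # result back so each token sees its own segment summary.
        s_in = self.proj_segment(x)
        num_seg = int(segment_ids.max()) + 1
        seg_max = s_in.new_full((b, num_seg, d), float("-inf"))
        seg_max = seg_max.scatter_reduce(1, idx, s_in, reduce="amax")
        s = seg_max.gather(1, idx)

        # Local granularity: sliding-window max pooling along the sequence.
        l = F.max_pool1d(
            self.proj_local(x).transpose(1, 2),
            kernel_size=self.local_window,
            stride=1,
            padding=self.local_window // 2,
        ).transpose(1, 2)

        # Fusion: a plain sum here; the paper's pooling fusion is more elaborate.
        return g + s + l
```

For example, `MultiGranularityPoolingSketch(768)(torch.randn(2, 512, 768), torch.zeros(2, 512, dtype=torch.long))` returns a `(2, 512, 768)` tensor. Each pooling path costs roughly linear time in sequence length, which is the property the speed comparison in Table 2 of the paper is concerned with.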