PoNet: Pooling Network for Efficient Token Mixing in Long Sequences

Authors: Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Zhen-Hua Ling

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy, while being only slightly slower than the fastest model, FNet, across all sequence lengths measured on GPUs. We also conduct systematic studies on the transfer learning capability of PoNet and observe that PoNet achieves 95.7% of the accuracy of BERT on the GLUE benchmark, outperforming FNet by 4.5% relative. Comprehensive ablation analysis demonstrates effectiveness of the designed multi-granularity pooling and pooling fusion for token mixing in long sequences and efficacy of the designed pre-training tasks for PoNet to learn transferable contextualized language representations.
Researcher Affiliation | Collaboration | Chao-Hong Tan1, Qian Chen2, Wen Wang2, Qinglin Zhang2, Siqi Zheng2, Zhen-Hua Ling1; 1National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China; 2Speech Lab, Alibaba Group; 1chtan@mail.ustc.edu.cn, zhling@ustc.edu.cn; 2{tanqing.cq, w.wang, qinglin.zql, zsq174630}@alibaba-inc.com
Pseudocode | No | The paper describes its methods using mathematical equations and prose, but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation is available at https://github.com/lxchtan/PoNet.
Open Datasets | Yes | All data used in the experiments in this paper are open source. Readers can refer to the original papers for details of the datasets, which are cited in our paper.
Dataset Splits | Yes | Table 3 shows the results for the best base learning rate (no early stopping) on the GLUE Validation split (see Appendix A.3 for more details), providing a fair comparison since all three models are pre-trained and fine-tuned with the same pre-training data (5GB), tasks, and hyperparameters for 340K steps.
Hardware Specification | Yes | Table 2 compares the GPU training speed and peak memory consumption of PoNet to Transformer, Performer, Nyströmformer, and FNet on a single NVIDIA V100 chip, on input sequence lengths from 512 up to 16384.
Software Dependencies | No | The paper mentions using the PyTorch codebase of Wolf et al. (2020), but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | All baseline models and PoNet use the same Base model configuration as BERT-Base (Devlin et al., 2019). PoNet-Base has 124M parameters (see the first paragraph in Appendix A for more details). Table 6 lists detailed hyperparameter settings for the pre-training and fine-tuning experiments.
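To make the "multi-granularity pooling and pooling fusion" mentioned in the Research Type row concrete, below is a minimal PyTorch sketch of pooling-based token mixing at three granularities (global mean pooling, segment-level max pooling, and local sliding-window max pooling), fused by addition. The class name MultiGranularityPooling, the per-granularity linear projections, the segment size, and the additive fusion are illustrative assumptions; they do not reproduce the paper's exact equations or the released PoNet code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityPooling(nn.Module):
    """Illustrative token mixer fusing global, segment, and local pooling.

    A simplified sketch of the idea behind pooling-based token mixing,
    not the paper's exact formulation.
    """
    def __init__(self, hidden_size: int, segment_size: int = 32, local_window: int = 3):
        super().__init__()
        self.segment_size = segment_size
        self.local_window = local_window  # odd window assumed
        # Separate projections per granularity (an assumption for this sketch).
        self.proj_g = nn.Linear(hidden_size, hidden_size)
        self.proj_s = nn.Linear(hidden_size, hidden_size)
        self.proj_l = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        b, n, d = x.shape

        # Global granularity: average over all tokens, broadcast back to each position.
        g = self.proj_g(x).mean(dim=1, keepdim=True).expand(b, n, d)

        # Segment granularity: max-pool within fixed-size segments, broadcast back.
        s = self.proj_s(x)
        pad = (self.segment_size - n % self.segment_size) % self.segment_size
        s = F.pad(s, (0, 0, 0, pad))
        seg = s.reshape(b, -1, self.segment_size, d).amax(dim=2)      # (b, n_seg, d)
        s = seg.repeat_interleave(self.segment_size, dim=1)[:, :n]    # (b, n, d)

        # Local granularity: sliding-window max pooling over neighboring tokens.
        l = self.proj_l(x).transpose(1, 2)                             # (b, d, n)
        l = F.max_pool1d(l, kernel_size=self.local_window, stride=1,
                         padding=self.local_window // 2).transpose(1, 2)

        # Pooling fusion: combine the three granularities (here, by addition).
        return g + s + l

For example, MultiGranularityPooling(768)(torch.randn(2, 512, 768)) mixes tokens across a 512-token sequence; in this sketch the cost grows roughly linearly with sequence length, which is the property the paper's speed comparison against attention-based models targets.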
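Since the Dataset Splits row notes that GLUE results are reported on the validation split, the following is a short hedged example of obtaining that split with the HuggingFace datasets library. The paper does not specify its data-loading code, so the task choice (SST-2) and the library are assumptions.

from datasets import load_dataset

# Load one GLUE task (SST-2 as an example) and select its splits.
# Fine-tuning uses the train split; reported accuracy uses the validation split.
sst2 = load_dataset("glue", "sst2")
train_split = sst2["train"]
eval_split = sst2["validation"]
print(len(train_split), len(eval_split))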
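The Hardware Specification row cites speed and peak-memory measurements on a single NVIDIA V100 for sequence lengths from 512 to 16384. Below is a minimal sketch of how such measurements are commonly taken in PyTorch; the batch size, step count, and the use of the pooling sketch above as the probed model are assumptions, not the paper's benchmark harness.

import time
import torch

def benchmark(model, seq_len: int, batch_size: int = 8, steps: int = 20,
              hidden: int = 768, device: str = "cuda"):
    """Rough GPU speed / peak-memory probe (illustrative only)."""
    model = model.to(device).train()
    x = torch.randn(batch_size, seq_len, hidden, device=device)

    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(steps):
        out = model(x)
        out.sum().backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize(device)

    steps_per_sec = steps / (time.time() - start)
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return steps_per_sec, peak_mb

# Example: probe the pooling sketch above at several sequence lengths.
# for n in (512, 1024, 2048, 4096):
#     print(n, benchmark(MultiGranularityPooling(768), n))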
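For the Experiment Setup row, the BERT-Base configuration referenced there (Devlin et al., 2019) corresponds to 12 layers, hidden size 768, and 12 attention heads. A hedged sketch of instantiating that configuration with the transformers library follows; the library choice is an assumption, since the paper only states that the Base configuration is shared across all compared models.

from transformers import BertConfig

# BERT-Base-sized configuration (Devlin et al., 2019): 12 layers, hidden size 768,
# 12 attention heads, 3072-dim feed-forward. The paper states all baseline models
# and PoNet share this Base configuration.
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
print(config)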