PoNet: Pooling Network for Efficient Token Mixing in Long Sequences

Authors: Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Zhen-Hua Ling

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy, while being only slightly slower than the fastest model, FNet, across all sequence lengths measured on GPUs. We also conduct systematic studies on the transfer learning capability of PoNet and observe that PoNet achieves 95.7% of the accuracy of BERT on the GLUE benchmark, outperforming FNet by 4.5% relative. Comprehensive ablation analysis demonstrates effectiveness of the designed multi-granularity pooling and pooling fusion for token mixing in long sequences and efficacy of the designed pre-training tasks for PoNet to learn transferable contextualized language representations.
Researcher Affiliation | Collaboration | Chao-Hong Tan1, Qian Chen2, Wen Wang2, Qinglin Zhang2, Siqi Zheng2, Zhen-Hua Ling1; 1National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China; 2Speech Lab, Alibaba Group; 1chtan@mail.ustc.edu.cn, zhling@ustc.edu.cn; 2{tanqing.cq, w.wang, qinglin.zql, zsq174630}@alibaba-inc.com
Pseudocode | No | The paper describes its methods using mathematical equations and prose, but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation is available at https://github.com/lxchtan/PoNet.
Open Datasets | Yes | All data used in the experiments in this paper are open source. Readers can refer to the original papers for details of the datasets, which are cited in our paper.
Dataset Splits | Yes | Table 3 shows the results for the best base learning rate (no early stopping) on the GLUE Validation split (see Appendix A.3 for more details), providing a fair comparison since all three models are pre-trained and fine-tuned with the same pre-training data (5GB), tasks, and hyperparameters for 340K steps.
Hardware Specification | Yes | Table 2 compares the GPU training speed and peak memory consumption of PoNet to Transformer, Performer, Nyströmformer, and FNet on a single NVIDIA V100 chip, on input sequence lengths from 512 up to 16384.
Software Dependencies | No | The paper mentions using the PyTorch codebase of Wolf et al. (2020), but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | All baseline models and PoNet use the same Base model configuration as BERT-Base (Devlin et al., 2019). PoNet-Base has 124M parameters (see the first paragraph in Appendix A for more details). Table 6 lists detailed hyperparameter settings for the pre-training and fine-tuning experiments.
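To make the "multi-granularity pooling and pooling fusion" mentioned in the Research Type row concrete, below is a minimal PyTorch sketch of pooling-based token mixing at three granularities (global mean pooling, segment-level max pooling, and local sliding-window max pooling), fused by addition. The class name MultiGranularityPooling, the per-granularity linear projections, the segment size, and the additive fusion are illustrative assumptions; they do not reproduce the paper's exact equations or the released PoNet code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityPooling(nn.Module):
    """Illustrative token mixer fusing global, segment, and local pooling.

    A simplified sketch of the idea behind pooling-based token mixing,
    not the paper's exact formulation.
    """
    def __init__(self, hidden_size: int, segment_size: int = 32, local_window: int = 3):
        super().__init__()
        self.segment_size = segment_size
        self.local_window = local_window  # odd window assumed
        # Separate projections per granularity (an assumption for this sketch).
        self.proj_g = nn.Linear(hidden_size, hidden_size)
        self.proj_s = nn.Linear(hidden_size, hidden_size)
        self.proj_l = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        b, n, d = x.shape

        # Global granularity: average over all tokens, broadcast back to each position.
        g = self.proj_g(x).mean(dim=1, keepdim=True).expand(b, n, d)

        # Segment granularity: max-pool within fixed-size segments, broadcast back.
        s = self.proj_s(x)
        pad = (self.segment_size - n % self.segment_size) % self.segment_size
        s = F.pad(s, (0, 0, 0, pad))
        seg = s.reshape(b, -1, self.segment_size, d).amax(dim=2)      # (b, n_seg, d)
        s = seg.repeat_interleave(self.segment_size, dim=1)[:, :n]    # (b, n, d)

        # Local granularity: sliding-window max pooling over neighboring tokens.
        l = self.proj_l(x).transpose(1, 2)                             # (b, d, n)
        l = F.max_pool1d(l, kernel_size=self.local_window, stride=1,
                         padding=self.local_window // 2).transpose(1, 2)

        # Pooling fusion: combine the three granularities (here, by addition).
        return g + s + l

For example, MultiGranularityPooling(768)(torch.randn(2, 512, 768)) mixes tokens across a 512-token sequence; in this sketch the cost grows roughly linearly with sequence length, which is the property the paper's speed comparison against attention-based models targets.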
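Since the Dataset Splits row notes that GLUE results are reported on the validation split, the following is a short hedged example of obtaining that split with the HuggingFace datasets library. The paper does not specify its data-loading code, so the task choice (SST-2) and the library are assumptions.

from datasets import load_dataset

# Load one GLUE task (SST-2 as an example) and select its splits.
# Fine-tuning uses the train split; reported accuracy uses the validation split.
sst2 = load_dataset("glue", "sst2")
train_split = sst2["train"]
eval_split = sst2["validation"]
print(len(train_split), len(eval_split))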
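The Hardware Specification row cites speed and peak-memory measurements on a single NVIDIA V100 for sequence lengths from 512 to 16384. Below is a minimal sketch of how such measurements are commonly taken in PyTorch; the batch size, step count, and the use of the pooling sketch above as the probed model are assumptions, not the paper's benchmark harness.

import time
import torch

def benchmark(model, seq_len: int, batch_size: int = 8, steps: int = 20,
              hidden: int = 768, device: str = "cuda"):
    """Rough GPU speed / peak-memory probe (illustrative only)."""
    model = model.to(device).train()
    x = torch.randn(batch_size, seq_len, hidden, device=device)

    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(steps):
        out = model(x)
        out.sum().backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize(device)

    steps_per_sec = steps / (time.time() - start)
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return steps_per_sec, peak_mb

# Example: probe the pooling sketch above at several sequence lengths.
# for n in (512, 1024, 2048, 4096):
#     print(n, benchmark(MultiGranularityPooling(768), n))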
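For the Experiment Setup row, the BERT-Base configuration referenced there (Devlin et al., 2019) corresponds to 12 layers, hidden size 768, and 12 attention heads. A hedged sketch of instantiating that configuration with the transformers library follows; the library choice is an assumption, since the paper only states that the Base configuration is shared across all compared models.

from transformers import BertConfig

# BERT-Base-sized configuration (Devlin et al., 2019): 12 layers, hidden size 768,
# 12 attention heads, 3072-dim feed-forward. The paper states all baseline models
# and PoNet share this Base configuration.
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
print(config)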