Sparse Modular Activation for Efficient Sequence Modeling
Authors: Liliang Ren, Yang Liu, Shuohang Wang, Yichong Xu, Chenguang Zhu, ChengXiang Zhai
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to show that SeqBoat has significantly better quality-efficiency trade-off than state-of-the-art hybrid models on a wide range of tasks, including Long Range Arena (LRA) [TDA+20], speech classification [War18] and language modeling [Hut06]. |
| Researcher Affiliation | Collaboration | Liliang Ren¹, Yang Liu², Shuohang Wang², Yichong Xu², Chenguang Zhu², ChengXiang Zhai¹; ¹University of Illinois at Urbana-Champaign, ²Microsoft |
| Pseudocode | Yes | The PyTorch-like [PGM+19] code snippets of the compress and extract operators are provided in Appendix A.1, with efficient support for batched sequences using the scatter operation. Listing 1: PyTorch-like code snippet for the Compress operator. Listing 2: PyTorch-like code snippet for the Extract operator. (A hedged sketch of such operators follows this table.) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/renll/SeqBoat. |
| Open Datasets | Yes | We conduct comprehensive experiments to show that SeqBoat has significantly better quality-efficiency trade-off than state-of-the-art hybrid models on a wide range of tasks, including Long Range Arena (LRA) [TDA+20], speech classification [War18] and language modeling [Hut06]. |
| Dataset Splits | Yes | We conduct comprehensive experiments to show that SeqBoat has significantly better quality-efficiency trade-off than state-of-the-art hybrid models on a wide range of tasks, including Long Range Arena (LRA) [TDA+20], speech classification [War18] and language modeling [Hut06]. We measure the mean and the standard deviation (plotted as error bars) of the activation time on 100 sequences randomly sampled from the validation set of each task. |
| Hardware Specification | Yes | All the experiments are conducted on a mixed cluster with 8 NVIDIA V100 32GB GPUs and 2 NVIDIA A5000 24GB GPUs. |
| Software Dependencies | No | The PyTorch-like [PGM+19] code snippets of the compress and extract operators are provided in Appendix A.1, with efficient support for batched sequences using the scatter operation. For the Long Range Arena (LRA) and Speech Command tasks, we use the AdamW [LH18] optimizer. For language modeling tasks, we use the RAdam [LJH+19] optimizer. No specific version numbers for PyTorch or the optimizers are provided. (A hedged optimizer-setup sketch follows this table.) |
| Experiment Setup | Yes | Table 4: Hyper-parameter settings of our SeqBoat model for the LRA benchmark and the Speech Command (SC) dataset. DP is the dropout rate, BSZ is batch size, LR is learning rate, WD is weight decay, and Pre-N is pre-normalization. Table 5: Hyper-parameters of our SeqBoat model for language modeling. |
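
The "Pseudocode" row points to the paper's Listings 1 and 2 (PyTorch-like Compress and Extract operators), which are not reproduced in this report. Purely as a rough illustration, the sketch below shows one way such batched compress/extract operators can be written with scatter and gather; the function names, signatures, and left-aligned zero-padding convention are assumptions of this report, not the authors' code.

```python
import torch


def compress(x, mask):
    """Pack the activated positions (mask == True) of each sequence into a
    left-aligned, zero-padded tensor.

    x:    (B, L, D) input sequence
    mask: (B, L) boolean activation mask
    Returns the packed (B, L_max, D) tensor and the per-position target
    indices, which `extract` reuses to restore the original layout.
    """
    B, L, D = x.shape
    # Slot of each activated token inside its packed sequence (0, 1, 2, ...).
    idx = (torch.cumsum(mask.long(), dim=1) - 1).clamp(min=0)   # (B, L)
    L_max = int(mask.sum(dim=1).max().item())                   # longest packed length
    packed = x.new_zeros(B, max(L_max, 1), D)
    # Zero out non-activated tokens so they contribute nothing, then
    # scatter-add every token into its packed slot.
    src = x * mask.unsqueeze(-1)
    packed.scatter_add_(1, idx.unsqueeze(-1).expand(-1, -1, D), src)
    return packed, idx


def extract(y, idx, mask, L):
    """Place processed packed tokens back at their original positions;
    non-activated positions are filled with zeros."""
    B, _, D = y.shape
    out = y.new_zeros(B, L, D)
    gathered = torch.gather(y, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return torch.where(mask.unsqueeze(-1), gathered, out)
```

A toy round trip under these assumptions:

```python
x = torch.randn(2, 6, 4)
mask = torch.tensor([[1, 0, 1, 1, 0, 0],
                     [0, 1, 0, 0, 0, 1]], dtype=torch.bool)
packed, idx = compress(x, mask)          # packed has shape (2, 3, 4)
y = packed * 2.0                         # stand-in for the activated sub-module
restored = extract(y, idx, mask, L=6)    # zeros at non-activated positions
```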
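
The "Software Dependencies" row names the optimizers but no library versions. As a hedged sketch only, the snippet below shows how that setup might look in stock PyTorch; the model and the learning-rate/weight-decay values are placeholders, not the paper's settings (those are listed in its Tables 4 and 5).

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # placeholder, not a SeqBoat model

# LRA / Speech Commands: AdamW, as stated in the paper (placeholder hyper-parameters).
opt_lra_sc = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Language modeling: RAdam (available as torch.optim.RAdam since PyTorch 1.10).
opt_lm = torch.optim.RAdam(model.parameters(), lr=1e-3, weight_decay=0.01)
```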