What Makes Convolutional Models Great on Long Sequence Modeling?
Authors: Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, Debadeepta Dey
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, SGConv improves on S4 by more than 1% and achieves SoTA results on the LRA benchmark. On the Speech Commands dataset, SGConv achieves comparable results on the ten-class classification task and significantly better results than the previous SoTA on the 35-class classification task. 4 EXPERIMENTS In this section, we first test the effectiveness of SGConv on two standard long sequence modeling tasks, i.e., Long Range Arena (Tay et al., 2020b) and Speech Commands (Warden, 2018), and compare it with S4 and other baselines. We also conduct ablation studies over the decay speed and scale dimension d and evaluate the speed of SGConv on LRA. |
| Researcher Affiliation | Collaboration | Yuhong Li1 Tianle Cai2 Yi Zhang3 Deming Chen1 Debadeepta Dey3 1University of Illinois Urbana-Champaign, 2Princeton University, 3Microsoft Research. |
| Pseudocode | Yes | B.2 PYTHON STYLE PSEUDO-CODE |
| Open Source Code | Yes | Code is available. |
| Open Datasets | Yes | Long Range Arena (LRA) (Tay et al., 2020b), Speech Commands (SC) dataset (Warden, 2018), WikiText-103 (Merity et al., 2016), GLUE benchmark (Wang et al., 2019), BooksCorpus (Zhu et al., 2015) and English Wikipedia (Foundation), ImageNet-1k (Deng et al., 2009) |
| Dataset Splits | Yes | Long Range Arena (Tay et al., 2020b), Speech Commands (Warden, 2018), WikiText-103 (Merity et al., 2016), GLUE benchmark (Wang et al., 2019), ImageNet-1k (Deng et al., 2009) - These are standard benchmarks with predefined splits. |
| Hardware Specification | No | Table 2: Comparison of the inference and backpropagation time (ms/batch) of S4 and SGConv blocks (number of channels 128, batch size 64) on CPU and GPU. The paper only mentions "CPU" and "GPU" without specifying particular models or configurations. |
| Software Dependencies | No | The paper mentions general deep learning frameworks like TensorFlow, MXNet, MindSpore, PaddlePaddle, and implies PyTorch (via `F.interpolate` and `torch.fft`), but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Table 6 lists the detailed hyperparameters used in LRA. In most experiments, α is set to 1/2, which corresponds to a decay speed of approximately 1/pos. Table 2: Comparison of the inference and backpropagation time (ms/batch) of S4 and SGConv blocks (number of channels 128, batch size 64). Both the attention and memory length are set to 384 for the 18L model and 192 for the 16L model... SGConv uses 96 as the scale dimension... except the batch size, which is reduced to 64. |
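The dependency notes above mention `F.interpolate` and `torch.fft`, which hint at the core computation behind SGConv-style models: a global convolution kernel, built from a small number of parameters at multiple doubling scales with a decay factor, applied to the whole sequence via FFT in O(L log L). The NumPy sketch below is illustrative only; the kernel construction in `sgconv_kernel` and its parameter names (`d`, `alpha`) are assumptions in the spirit of the paper, not its exact implementation (which uses PyTorch).

```python
import numpy as np

def fft_conv(u, k):
    """Causal convolution of signal u (length L) with a global kernel k
    (length L) via FFT -- O(L log L) instead of O(L^2)."""
    L = len(u)
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)[:L]

def sgconv_kernel(L, d=4, alpha=0.5):
    """Hypothetical multi-scale decayed kernel: concatenate sub-kernels of
    doubling length, each built by upsampling d learned weights
    (cf. F.interpolate) and scaled by alpha**i, then truncate to length L
    and normalize. Parameter names are illustrative, not the paper's."""
    rng = np.random.default_rng(0)
    parts, length, i = [], d, 0
    while sum(len(p) for p in parts) < L:
        w = rng.standard_normal(d)  # stand-in for the d learned weights
        # linear upsampling of the d weights to this scale's length
        up = np.interp(np.linspace(0, d - 1, length), np.arange(d), w)
        parts.append((alpha ** i) * up)  # decay: later scales are damped
        length *= 2
        i += 1
    k = np.concatenate(parts)[:L]
    return k / np.linalg.norm(k)
```

Setting `alpha = 0.5` damps each doubled sub-kernel by half, which is the same knob the quoted setup table refers to: larger `alpha` means slower decay and a longer effective memory.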