What Makes Convolutional Models Great on Long Sequence Modeling?

Authors: Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, Debadeepta Dey

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, SGConv improves S4 by more than 1% and achieves SoTA results on the LRA benchmark. On Speech Command datasets, SGConv achieves comparative results in the ten-class classification task and significantly better results in the 35-class classification task upon previous SoTA." From Section 4 (Experiments): "In this section, we first test the effectiveness of SGConv on two standard long sequence modeling tasks, i.e., Long Range Arena (Tay et al., 2020b) and Speech Commands (Warden, 2018), and compare it with S4 and other baselines. We also conduct ablation studies over the decay speed and scale dimension d and evaluate the speed of SGConv on LRA."
Researcher Affiliation | Collaboration | Yuhong Li (1), Tianle Cai (2), Yi Zhang (3), Deming Chen (1), Debadeepta Dey (3); (1) University of Illinois Urbana-Champaign, (2) Princeton University, (3) Microsoft Research.
Pseudocode | Yes | Appendix B.2, "Python style pseudo-code" (a hedged re-implementation sketch appears after this table).
Open Source Code | Yes | "Code is available."
Open Datasets | Yes | Long Range Arena (LRA) (Tay et al., 2020b), Speech Commands (SC) dataset (Warden, 2018), WikiText-103 (Merity et al., 2016), GLUE benchmark (Wang et al., 2019), BooksCorpus (Zhu et al., 2015) and English Wikipedia (Foundation), ImageNet-1k (Deng et al., 2009).
Dataset Splits | Yes | Long Range Arena (Tay et al., 2020b), Speech Commands (Warden, 2018), WikiText-103 (Merity et al., 2016), GLUE benchmark (Wang et al., 2019), and ImageNet-1k (Deng et al., 2009) are standard benchmarks with predefined splits.
Hardware Specification | No | Table 2 compares the inference and backpropagation time (ms/batch) of S4 and SGConv blocks (128 channels, batch size 64) on CPU and GPU, but the paper only says "CPU" and "GPU" without naming specific models or configurations.
Software Dependencies | No | The paper mentions general deep learning frameworks such as TensorFlow, MXNet, MindSpore, and PaddlePaddle, and implies PyTorch (via `F.interpolate` and `torch.fft`), but it does not specify version numbers for any of these dependencies.
Experiment Setup | Yes | Table 6 lists the detailed hyperparameters used in LRA. "In most experiments, we set α to 1/2, which approximately decays in speed 1/pos" (see the decay note after this table). Table 2 compares the inference and backpropagation time (ms/batch) of S4 and SGConv blocks (128 channels, batch size 64). "We set both the attention and memory length to 384 for 18L model and 192 for 16L model... The SGConv has 96 as the scale dimension... except the batch size which is reduced to 64."
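
The paper's Appendix B.2 gives Python-style pseudo-code, which this report does not reproduce. The following is a minimal sketch of an SGConv-style layer, assuming the multi-scale construction described in the paper: small learnable sub-kernels of scale dimension d, upsampled with `F.interpolate` to cover exponentially longer spans, damped by a decay factor α, concatenated, normalized, and applied as a depthwise convolution via `torch.fft`. Class and parameter names (`SGConvSketch`, `kernel_dim`, `alpha`, `num_scales`) are illustrative assumptions, not the authors' API.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SGConvSketch(nn.Module):
    """Hypothetical multi-scale global convolution in the spirit of SGConv."""

    def __init__(self, channels, seq_len, kernel_dim=96, alpha=0.5):
        super().__init__()
        self.seq_len = seq_len
        self.alpha = alpha
        # Enough scales so the concatenated kernel covers seq_len:
        # sub-kernel lengths are d, d, 2d, 4d, ..., i.e. total d * 2**(S-1).
        self.num_scales = 1 + max(1, math.ceil(math.log2(max(seq_len / kernel_dim, 1.0))))
        # One small learnable sub-kernel of dimension d per scale (one kernel per channel).
        self.sub_kernels = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(channels, 1, kernel_dim))
             for _ in range(self.num_scales)]
        )

    def build_kernel(self):
        pieces = []
        for i, w in enumerate(self.sub_kernels):
            # Later scales are upsampled to cover exponentially longer spans ...
            factor = 2 ** max(i - 1, 0)  # 1, 1, 2, 4, 8, ...
            piece = F.interpolate(w, scale_factor=factor, mode="linear", align_corners=False)
            # ... and damped by alpha**i; with alpha = 1/2 the kernel magnitude
            # at position pos decays roughly like 1/pos (see the note below).
            pieces.append(piece * (self.alpha ** i))
        k = torch.cat(pieces, dim=-1)[..., : self.seq_len]  # (C, 1, L)
        return k / k.norm(dim=-1, keepdim=True)             # normalize per channel

    def forward(self, x):  # x: (B, C, L)
        L = x.shape[-1]
        k = self.build_kernel().squeeze(1)                  # (C, L)
        # Linear convolution via FFT; zero-pad to 2L to avoid circular wrap-around.
        y = torch.fft.irfft(torch.fft.rfft(x, n=2 * L) * torch.fft.rfft(k, n=2 * L), n=2 * L)
        return y[..., :L]
```

For example, `SGConvSketch(channels=128, seq_len=1024)(torch.randn(64, 128, 1024))` returns a (64, 128, 1024) tensor; batch size 64 and 128 channels mirror the Table 2 setting quoted above.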
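
On the α = 1/2 remark in the Experiment Setup row: under the multi-scale construction assumed in the sketch above (not a derivation quoted from the paper), a position pos falls in roughly the log2(pos/d)-th sub-kernel, whose weights carry a factor α^i, so

    |k[pos]| ∝ α^i ≈ α^(log2(pos/d)) = (pos/d)^(log2 α) = d/pos  when α = 1/2,

i.e., the kernel magnitude decays at roughly speed 1/pos, matching the quoted statement.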