Reparameterized Multi-Resolution Convolutions for Long Sequence Modelling

Authors: Jake Cunningham, Giorgio Giannone, Mingtian Zhang, Marc Deisenroth

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate state-of-the-art performance on the Long Range Arena, Sequential CIFAR, and Speech Commands tasks among convolution models and linear-time transformers."
Researcher Affiliation | Collaboration | Harry Jake Cunningham (University College London); Giorgio Giannone (University College London, Amazon); Mingtian Zhang (University College London); Marc Peter Deisenroth (University College London)
Pseudocode | Yes | Algorithm 1: MRConv, Dilated; Algorithm 2: MRConv, Fourier; Algorithm 3: MRConv, Aggregation (Appendix C). A minimal sketch of the multi-resolution scheme follows the table.
Open Source Code | Yes | "We use open-sourced datasets and release the code used to run our experiments."
Open Datasets | Yes | "The Long Range Arena (LRA) benchmark [44] evaluates the performance of sequence models on long-range modelling tasks on a wide range of data modalities and sequence lengths from 1,024 to 16,000. ... We also evaluate MRConv on the sequential CIFAR (sCIFAR) image classification task... The Speech Commands (SC) dataset [47] contains 1s sound recordings... To evaluate MRConv on a large-scale task, we employ the ImageNet classification benchmark [41]..."
Dataset Splits | Yes | "There are 96,000 training examples, 2,000 validation examples and 2,000 test sequences." (ListOps, Appendix D.3)
Hardware Specification | Yes | "The evaluation was conducted on an NVIDIA A100-40GB GPU... All LRA, sCIFAR and Speech Commands experiments were run using a single 40GB A100 GPU, apart from Retrieval, Path-X and Speech Commands, where we use two 40GB A100s. We used 8 V100s for training (ImageNet classification). The throughput is measured... on a single 24GB 3090 GPU..."
Software Dependencies | No | The paper mentions software such as FlashAttention, FlashFFTConv, and PyTorch, but does not provide version numbers for these or other key software dependencies.
Experiment Setup | Yes | "Table 5 presents the highest-performing hyperparameters for each base model used for each experiment. For all experiments we ensure that the total number of trainable parameters stays comparable with baseline methods. ... We follow the optimization approach presented in [23], which uses the AdamW optimizer with a global learning rate and weight decay, and a separate smaller learning rate with no weight decay specifically for the kernel parameters. All experiments use a cosine annealing learning rate schedule with linear warmup." A sketch of this optimizer configuration follows the table.
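
The paper's pseudocode (Algorithms 1–3) covers dilated and Fourier kernel parameterizations plus branch aggregation. Below is a minimal sketch of the multi-resolution idea, not the authors' implementation: it assumes a depthwise layout, dilations that double per branch, and softmax-normalized aggregation weights, and it omits the Fourier parameterization and the inference-time reparameterization into a single kernel.

```python
import torch
import torch.nn.functional as F


class MRConvSketch(torch.nn.Module):
    """Hypothetical multi-resolution depthwise 1D convolution.

    Branch i applies a short kernel with dilation 2**i, so each branch
    covers a progressively longer context; branch outputs are combined
    with learned, softmax-normalized aggregation weights.
    """

    def __init__(self, channels: int, kernel_size: int = 4, num_branches: int = 4):
        super().__init__()
        # One short depthwise kernel per resolution branch.
        self.kernels = torch.nn.ParameterList(
            [torch.nn.Parameter(0.02 * torch.randn(channels, 1, kernel_size))
             for _ in range(num_branches)]
        )
        self.alpha = torch.nn.Parameter(torch.zeros(num_branches))  # aggregation logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length)
        weights = torch.softmax(self.alpha, dim=0)
        out = torch.zeros_like(x)
        for i, kernel in enumerate(self.kernels):
            dilation = 2 ** i
            pad = dilation * (kernel.shape[-1] - 1)  # causal left-padding
            y = F.conv1d(F.pad(x, (pad, 0)), kernel,
                         groups=x.shape[1], dilation=dilation)
            out = out + weights[i] * y
        return out
```

For example, `MRConvSketch(64)(torch.randn(2, 64, 1024))` returns a tensor of shape `(2, 64, 1024)`; with four branches the largest dilation is 8, giving a receptive field of 25 steps from kernels of length 4.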
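The quoted setup (a global AdamW learning rate and weight decay, a separate smaller learning rate with no weight decay for kernel parameters, and cosine annealing with linear warmup) maps naturally onto PyTorch parameter groups. The sketch below assumes kernel parameters can be identified by the substring "kernel" in their names; that selector and the hyperparameter values are illustrative, not the paper's.

```python
import torch


def build_optimizer_and_scheduler(model, lr=3e-3, weight_decay=0.05,
                                  kernel_lr=1e-3, warmup_steps=1_000,
                                  total_steps=100_000):
    # Assumed naming convention: kernel parameters contain "kernel".
    # They get a smaller learning rate and no weight decay, per the
    # optimization recipe the paper follows from its ref. [23].
    kernel_params = [p for n, p in model.named_parameters() if "kernel" in n]
    other_params = [p for n, p in model.named_parameters() if "kernel" not in n]

    optimizer = torch.optim.AdamW([
        {"params": other_params, "lr": lr, "weight_decay": weight_decay},
        {"params": kernel_params, "lr": kernel_lr, "weight_decay": 0.0},
    ])

    # Linear warmup into cosine annealing, as described in the setup.
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-2, total_iters=warmup_steps)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps - warmup_steps)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])
    return optimizer, scheduler
```

Grouping parameters this way keeps a single optimizer and scheduler while still honouring the smaller, decay-free learning rate for the convolution kernels.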