Reparameterized Multi-Resolution Convolutions for Long Sequence Modelling
Authors: Jake Cunningham, Giorgio Giannone, Mingtian Zhang, Marc Deisenroth
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate state-of-the-art performance on the Long Range Arena, Sequential CIFAR, and Speech Commands tasks among convolution models and linear-time transformers. |
| Researcher Affiliation | Collaboration | Harry Jake Cunningham (University College London); Giorgio Giannone (University College London; Amazon); Mingtian Zhang (University College London); Marc Peter Deisenroth (University College London) |
| Pseudocode | Yes | Algorithm 1 (MRConv, Dilated), Algorithm 2 (MRConv, Fourier), and Algorithm 3 (MRConv, Aggregation) in Appendix C. An illustrative sketch of the dilated variant is given below the table. |
| Open Source Code | Yes | We use open-sourced datasets and release the code used to run our experiments. |
| Open Datasets | Yes | The Long Range Arena (LRA) benchmark [44] evaluates the performance of sequence models on long-range modelling tasks on a wide range of data modalities and sequence lengths from 1,024 to 16,000. ... We also evaluate MRConv on the sequential CIFAR (sCIFAR) image classification task... The Speech Commands (SC) dataset [47] contains 1s sound recordings... To evaluate MRConv on a large-scale task, we employ the ImageNet classification benchmark [41]... |
| Dataset Splits | Yes | There are 96,000 training examples, 2,000 validation examples and 2,000 test sequences. (ListOps splits, Appendix D.3) |
| Hardware Specification | Yes | The evaluation was conducted on an NVIDIA A100-40GB GPU... All LRA, sCIFAR and Speech Commands experiments were run using a single 40GB A100 GPU, apart from Retrieval, Path-X and Speech Commands, where two 40GB A100s are used. Eight V100 GPUs were used for training on ImageNet classification. The throughput is measured... on a single 24GB 3090 GPU... |
| Software Dependencies | No | The paper mentions software such as FlashAttention, FlashFFTConv, and PyTorch, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | Table 5 presents the highest performing hyperparameters for each base model used for each experiment. For all experiments we ensure that the total number of trainable parameters stays comparable with baseline methods. ... We follow the optimization approach presented in [23], which uses the AdamW optimizer with a global learning rate and weight decay and a separate, smaller learning rate with no weight decay specifically for the kernel parameters. All experiments use a cosine annealing learning rate schedule with linear warmup. (A sketch of this parameter-group setup is given below the table.) |
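
To make the pseudocode row more concrete, here is a minimal PyTorch sketch of a multi-resolution dilated convolution in the spirit of Algorithm 1 (MRConv, Dilated): several short depthwise kernels at exponentially increasing dilations, mixed by learned per-branch weights. The class name, hyperparameter values, and softmax mixing are illustrative assumptions, not taken from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRConvDilatedSketch(nn.Module):
    """Illustrative multi-resolution depthwise 1-D convolution.

    Branch i applies a short depthwise kernel at dilation 2**i, so its
    receptive field grows roughly as kernel_size * 2**i. Branch outputs
    are combined with learned scalar weights (softmax-normalised here).
    """

    def __init__(self, channels: int, kernel_size: int = 3, num_branches: int = 4):
        super().__init__()
        self.dilations = [2 ** i for i in range(num_branches)]
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, groups=channels, bias=False)
            for d in self.dilations
        )
        # One learnable mixing weight per branch/resolution.
        self.alpha = nn.Parameter(torch.zeros(num_branches))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); left-pad each branch for causality.
        weights = torch.softmax(self.alpha, dim=0)
        out = torch.zeros_like(x)
        for w, d, conv in zip(weights, self.dilations, self.branches):
            pad = d * (conv.kernel_size[0] - 1)
            out = out + w * conv(F.pad(x, (pad, 0)))
        return out


if __name__ == "__main__":
    layer = MRConvDilatedSketch(channels=64)
    y = layer(torch.randn(2, 64, 1024))
    print(y.shape)  # torch.Size([2, 64, 1024])
```

Because every branch is a linear depthwise convolution, the weighted sum can in principle be folded into a single equivalent kernel at inference time, which mirrors the structural reparameterization the paper's title refers to.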
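
The experiment-setup row describes the optimization recipe at a high level; the sketch below shows one way to realise it in PyTorch, with two parameter groups (a smaller learning rate and no weight decay for kernel parameters) and cosine annealing with linear warmup. The learning rates, step counts, and the name-based `kernel_keyword` matching are placeholder assumptions, not values from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR


def build_optimizer(model: torch.nn.Module,
                    lr: float = 1e-3, kernel_lr: float = 1e-4,
                    weight_decay: float = 0.05,
                    warmup_steps: int = 1000, total_steps: int = 100_000,
                    kernel_keyword: str = "kernel"):
    """Two parameter groups: kernel parameters get a smaller learning rate
    and no weight decay; everything else uses the global settings.
    Matching kernel parameters by name is an illustrative shortcut."""
    kernel_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (kernel_params if kernel_keyword in name else other_params).append(p)

    optimizer = AdamW([
        {"params": other_params, "lr": lr, "weight_decay": weight_decay},
        {"params": kernel_params, "lr": kernel_lr, "weight_decay": 0.0},
    ])

    # Linear warmup followed by cosine annealing.
    warmup = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_steps])
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimization step so that the warmup and cosine phases advance with the global step count.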