Reparameterized Multi-Resolution Convolutions for Long Sequence Modelling
Authors: Jake Cunningham, Giorgio Giannone, Mingtian Zhang, Marc Deisenroth
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate state-of-the-art performance on the Long Range Arena, Sequential CIFAR, and Speech Commands tasks among convolution models and linear-time transformers. |
| Researcher Affiliation | Collaboration | Harry Jake Cunningham (University College London); Giorgio Giannone (University College London; Amazon); Mingtian Zhang (University College London); Marc Peter Deisenroth (University College London) |
| Pseudocode | Yes | Algorithm 1 (MRConv, Dilated), Algorithm 2 (MRConv, Fourier), and Algorithm 3 (MRConv, Aggregation) in Appendix C. An illustrative sketch of the dilated variant is given below the table. |
| Open Source Code | Yes | We use open-sourced datasets and release the code used to run our experiments. |
| Open Datasets | Yes | The Long Range Arena (LRA) benchmark [44] evaluates the performance of sequence models on long-range modelling tasks on a wide range of data modalities and sequence lengths from 1,024 to 16,000. ... We also evaluate MRConv on the sequential CIFAR (sCIFAR) image classification task... The Speech Commands (SC) dataset [47] contains 1s sound recordings... To evaluate MRConv on a large-scale task, we employ the ImageNet classification benchmark [41]... |
| Dataset Splits | Yes | There are 96,000 training examples, 2,000 validation examples and 2,000 test sequences. (ListOps splits, Appendix D.3) |
| Hardware Specification | Yes | The evaluation was conducted on an NVIDIA A100-40GB GPU... All LRA, sCIFAR and Speech Commands experiments were run using a single 40GB A100 GPU, apart from Retrieval, Path-X and Speech Commands, where two 40GB A100s are used. Eight V100 GPUs were used for training on ImageNet classification. The throughput is measured... on a single 24GB 3090 GPU... |
| Software Dependencies | No | The paper mentions software such as FlashAttention, FlashFFTConv, and PyTorch, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | Table 5 presents the highest performing hyperparameters for each base model used for each experiment. For all experiments we ensure that the total number of trainable parameters stays comparable with baseline methods. ... We follow the optimization approach presented in [23], which uses the AdamW optimizer with a global learning rate and weight decay and a separate, smaller learning rate with no weight decay specifically for the kernel parameters. All experiments use a cosine annealing learning rate schedule with linear warmup. (A sketch of this parameter-group setup is given below the table.) |
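
To make the pseudocode row more concrete, here is a minimal PyTorch sketch of a multi-resolution dilated convolution in the spirit of Algorithm 1 (MRConv, Dilated): several short depthwise kernels at exponentially increasing dilations, mixed by learned per-branch weights. The class name, hyperparameter values, and softmax mixing are illustrative assumptions, not taken from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRConvDilatedSketch(nn.Module):
    """Illustrative multi-resolution depthwise 1-D convolution.

    Branch i applies a short depthwise kernel at dilation 2**i, so its
    receptive field grows roughly as kernel_size * 2**i. Branch outputs
    are combined with learned scalar weights (softmax-normalised here).
    """

    def __init__(self, channels: int, kernel_size: int = 3, num_branches: int = 4):
        super().__init__()
        self.dilations = [2 ** i for i in range(num_branches)]
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, groups=channels, bias=False)
            for d in self.dilations
        )
        # One learnable mixing weight per branch/resolution.
        self.alpha = nn.Parameter(torch.zeros(num_branches))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); left-pad each branch for causality.
        weights = torch.softmax(self.alpha, dim=0)
        out = torch.zeros_like(x)
        for w, d, conv in zip(weights, self.dilations, self.branches):
            pad = d * (conv.kernel_size[0] - 1)
            out = out + w * conv(F.pad(x, (pad, 0)))
        return out


if __name__ == "__main__":
    layer = MRConvDilatedSketch(channels=64)
    y = layer(torch.randn(2, 64, 1024))
    print(y.shape)  # torch.Size([2, 64, 1024])
```

Because every branch is a linear depthwise convolution, the weighted sum can in principle be folded into a single equivalent kernel at inference time, which mirrors the structural reparameterization the paper's title refers to.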
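
The experiment-setup row describes the optimization recipe at a high level; the sketch below shows one way to realise it in PyTorch, with two parameter groups (a smaller learning rate and no weight decay for kernel parameters) and cosine annealing with linear warmup. The learning rates, step counts, and the name-based `kernel_keyword` matching are placeholder assumptions, not values from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR


def build_optimizer(model: torch.nn.Module,
                    lr: float = 1e-3, kernel_lr: float = 1e-4,
                    weight_decay: float = 0.05,
                    warmup_steps: int = 1000, total_steps: int = 100_000,
                    kernel_keyword: str = "kernel"):
    """Two parameter groups: kernel parameters get a smaller learning rate
    and no weight decay; everything else uses the global settings.
    Matching kernel parameters by name is an illustrative shortcut."""
    kernel_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (kernel_params if kernel_keyword in name else other_params).append(p)

    optimizer = AdamW([
        {"params": other_params, "lr": lr, "weight_decay": weight_decay},
        {"params": kernel_params, "lr": kernel_lr, "weight_decay": 0.0},
    ])

    # Linear warmup followed by cosine annealing.
    warmup = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_steps])
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimization step so that the warmup and cosine phases advance with the global step count.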