Simple Hardware-Efficient Long Convolutions for Sequence Modeling
Authors: Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate how well long convolutions perform in a variety of challenging sequence modeling tasks from diverse modalities and benchmarks, including the Long Range Arena benchmark, image classification, text modeling, and brain data modeling (Section 4.1). Table 4 shows the results for long convolutions on the LRA benchmark. Tables 5 and 6 show the results. On 1D image classification, long convolutions again match the performance of S4, even with random initializations, while their performance improves further by 1.1 points when using the geometric initialization. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, Stanford, CA, USA; (2) Institute of Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA; (3) Department of Bioengineering, Stanford University, Stanford, CA, USA; (4) Department of Psychology, Stanford University, Stanford, CA, USA; (5) Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA. |
| Pseudocode | Yes | Algorithm 1 (Regularized Long Convolution) and Algorithm 2 (FLASHBUTTERFLY); a hedged code sketch follows the table. |
| Open Source Code | Yes | Our code is available at https://github.com/HazyResearch/safari. |
| Open Datasets | Yes | Long Range Arena (LRA) (Tay et al., 2020), OpenWebText (Gokaslan et al., 2019) and the Pile (Gao et al., 2020), ETTh1, a real-world long sequence time series forecasting task from the Informer benchmark (Zhou et al., 2021). |
| Dataset Splits | Yes | We randomly divide the upstream data, which spans fMRI data from 11,980 experimental runs of 1,726 individuals, into distinct training and validation datasets by randomly designating 5% of the fMRI runs as validation data and using the rest of the runs for training. We randomly split each of the two downstream datasets into distinct training (90% of fMRI runs) and test (10% of fMRI runs) datasets and adapt models for 1,000 training steps at a mini-batch size of 256 and a learning rate of 5e-5 (otherwise using the same learning parameters as for upstream training). A minimal split sketch follows the table. |
| Hardware Specification | Yes | The LRA experiments, except for Path-X, were swept on a heterogeneous cluster of 1x V100 and 2x V100 nodes. Path-X and sequential CIFAR were run on single 8x A100 nodes. The language modeling experiments were run on a single 8x A100 node. The time series experiments were run on a cluster with 1x P100 nodes. The brain fMRI experiments were run on a cluster of 2x V100 nodes. |
| Software Dependencies | No | The paper mentions software like 'cuFFT' and 'PyTorch' but does not specify their version numbers, which are necessary for reproducible software dependencies. It also mentions 'Hugging Face implementation' without a specific version. |
| Experiment Setup | Yes | Table 16. The values of the best hyperparameters found for LRA, images, language, time series, and brain fMRI. LR is learning rate and WD is weight decay. BN and LN refer to Batch Normalization and Layer Normalization. We use random weight initialization in all runs. This table includes Kernel Dropout, Kernel LR, λ, Batch Size, WD, Epochs, and LR for various tasks. |
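
The Pseudocode row above names Algorithm 1 (Regularized Long Convolution). The authors' implementation is in the linked safari repository; the snippet below is only a minimal PyTorch sketch of the idea, with the soft-threshold value `lam` and all function names chosen here for illustration rather than taken from the paper's code.

```python
# Minimal sketch (not the authors' code) of a regularized long convolution:
# squash the kernel weights toward zero, then convolve with the input via FFTs.
import torch
import torch.nn.functional as F

def squash(k, lam=0.003):
    # Soft-threshold the kernel weights; lam is an illustrative value.
    return torch.sign(k) * F.relu(torch.abs(k) - lam)

def long_conv(u, k, lam=0.003):
    # u: (batch, length) input, k: (length,) learned convolution kernel.
    L = u.shape[-1]
    k = squash(k, lam)
    # Zero-padded FFT convolution over 2L points to avoid circular wrap-around.
    u_f = torch.fft.rfft(u, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
```

FLASHBUTTERFLY (Algorithm 2) is the hardware-efficient version of this FFT convolution, restructured as butterfly matrix multiplications to better use GPU tensor cores; that kernel-level rewrite is not reproduced in this sketch.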
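The Dataset Splits row describes run-level random splits for the brain fMRI experiments. Below is a minimal sketch of that bookkeeping, assuming run identifiers are available as a list; the function and variable names are illustrative, not from the authors' code.

```python
import random

def split_runs(run_ids, held_out_frac, seed=0):
    # Randomly hold out a fraction of fMRI runs (5% for upstream validation,
    # 10% for the downstream test sets) and keep the rest for training.
    rng = random.Random(seed)
    ids = list(run_ids)
    rng.shuffle(ids)
    n_held_out = int(len(ids) * held_out_frac)
    return ids[n_held_out:], ids[:n_held_out]  # (train, held-out)

# Upstream: 95%/5% train/validation split over the 11,980 runs.
# Downstream: 90%/10% train/test split, then adapt for 1,000 steps
# at mini-batch size 256 and learning rate 5e-5.
```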