Simple Hardware-Efficient Long Convolutions for Sequence Modeling
Authors: Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate how well long convolutions perform in a variety of challenging sequence modeling tasks from diverse modalities and benchmarks, including the Long Range Arena benchmark, image classification, text modeling, and brain data modeling (Section 4.1). Table 4 shows the results for long convolutions on the LRA benchmark. Tables 5 and 6 show the results. On 1D image classification, long convolutions again match the performance of S4, even with random initializations, while their performance improves further by 1.1 points when using the geometric initialization. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, Stanford, CA, USA; (2) Institute of Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA; (3) Department of Bioengineering, Stanford University, Stanford, CA, USA; (4) Department of Psychology, Stanford University, Stanford, CA, USA; (5) Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA. |
| Pseudocode | Yes | Algorithm 1 (Regularized Long Convolution) and Algorithm 2 (FLASHBUTTERFLY); a hedged code sketch follows the table. |
| Open Source Code | Yes | Our code is available at https://github.com/HazyResearch/safari. |
| Open Datasets | Yes | Long Range Arena (LRA) (Tay et al., 2020), OpenWebText (Gokaslan et al., 2019) and the Pile (Gao et al., 2020), ETTh1, a real-world long sequence time series forecasting task from the Informer benchmark (Zhou et al., 2021). |
| Dataset Splits | Yes | We randomly divide the upstream data, which spans fMRI data from 11,980 experimental runs of 1,726 individuals, into distinct training and validation datasets by randomly designating 5% of the fMRI runs as validation data and using the rest of the runs for training. We randomly split each of the two downstream datasets into distinct training (90% of fMRI runs) and test (10% of fMRI runs) datasets and adapt models for 1,000 training steps at a mini-batch size of 256 and a learning rate of 5e-5 (otherwise using the same learning parameters as for upstream training). A minimal split sketch follows the table. |
| Hardware Specification | Yes | The LRA experiments, except for Path-X, were swept on a heterogeneous cluster of 1x V100 and 2x V100 nodes. Path-X and sequential CIFAR were run on single 8x A100 nodes. The language modeling experiments were run on a single 8x A100 node. The time series experiments were run on a cluster with 1x P100 nodes. The brain fMRI experiments were run on a cluster of 2x V100 nodes. |
| Software Dependencies | No | The paper mentions software like 'cuFFT' and 'PyTorch' but does not specify their version numbers, which are necessary for reproducible software dependencies. It also mentions 'Hugging Face implementation' without a specific version. |
| Experiment Setup | Yes | Table 16. The values of the best hyperparameters found for LRA, images, language, time series, and brain fMRI. LR is learning rate and WD is weight decay. BN and LN refer to Batch Normalization and Layer Normalization. We use random weight initialization in all runs. This table includes Kernel Dropout, Kernel LR, λ, Batch Size, WD, Epochs, and LR for various tasks. |
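
The Pseudocode row above names Algorithm 1 (Regularized Long Convolution). The authors' implementation is in the linked safari repository; the snippet below is only a minimal PyTorch sketch of the idea, with the soft-threshold value `lam` and all function names chosen here for illustration rather than taken from the paper's code.

```python
# Minimal sketch (not the authors' code) of a regularized long convolution:
# squash the kernel weights toward zero, then convolve with the input via FFTs.
import torch
import torch.nn.functional as F

def squash(k, lam=0.003):
    # Soft-threshold the kernel weights; lam is an illustrative value.
    return torch.sign(k) * F.relu(torch.abs(k) - lam)

def long_conv(u, k, lam=0.003):
    # u: (batch, length) input, k: (length,) learned convolution kernel.
    L = u.shape[-1]
    k = squash(k, lam)
    # Zero-padded FFT convolution over 2L points to avoid circular wrap-around.
    u_f = torch.fft.rfft(u, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
```

FLASHBUTTERFLY (Algorithm 2) is the hardware-efficient version of this FFT convolution, restructured as butterfly matrix multiplications to better use GPU tensor cores; that kernel-level rewrite is not reproduced in this sketch.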
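The Dataset Splits row describes run-level random splits for the brain fMRI experiments. Below is a minimal sketch of that bookkeeping, assuming run identifiers are available as a list; the function and variable names are illustrative, not from the authors' code.

```python
import random

def split_runs(run_ids, held_out_frac, seed=0):
    # Randomly hold out a fraction of fMRI runs (5% for upstream validation,
    # 10% for the downstream test sets) and keep the rest for training.
    rng = random.Random(seed)
    ids = list(run_ids)
    rng.shuffle(ids)
    n_held_out = int(len(ids) * held_out_frac)
    return ids[n_held_out:], ids[:n_held_out]  # (train, held-out)

# Upstream: 95%/5% train/validation split over the 11,980 runs.
# Downstream: 90%/10% train/test split, then adapt for 1,000 steps
# at mini-batch size 256 and learning rate 5e-5.
```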