Viewing Transformers Through the Lens of Long Convolutions Layers
Authors: Itamar Zimerman, Lior Wolf
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies. |
| Researcher Affiliation | Academia | 1The Blavatnik School of Computer Science, Tel Aviv University. Correspondence to: Itamar Zimerman <zimerman1@mail.tau.ac.il>, Lior Wolf <wolf@cs.tau.ac.il>. |
| Pseudocode | No | The paper describes algorithms and formulations in text and mathematical equations but does not include explicit pseudocode blocks or figures labeled as 'Algorithm' or 'Pseudocode'. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for its methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | The Long Range Arena (LRA) benchmark... This benchmark highlights that standard sequence models, such as transformers, perform poorly even on seemingly simple long-range tasks. As modern deep learning heavily relies on transformers, understanding why transformers do not perform well on these tasks, or how to improve those abilities is an essential research topic. |
| Dataset Splits | Yes | (i) We observe that when training vanilla transformers (equipped with positional encoding) on the LRA benchmarks including the validation set, large transformers can achieve near 100% accuracy, illustrating their capability to shatter the LRA benchmarks. |
| Hardware Specification | Yes | The experiments were executed on a single V100 GPU, each running for a maximum duration of two days. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and building its repository 'upon the existing S4 repository', but it does not specify version numbers for PyTorch or any other software libraries or dependencies, which would be needed for a reproducible software description. |
| Experiment Setup | Yes | The experimental setup remains consistent across all subsections and is described in detail in Appendix A. Additional experiments are introduced in the appendix... Our training procedure and hyperparameters remained aligned with the configurations pre-specified in the S4 repository for analogous tasks. Exceptions include modifications aimed at saving computational resources, such as reducing model width, decreasing the number of epochs, and adjusting batch size, which were not optimized. The learning rate was determined through a grid search over [1e-3, 1e-4] together with the original learning rate from the S4 repository for the corresponding task, and dropout was set to 0 in all experiments. The hyperparameters of the LaS attention layer are: (i) the value of B, which controls the values of αc, and (ii) the window size of the 1-D average pooling layer in the smooth operator, denoted by P. Hyperparameter tuning was executed via grid search on the following grid: B ∈ {0.0001, 0.001}, P ∈ {3, 5}. The final set of hyperparameters for each task is presented in Table 9. (The grid is sketched in code below this table.) |
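The hyperparameter sweep quoted in the Experiment Setup row can be written out as a small grid search. The sketch below is illustrative only: the `train_and_evaluate` function, the placeholder S4-repository learning rate, and the accuracy-based selection are assumptions, not the authors' actual tuning script; only the grid values (B, P, learning rates, dropout = 0) come from the quoted text.

```python
import itertools

# Grid quoted in the Experiment Setup row:
# B controls the decay values alpha_c of the LaS attention layer,
# P is the window size of the 1-D average pooling in the smooth operator.
B_VALUES = [0.0001, 0.001]
P_VALUES = [3, 5]

# Learning rates: the two grid values plus the rate pre-specified in the
# S4 repository for the corresponding task (placeholder value here).
S4_REPO_LR = 4e-3  # assumption: task-dependent default taken from the S4 configs
LEARNING_RATES = [1e-3, 1e-4, S4_REPO_LR]

DROPOUT = 0.0  # dropout was fixed to 0 in all experiments


def train_and_evaluate(B, P, lr, dropout):
    """Hypothetical stand-in for one training run on an LRA task.

    In practice this would launch a run of the model with the given
    hyperparameters and return accuracy on the task's validation split.
    """
    return 0.0  # placeholder so the sweep loop is runnable as-is


best = None
for B, P, lr in itertools.product(B_VALUES, P_VALUES, LEARNING_RATES):
    acc = train_and_evaluate(B=B, P=P, lr=lr, dropout=DROPOUT)
    if best is None or acc > best[0]:
        best = (acc, {"B": B, "P": P, "lr": lr})

print("Best configuration:", best)
```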