Viewing Transformers Through the Lens of Long Convolutions Layers
Authors: Itamar Zimerman, Lior Wolf
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies. |
| Researcher Affiliation | Academia | 1The Blavatnik School of Computer Science, Tel Aviv University. Correspondence to: Itamar Zimerman <zimerman1@mail.tau.ac.il>, Lior Wolf <wolf@cs.tau.ac.il>. |
| Pseudocode | No | The paper describes algorithms and formulations in text and mathematical equations but does not include explicit pseudocode blocks or figures labeled as 'Algorithm' or 'Pseudocode'. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for its methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | The Long Range Arena (LRA) benchmark... This benchmark highlights that standard sequence models, such as transformers, perform poorly even on seemingly simple long-range tasks. As modern deep learning heavily relies on transformers, understanding why transformers do not perform well on these tasks, or how to improve those abilities is an essential research topic. |
| Dataset Splits | Yes | (i) We observe that when training vanilla transformers (equipped with positional encoding) on the LRA benchmarks including the validation set, large transformers can achieve near 100% accuracy, illustrating their capability to shatter the LRA benchmarks. |
| Hardware Specification | Yes | The experiments were executed on a single V100 GPU, each running for a maximum duration of two days. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and building its repository 'upon the existing S4 repository', but it does not specify version numbers for PyTorch or any other software libraries or dependencies, which would be needed for a reproducible software description. |
| Experiment Setup | Yes | The experimental setup remains consistent across all subsections and is described in detail in Appendix A. Additional experiments are introduced in the appendix... Our training procedure and hyperparameters remained aligned with the configurations pre-specified in the S4 repository for analogous tasks. Exceptions include modifications aimed at saving computational resources, such as reducing model width, decreasing the number of epochs, and adjusting batch size, which were not optimized. The learning rate was determined through a grid search over [1e-3, 1e-4] together with the original learning rate from the S4 repository for the corresponding task, and dropout was set to 0 in all experiments. The hyperparameters of the LaS attention layer are: (i) the value of B, which controls the values of αc, and (ii) the window size of the 1-D average pooling layer in the smooth operator, denoted by P. Hyperparameter tuning was executed via grid search on the following grid: B ∈ {0.0001, 0.001}, P ∈ {3, 5}. The final set of hyperparameters for each task is presented in Table 9. (The grid is sketched in code below this table.) |
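The hyperparameter sweep quoted in the Experiment Setup row can be written out as a small grid search. The sketch below is illustrative only: the `train_and_evaluate` function, the placeholder S4-repository learning rate, and the accuracy-based selection are assumptions, not the authors' actual tuning script; only the grid values (B, P, learning rates, dropout = 0) come from the quoted text.

```python
import itertools

# Grid quoted in the Experiment Setup row:
# B controls the decay values alpha_c of the LaS attention layer,
# P is the window size of the 1-D average pooling in the smooth operator.
B_VALUES = [0.0001, 0.001]
P_VALUES = [3, 5]

# Learning rates: the two grid values plus the rate pre-specified in the
# S4 repository for the corresponding task (placeholder value here).
S4_REPO_LR = 4e-3  # assumption: task-dependent default taken from the S4 configs
LEARNING_RATES = [1e-3, 1e-4, S4_REPO_LR]

DROPOUT = 0.0  # dropout was fixed to 0 in all experiments


def train_and_evaluate(B, P, lr, dropout):
    """Hypothetical stand-in for one training run on an LRA task.

    In practice this would launch a run of the model with the given
    hyperparameters and return accuracy on the task's validation split.
    """
    return 0.0  # placeholder so the sweep loop is runnable as-is


best = None
for B, P, lr in itertools.product(B_VALUES, P_VALUES, LEARNING_RATES):
    acc = train_and_evaluate(B=B, P=P, lr=lr, dropout=DROPOUT)
    if best is None or acc > best[0]:
        best = (acc, {"B": B, "P": P, "lr": lr})

print("Best configuration:", best)
```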