Relative Positional Encoding for Transformers with Linear Complexity
Authors: Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang, Gaël Richard
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation. We study the impact of SPE on performance on the Long Range Arena benchmark (Tay et al., 2021) and two music generation tasks. Our results demonstrate better validation losses and extrapolation ability. We evaluate the proposed method in the Long-Range Arena (LRA; Tay et al., 2021), a benchmark for efficient Transformers, consisting of sequence classification tasks with a focus on long-range dependencies. The results of the benchmark are given in Table 1. |
| Researcher Affiliation | Collaboration | 1Inria, Zenith Team, UMR LIRMM, Univ. Montpellier, France 2LTCI, Télécom Paris, Institut Polytechnique de Paris, France 3Research Center for IT Innovation, Academia Sinica, Taiwan 4National Taiwan University, Taiwan 5Taiwan AI Labs, Taiwan 6INRIA, Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France. |
| Pseudocode | Yes | Algorithm 1 Stochastic Positional Encoding. Input: position kernel P(m, n), number of replicas R, initial M × D and N × D queries Q and keys K. (A hedged sketch of this procedure is given after the table.) |
| Open Source Code | Yes | We provide additional resources on our companion website (https://cifkao.github.io/spe/), including Python implementations of SPE for PyTorch and JAX/Flax. |
| Open Datasets | Yes | We evaluate the proposed method in the Long-Range Arena (LRA; Tay et al., 2021)... We use the following tasks from this benchmark: ListOps... Text: movie review sentiment analysis on the IMDB corpus (Maas et al., 2011); Retrieval: article similarity classification on the All About NLP (AAN) corpus (Radev et al., 2013); Image: object recognition on the CIFAR10 dataset (Krizhevsky, 2009)... We train Performers for music generation... on a dataset composed of 1,747 pop piano tracks, encoded using the recently proposed Revamped MIDI-derived format (REMI; Huang & Yang, 2020). |
| Dataset Splits | Yes | We hold out 5% of the songs as the validation set. We adopt the configuration of Tay et al., only changing the PE and the batch sizes/learning rates to allow training on limited hardware with similar results. All other hyperparameters are kept identical to the original LRA. We display validation cross-entropy computed with teacher forcing (Williams & Zipser, 1989) in Figure 3, as a function of the target token position. |
| Hardware Specification | No | The paper mentions training "on limited hardware" but does not specify any particular GPU models, CPU types, or other hardware components used for the experiments. |
| Software Dependencies | No | The paper mentions "Python implementations of SPE for PyTorch and JAX/Flax" but does not provide specific version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We adopt the configuration of Tay et al., only changing the PE and the batch sizes/learning rates to allow training on limited hardware with similar results. All other hyperparameters are kept identical to the original LRA. We train Performers for music generation, with 24 layers and 8 heads per layer on a dataset... The sequences are composed of metrical tokens: bar, subbeat, and tempo... We train the models with sequence length N = 2048... The models (24-layer Performers with 8 attention heads) are trained on an accompaniment dataset... training sequences of length N = 512... (These hyperparameters are collected in the configuration sketch after the table.) |
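
Algorithm 1 (quoted in the Pseudocode row) constructs random positional replicas Q̄, K̄ whose expected inner product reproduces the relative positional kernel P, so that position-aware attention can then be computed by any linear-complexity mechanism such as the Performer. Below is a minimal NumPy sketch of the sinusoidal variant for illustration only; the function names (`sine_spe`, `apply_spe`) and the random placeholder parameters are our own assumptions, not the authors' published API, and in the actual model the frequencies, phases, and gains are learned.

```python
import numpy as np

def sine_spe(num_pos, num_feats, num_sines=5, num_realizations=64, rng=None):
    """Sketch of sinusoidal SPE: returns qbar, kbar of shape (M, D, R) such
    that E[sum_r qbar[m, d, r] * kbar[n, d, r]] is a stationary kernel P_d(m - n)."""
    rng = np.random.default_rng() if rng is None else rng
    # Learned parameters in the real model; random placeholders here (assumption).
    freqs = rng.uniform(0.0, 0.5, (num_feats, num_sines))         # f_k
    phases = rng.uniform(0.0, 2 * np.pi, (num_feats, num_sines))  # theta_k
    gains = rng.uniform(0.5, 1.0, (num_feats, num_sines))         # lambda_k

    pos = np.arange(num_pos)[:, None, None]               # (M, 1, 1)
    ang_q = 2 * np.pi * freqs[None] * pos + phases[None]  # queries carry a phase offset
    ang_k = 2 * np.pi * freqs[None] * pos                 # keys do not
    # Stack cos/sin components into (M, D, 2K) modulation matrices, so that
    # omega_q(m) . omega_k(n) = sum_k lambda_k^2 cos(2 pi f_k (m - n) + theta_k).
    omega_q = np.concatenate([gains[None] * np.cos(ang_q),
                              gains[None] * np.sin(ang_q)], axis=-1)
    omega_k = np.concatenate([gains[None] * np.cos(ang_k),
                              gains[None] * np.sin(ang_k)], axis=-1)

    # Shared Gaussian noise couples queries and keys: E[z @ z.T] = I.
    z = rng.standard_normal((num_feats, 2 * num_sines, num_realizations))
    z /= np.sqrt(num_realizations)
    qbar = np.einsum('mdk,dkr->mdr', omega_q, z)
    kbar = np.einsum('ndk,dkr->ndr', omega_k, z)
    return qbar, kbar

def apply_spe(q, k, qbar, kbar):
    """Combine content queries/keys (M, D) and (N, D) with the positional
    replicas, so that qhat @ khat.T approximates
    sum_d Q[m, d] * K[n, d] * P_d(m - n) in expectation."""
    m, d, r = qbar.shape
    qhat = (q[:, :, None] * qbar).reshape(m, d * r)
    khat = (k[:, :, None] * kbar).reshape(k.shape[0], d * r)
    return qhat, khat
```

The resulting `qhat`/`khat` can replace the original queries and keys in a kernelized attention, which is what keeps the overall complexity linear in the sequence length.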
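
For reference, the music-generation hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is purely a convenience sketch: the class and field names are hypothetical, and only the values come from the quoted text.

```python
from dataclasses import dataclass

@dataclass
class MusicGenConfig:
    """Values quoted from the paper; names are our own (hypothetical)."""
    num_layers: int = 24              # 24-layer Performer
    num_heads: int = 8                # 8 attention heads per layer
    pop_piano_seq_len: int = 2048     # N = 2048 (pop piano, REMI tokens)
    accompaniment_seq_len: int = 512  # N = 512 (accompaniment dataset)
    validation_split: float = 0.05    # 5% of songs held out for validation
```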