Long Range Arena: A Benchmark for Efficient Transformers

Authors: Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "3 EXPERIMENTAL RESULTS. Table 1: Experimental results on Long-Range Arena benchmark."
Researcher Affiliation | Industry | "1Google Research 2Google DeepMind {yitay, dehghani}@google.com"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | "Our framework, which we plan to open source, is written in JAX/FLAX."
Open Datasets | Yes | "We use the IMDb reviews (Maas et al., 2011) dataset, which is a commonly used dataset to benchmark document classification." "We use the ACL Anthology Network (AAN; Radev et al., 2013) dataset." "In LRA, we use the CIFAR-10 dataset (Krizhevsky, 2009) for the image classification task." (See the loading sketch below.)
Dataset Splits | Yes | "averaged over 1K random samples from the validation set."
Hardware Specification | Yes | "Benchmarks are run on 4x4 TPU V3 Chips." "We conduct experiments on 4x4 TPU V3 Chips."
Software Dependencies | No | "Our framework, which we plan to open source, is written in JAX/FLAX." "We implement our benchmark (which includes the task, evaluators, and models) in Python 3 and Jax/Flax." No specific version numbers for JAX/FLAX or other libraries are provided.
Experiment Setup | Yes | "All our xformer models have an embedding dimension of 512, 8 heads, 6 layers and a feed-forward dimensions of 2048. We train all models for 5K steps." "All xformer models are parameterized by the same number of layers, heads and hidden dimensions, namely 8 heads, 512 hidden dimensions and d = 2048 for positional FFN layers. We use 6 layers for all xformers. The learning rate is 0.05 with weight decay of 0.1. We use Adam with warmup. All models are trained for 20K steps and a batch size of 32." (The 5K- and 20K-step figures are quoted from different task setups; see the config sketch below.)
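For reference, the open datasets named above are all publicly distributed. Below is a minimal loading sketch, an illustration rather than the authors' released pipeline; the paper does not state that tensorflow_datasets was used, and the AAN corpus is not in TFDS and has to be obtained from its own distribution.

```python
# Minimal sketch: fetching two of the public datasets named in the paper
# via tensorflow_datasets. Illustrative only; the paper does not state
# that TFDS was part of its pipeline.
import tensorflow_datasets as tfds

# IMDb reviews (Maas et al., 2011), used for byte-level text classification.
imdb_train, imdb_test = tfds.load(
    "imdb_reviews", split=["train", "test"], as_supervised=True
)

# CIFAR-10 (Krizhevsky, 2009), used for the pixel-sequence image task.
cifar_train, cifar_test = tfds.load(
    "cifar10", split=["train", "test"], as_supervised=True
)

# The AAN corpus (Radev et al., 2013) is distributed separately and is
# not available through TFDS.
```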
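The shared model and optimizer settings in the Experiment Setup row can be collected into a small config. The sketch below uses optax, which pairs with the paper's JAX/Flax stack; the paper says only "Adam with warmup", so the warmup length, the linear schedule shape, and the use of decoupled weight decay (adamw) are all assumptions.

```python
# Minimal sketch, not the authors' code: the shared xformer hyperparameters
# and an Adam-with-warmup optimizer expressed with optax.
import optax

config = dict(
    emb_dim=512,         # embedding dimension
    num_heads=8,         # attention heads
    num_layers=6,        # transformer layers
    mlp_dim=2048,        # positional FFN dimension (d = 2048)
    batch_size=32,
    train_steps=20_000,  # 20K steps (some task setups are quoted at 5K)
)

# Linear warmup from 0 to the peak learning rate of 0.05, then held
# constant. The 1K-step warmup length is an assumption; the paper does
# not quote one.
schedule = optax.linear_schedule(
    init_value=0.0, end_value=0.05, transition_steps=1_000
)

# Decoupled weight decay (adamw) is one reading of "Adam ... with weight
# decay of 0.1".
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.1)
```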