CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling

Authors: Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, Lingpeng Kong

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling. (An illustrative efficiency-length helper is sketched after this table.)
Researcher Affiliation | Academia | Shanghai AI Laboratory; Shanghai Jiao Tong University; The University of Hong Kong. Correspondence to: Jiangtao Feng <jiangtaofeng0906@gmail.com>, Lingpeng Kong <lpk@cs.hku.hk>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | CAB and all the related codes will be released at https://github.com/Shark-NLP/CAB.
Open Datasets | Yes | The CAB incorporates the LJSpeech dataset (Ito, 2017)... We consider multi-document summarization task and use Multi-News datasets (Fabbri et al., 2019)... This task evaluates models on three datasets, including Electricity Transformer Temperature (ETT), Electricity Consuming Load (ECL), and Weather (Zhou et al., 2021a)... CAB adopts PCN dataset (Griffiths & Boehm, 2019)... We use PG-19 dataset (Rae et al., 2019)... we train the backbone model SR3 on Flickr-Faces-HQ (FFHQ) dataset (Karras et al., 2019) and conduct evaluation on CelebA-HQ dataset (Karras et al., 2018).
Dataset Splits | No | The paper describes the datasets used and some training parameters, but does not explicitly provide percentages, counts, or references to predefined train/validation/test splits. For instance, in Section 3.1, data lengths are reported, but the partitioning strategy is not detailed.
Hardware Specification | Yes | The experiment is performed on a single A100 GPU, where attention mechanisms are fed a set of dummy sequences with lengths of {256, 512, 1024, 2048, 4096, 8192}. (A timing sketch of this setup follows the table.)
Software Dependencies | No | The paper mentions software like FAIRSEQ and PARAGEN, and specific libraries like PyTorch's multi-head attention, but does not provide version numbers for these dependencies (e.g., 'PyTorch 1.9' or 'FAIRSEQ 0.10.0').
Experiment Setup | Yes | Hyperparameters for all tasks are shown in Table 4. We also report the hyperparameters of efficient attentions in Table 5. Table 4 lists 'Batch Size', 'Number of Steps', 'Warmup Steps', 'Peak Learning Rate', 'Scheduler', 'Optimizer', 'Clip Norm', 'Attention Dropout', 'Weight Decay', 'Tokens per Batch'. (A hypothetical configuration sketch of these fields follows the table.)
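
To make the hardware/efficiency setup quoted above concrete, here is a minimal sketch (not the authors' released CAB benchmark code) of how one might profile an attention module on dummy sequences of the quoted lengths on a single GPU. The choice of module (PyTorch's nn.MultiheadAttention), the model dimension, batch size, and iteration counts below are assumptions for illustration only.

```python
import torch
import torch.nn as nn


def profile_attention(attn: nn.Module, lengths, d_model=512, batch_size=1,
                      device="cuda", warmup=5, iters=20):
    """Return {sequence length: (ms per forward call, peak GPU memory in MB)}
    measured on dummy inputs of shape (batch_size, length, d_model)."""
    attn = attn.to(device).eval()
    results = {}
    for length in lengths:
        x = torch.randn(batch_size, length, d_model, device=device)
        torch.cuda.reset_peak_memory_stats(device)
        with torch.no_grad():
            # Warm-up calls so kernel launches and caches do not skew timing.
            for _ in range(warmup):
                attn(x, x, x, need_weights=False)
            torch.cuda.synchronize(device)
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(iters):
                attn(x, x, x, need_weights=False)
            end.record()
            torch.cuda.synchronize(device)
        ms_per_call = start.elapsed_time(end) / iters
        peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
        results[length] = (ms_per_call, peak_mb)
    return results


if __name__ == "__main__":
    # Vanilla multi-head attention as the baseline; efficient variants would
    # be profiled the same way and compared against these numbers.
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
    print(profile_attention(mha, [256, 512, 1024, 2048, 4096, 8192]))
```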
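The "efficiency length against vanilla attention" mentioned in the Research Type row can be read, in simplified form, as the shortest sequence length at which an efficient attention overtakes vanilla attention. The helper below is only an illustrative approximation over a discrete grid of measured lengths; the paper may derive its efficiency length from fitted runtime curves rather than raw grid points.

```python
def efficiency_length(vanilla_ms, efficient_ms):
    """Given two dicts mapping sequence length -> latency in ms (e.g. the
    first element of profile_attention's output), return the smallest common
    length at which the efficient variant is faster than vanilla attention,
    or None if it never wins within the measured range."""
    for length in sorted(set(vanilla_ms) & set(efficient_ms)):
        if efficient_ms[length] < vanilla_ms[length]:
            return length
    return None
```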
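Finally, the hyperparameter fields quoted from Table 4 could be organized as a simple configuration object. The field names below follow the quoted table headers; every default value and string is a placeholder assumption, not a setting reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskHyperparams:
    """One record per task, mirroring the Table 4 columns quoted above.
    All defaults are hypothetical placeholders, not the paper's values."""
    batch_size: int = 32
    num_steps: int = 100_000
    warmup_steps: int = 4_000
    peak_learning_rate: float = 5e-4
    scheduler: str = "inverse_sqrt"   # assumed scheduler name
    optimizer: str = "adam"           # assumed optimizer name
    clip_norm: float = 1.0
    attention_dropout: float = 0.1
    weight_decay: float = 0.01
    tokens_per_batch: int = 8192
```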