Diffuser: Efficient Transformers with Multi-Hop Attention Diffusion for Long Sequences

Authors: Aosong Feng, Irene Li, Yuang Jiang, Rex Ying

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67× memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects.
Researcher Affiliation | Academia | Aosong Feng, Irene Li, Yuang Jiang, Rex Ying; Yale University, New Haven, CT, USA; aosong.feng@yale.edu, irene.li@yale.edu, yuang.jiang@yale.edu, rex.ying@yale.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks, only descriptive text and mathematical equations for the methods.
Open Source Code | No | The paper states, 'We implement Diffuser using the graph library DGL', but it does not provide concrete access to its own source code (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials).
Open Datasets | Yes | We pretrain the model with three standard datasets (detailed in Appendix) and evaluate the pretraining performance with bits per character (BPC) as in Zaheer et al. (2020).
Dataset Splits | Yes | We randomly split 8/1/1 as train/dev/test sets for both datasets (statistics detailed in Appendix). (A hedged sketch of such a split, together with the BPC metric, follows the table.)
Hardware Specification | No | The paper discusses 'GPU memory usage' and 'seconds/iteration' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions 'PyTorch' and 'DGL' but does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments.
Experiment Setup | Yes | The training is conducted with the maximum sequence length of 4,096 and linear warmup from the RoBERTa checkpoint. The detailed experimental settings, hyperparameters and baseline setup are discussed in Appendix. (A hedged warmup-schedule sketch follows the table.)
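
The 8/1/1 random split and the bits-per-character metric quoted in the Open Datasets and Dataset Splits rows can be illustrated with a minimal sketch. This is not the authors' code: split_811 and bits_per_character are hypothetical helpers, and the paper's actual preprocessing is described only in its Appendix.

```python
import math
import random

def split_811(examples, seed=0):
    """Hypothetical helper: randomly split a dataset 8/1/1 into train/dev/test."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_train = int(0.8 * len(idx))
    n_dev = int(0.1 * len(idx))
    train = [examples[i] for i in idx[:n_train]]
    dev = [examples[i] for i in idx[n_train:n_train + n_dev]]
    test = [examples[i] for i in idx[n_train + n_dev:]]
    return train, dev, test

def bits_per_character(nll_nats_per_char):
    """Convert an average character-level negative log-likelihood (in nats) to BPC."""
    return nll_nats_per_char / math.log(2)
```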
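
The Experiment Setup row (4,096-token sequences with linear warmup from a RoBERTa checkpoint) can likewise be sketched with the Hugging Face transformers and PyTorch APIs. This is an assumption-laden illustration, not the paper's training script: the learning rate and the warmup/total step counts are invented placeholders, and Diffuser itself would copy the RoBERTa weights into its own long-sequence architecture rather than use the stock encoder, whose positional embeddings only cover 512 tokens.

```python
import torch
from transformers import (RobertaForSequenceClassification,
                          get_linear_schedule_with_warmup)

MAX_SEQ_LEN = 4096      # maximum sequence length used for training (from the paper)
WARMUP_STEPS = 1000     # assumed placeholder; not stated in the quoted text
TOTAL_STEPS = 20000     # assumed placeholder; not stated in the quoted text

# Warm start from the public RoBERTa checkpoint; a long-sequence model would
# additionally extend the positional embeddings from 512 to MAX_SEQ_LEN.
model = RobertaForSequenceClassification.from_pretrained("roberta-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # lr is an assumed placeholder
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=TOTAL_STEPS,
)

# Inside the training loop, calling optimizer.step() followed by scheduler.step()
# applies the linear warmup / linear decay schedule.
```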