Diffuser: Efficient Transformers with Multi-Hop Attention Diffusion for Long Sequences
Authors: Aosong Feng, Irene Li, Yuang Jiang, Rex Ying
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67× memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects. |
| Researcher Affiliation | Academia | Aosong Feng, Irene Li, Yuang Jiang, Rex Ying Yale University, New Haven, CT, USA aosong.feng@yale.edu, irene.li@yale.edu, yuang.jiang@yale.edu, rex.ying@yale.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; the methods are presented only as descriptive text and mathematical equations. |
| Open Source Code | No | The paper states, 'We implement Diffuser using the graph library DGL', but it does not provide concrete access to its own source code (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials). |
| Open Datasets | Yes | We pretrain the model with three standard datasets (detailed in Appendix) and evaluate the pretraining performance with bits per character (BPC) as in Zaheer et al. (2020). |
| Dataset Splits | Yes | We randomly split 8/1/1 as train/dev/test sets for both datasets (statistics detailed in Appendix). |
| Hardware Specification | No | The paper discusses 'GPU memory usage' and 'seconds/iteration' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'DGL' but does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | The training is conducted with the maximum sequence length of 4,096 and linear warmup from the RoBERTa checkpoint. The detailed experimental settings, hyperparameters and baseline setup are discussed in Appendix. |
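
The Dataset Splits row above quotes an 8/1/1 random train/dev/test split but names no tooling. A minimal sketch of such a split, assuming plain Python with a fixed seed (both the helper name and the seed are illustrative, not from the paper):

```python
import random

def split_8_1_1(examples, seed=42):
    """Randomly shuffle and split examples into 80/10/10 train/dev/test."""
    rng = random.Random(seed)          # fixed seed is an assumption for reproducibility
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_dev = int(0.1 * n)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

# Example: 1,000 dummy documents -> roughly 800/100/100
train, dev, test = split_8_1_1(range(1000))
print(len(train), len(dev), len(test))
```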
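The Experiment Setup row reports a 4,096-token maximum sequence length and linear warmup from a RoBERTa checkpoint, with the remaining hyperparameters deferred to the paper's Appendix. A minimal PyTorch sketch of a linear warmup-then-decay schedule; the warmup length, total steps, peak learning rate, and placeholder model are assumptions chosen here for illustration only:

```python
import torch

MAX_SEQ_LEN = 4096      # maximum sequence length reported in the paper
WARMUP_STEPS = 1_000    # assumed; not specified in the paper body
TOTAL_STEPS = 10_000    # assumed; not specified in the paper body

model = torch.nn.Linear(768, 768)  # placeholder standing in for the Diffuser encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # peak LR assumed

def linear_warmup_decay(step):
    """Ramp the LR linearly for WARMUP_STEPS, then decay linearly to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

for step in range(TOTAL_STEPS):
    optimizer.step()    # forward/backward passes omitted in this sketch
    scheduler.step()
```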