Diffuser: Efficient Transformers with Multi-Hop Attention Diffusion for Long Sequences

Authors: Aosong Feng, Irene Li, Yuang Jiang, Rex Ying

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67× memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects.
Researcher Affiliation | Academia | Aosong Feng, Irene Li, Yuang Jiang, Rex Ying; Yale University, New Haven, CT, USA; aosong.feng@yale.edu, irene.li@yale.edu, yuang.jiang@yale.edu, rex.ying@yale.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks, only descriptive text and mathematical equations for the methods.
Open Source Code | No | The paper states, 'We implement Diffuser using the graph library DGL', but it does not provide concrete access to its own source code (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials).
Open Datasets | Yes | We pretrain the model with three standard datasets (detailed in Appendix) and evaluate the pretraining performance with bits per character (BPC) as in Zaheer et al. (2020).
Dataset Splits | Yes | We randomly split 8/1/1 as train/dev/test sets for both datasets (statistics detailed in Appendix). (A hedged sketch of such a split, together with the BPC metric, follows the table.)
Hardware Specification | No | The paper discusses 'GPU memory usage' and 'seconds/iteration' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions 'PyTorch' and 'DGL' but does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments.
Experiment Setup | Yes | The training is conducted with the maximum sequence length of 4,096 and linear warmup from the RoBERTa checkpoint. The detailed experimental settings, hyperparameters and baseline setup are discussed in Appendix. (A hedged warmup-schedule sketch follows the table.)
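
The 8/1/1 random split and the bits-per-character metric quoted in the Open Datasets and Dataset Splits rows can be illustrated with a minimal sketch. This is not the authors' code: split_811 and bits_per_character are hypothetical helpers, and the paper's actual preprocessing is described only in its Appendix.

```python
import math
import random

def split_811(examples, seed=0):
    """Hypothetical helper: randomly split a dataset 8/1/1 into train/dev/test."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_train = int(0.8 * len(idx))
    n_dev = int(0.1 * len(idx))
    train = [examples[i] for i in idx[:n_train]]
    dev = [examples[i] for i in idx[n_train:n_train + n_dev]]
    test = [examples[i] for i in idx[n_train + n_dev:]]
    return train, dev, test

def bits_per_character(nll_nats_per_char):
    """Convert an average character-level negative log-likelihood (in nats) to BPC."""
    return nll_nats_per_char / math.log(2)
```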
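
The Experiment Setup row (4,096-token sequences with linear warmup from a RoBERTa checkpoint) can likewise be sketched with the Hugging Face transformers and PyTorch APIs. This is an assumption-laden illustration, not the paper's training script: the learning rate and the warmup/total step counts are invented placeholders, and Diffuser itself would copy the RoBERTa weights into its own long-sequence architecture rather than use the stock encoder, whose positional embeddings only cover 512 tokens.

```python
import torch
from transformers import (RobertaForSequenceClassification,
                          get_linear_schedule_with_warmup)

MAX_SEQ_LEN = 4096      # maximum sequence length used for training (from the paper)
WARMUP_STEPS = 1000     # assumed placeholder; not stated in the quoted text
TOTAL_STEPS = 20000     # assumed placeholder; not stated in the quoted text

# Warm start from the public RoBERTa checkpoint; a long-sequence model would
# additionally extend the positional embeddings from 512 to MAX_SEQ_LEN.
model = RobertaForSequenceClassification.from_pretrained("roberta-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # lr is an assumed placeholder
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=TOTAL_STEPS,
)

# Inside the training loop, calling optimizer.step() followed by scheduler.step()
# applies the linear warmup / linear decay schedule.
```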