What Dense Graph Do You Need for Self-Attention?

Authors: Yuxin Wang, Chu-Tak Lee, Qipeng Guo, Zhangyue Yin, Yunhua Zhou, Xuanjing Huang, Xipeng Qiu

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on tasks requiring various sequence lengths validate our graph function. Experiments are conducted to validate the theoretical results of Normalized Information Payload and then to show the performance of Hypercube Transformer.
Researcher Affiliation | Academia | School of Computer Science, Fudan University; Institute of Modern Languages and Linguistics, Fudan University; Peng Cheng Laboratory.
Pseudocode | Yes | Algorithm 1: Binary representation of sequences (an illustrative sketch of this binary-to-hypercube mapping follows the table).
Open Source Code | Yes | Code is available at https://github.com/yxzwang/Normalized-Information-Payload.
Open Datasets | Yes | We rely on the Long Range Arena (LRA) benchmark (Tay et al., 2020a) to validate our graph scoring function, NIP(G). We use English Wikipedia and Book Corpus (Gao et al., 2020) as our pretraining datasets.
Dataset Splits | No | The paper uses standard benchmarks (LRA, WikiText-103, GLUE) but does not explicitly state specific training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | Yes | All experiments are conducted on an RTX 3090 GPU.
Software Dependencies | No | The paper mentions using 'triton (Tillet et al., 2019)' but does not provide a specific version number for this or any other software dependency.
Experiment Setup | Yes | We put hyper-parameters for LRA here. We set the embedding hidden size to 64 and the hidden size for attention to 128. Dropout rate and weight decay are different for each task. Table 9: Hyper-parameters used for all models we trained on Long Range Arena. Table 11: Hyper-parameters used for Cube BERT128 pretraining. Table 12: Hyper-parameters used for downstream tasks. (A configuration sketch follows the table.)
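
The paper's Algorithm 1 builds binary representations of sequence positions, which is how tokens are placed on the vertices of a hypercube for Hypercube Transformer's attention pattern. The sketch below is a minimal illustration, not the authors' implementation: it assumes the standard hypercube construction in which two positions are connected exactly when their binary codes differ in one bit, and the function names (binary_codes, hypercube_neighbors) are ours.

```python
# Minimal sketch (not the authors' code): map sequence positions to binary
# codes and connect positions whose codes differ in exactly one bit, i.e. the
# hypercube graph that Hypercube Transformer attends over.
import math

def binary_codes(seq_len):
    """Return the binary representation (as a bit tuple) of each position."""
    n_bits = max(1, math.ceil(math.log2(seq_len)))
    return [tuple((i >> b) & 1 for b in range(n_bits)) for i in range(seq_len)]

def hypercube_neighbors(seq_len):
    """Adjacency list: positions are neighbors iff their codes differ in one bit."""
    codes = binary_codes(seq_len)
    adj = {i: [] for i in range(seq_len)}
    for i in range(seq_len):
        for j in range(i + 1, seq_len):
            if sum(a != b for a, b in zip(codes[i], codes[j])) == 1:
                adj[i].append(j)
                adj[j].append(i)
    return adj

# Example: an 8-token sequence lives on a 3-dimensional hypercube.
print(hypercube_neighbors(8)[0])  # position 0 connects to positions 1, 2, 4
```

For a sequence of length 2^n, each position has exactly n neighbors, so the neighborhood size grows only logarithmically with sequence length.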
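
The quoted LRA setup pins down only two sizes (embedding hidden size 64, attention hidden size 128) and defers per-task dropout and weight decay to the paper's Table 9. Below is a hedged configuration sketch recording just those stated values; the dictionary and its field names are our own illustration, not the authors' config format, and unspecified values are left as None rather than guessed.

```python
# Illustrative configuration sketch based only on the values quoted above;
# field names are ours, not the authors'.
lra_config = {
    "embedding_hidden_size": 64,   # stated in the paper's LRA setup
    "attention_hidden_size": 128,  # stated in the paper's LRA setup
    "dropout_rate": None,          # varies per LRA task (see the paper's Table 9)
    "weight_decay": None,          # varies per LRA task (see the paper's Table 9)
}
```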