What Dense Graph Do You Need for Self-Attention?

Authors: Yuxin Wang, Chu-Tak Lee, Qipeng Guo, Zhangyue Yin, Yunhua Zhou, Xuanjing Huang, Xipeng Qiu

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on tasks requiring various sequence lengths validate our graph function. Experiments are conducted to validate the theoretical results of Normalized Information Payload and then to show the performance of Hypercube Transformer.
Researcher Affiliation | Academia | School of Computer Science, Fudan University; Institute of Modern Languages and Linguistics, Fudan University; Peng Cheng Laboratory.
Pseudocode | Yes | Algorithm 1: Binary representation of sequences (an illustrative sketch of this binary-to-hypercube mapping follows the table).
Open Source Code | Yes | Code is available at https://github.com/yxzwang/Normalized-Information-Payload.
Open Datasets | Yes | We rely on the Long Range Arena (LRA) benchmark (Tay et al., 2020a) to validate our graph scoring function, NIP(G). We use English Wikipedia and Book Corpus (Gao et al., 2020) as our pretraining datasets.
Dataset Splits | No | The paper uses standard benchmarks (LRA, WikiText-103, GLUE) but does not explicitly state specific training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | Yes | All experiments are conducted on an RTX 3090 GPU.
Software Dependencies | No | The paper mentions using 'triton (Tillet et al., 2019)' but does not provide a specific version number for this or any other software dependency.
Experiment Setup | Yes | We put hyper-parameters for LRA here. We set the embedding hidden size to 64 and the hidden size for attention to 128. Dropout rate and weight decay are different for each task. Table 9: Hyper-parameters used for all models we trained on Long Range Arena. Table 11: Hyper-parameters used for Cube BERT128 pretraining. Table 12: Hyper-parameters used for downstream tasks. (A configuration sketch follows the table.)
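
The paper's Algorithm 1 builds binary representations of sequence positions, which is how tokens are placed on the vertices of a hypercube for Hypercube Transformer's attention pattern. The sketch below is a minimal illustration, not the authors' implementation: it assumes the standard hypercube construction in which two positions are connected exactly when their binary codes differ in one bit, and the function names (binary_codes, hypercube_neighbors) are ours.

```python
# Minimal sketch (not the authors' code): map sequence positions to binary
# codes and connect positions whose codes differ in exactly one bit, i.e. the
# hypercube graph that Hypercube Transformer attends over.
import math

def binary_codes(seq_len):
    """Return the binary representation (as a bit tuple) of each position."""
    n_bits = max(1, math.ceil(math.log2(seq_len)))
    return [tuple((i >> b) & 1 for b in range(n_bits)) for i in range(seq_len)]

def hypercube_neighbors(seq_len):
    """Adjacency list: positions are neighbors iff their codes differ in one bit."""
    codes = binary_codes(seq_len)
    adj = {i: [] for i in range(seq_len)}
    for i in range(seq_len):
        for j in range(i + 1, seq_len):
            if sum(a != b for a, b in zip(codes[i], codes[j])) == 1:
                adj[i].append(j)
                adj[j].append(i)
    return adj

# Example: an 8-token sequence lives on a 3-dimensional hypercube.
print(hypercube_neighbors(8)[0])  # position 0 connects to positions 1, 2, 4
```

For a sequence of length 2^n, each position has exactly n neighbors, so the neighborhood size grows only logarithmically with sequence length.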
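
The quoted LRA setup pins down only two sizes (embedding hidden size 64, attention hidden size 128) and defers per-task dropout and weight decay to the paper's Table 9. Below is a hedged configuration sketch recording just those stated values; the dictionary and its field names are our own illustration, not the authors' config format, and unspecified values are left as None rather than guessed.

```python
# Illustrative configuration sketch based only on the values quoted above;
# field names are ours, not the authors'.
lra_config = {
    "embedding_hidden_size": 64,   # stated in the paper's LRA setup
    "attention_hidden_size": 128,  # stated in the paper's LRA setup
    "dropout_rate": None,          # varies per LRA task (see the paper's Table 9)
    "weight_decay": None,          # varies per LRA task (see the paper's Table 9)
}
```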