Your Transformer May Not be as Powerful as You Expect

Authors: Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically study the effectiveness of the proposed model. In particular, we aim at answering the following questions through experiments: Question 1: Can the theoretical results on the approximation capability of RPE-based Transformer and URPE-based Transformer be reflected in certain experiments? Question 2: With different RPE methods (the matrix B in Eq.(8)), can URPE-based Transformer outperform its RPE-based counterpart in real-world applications? Question 3: Can URPE-based Attention serve as a versatile module to improve the general Transformers beyond language tasks? We will answer each question with carefully designed experiments in the following sub-sections. (A minimal sketch of URPE-based attention is given after this table.)
Researcher Affiliation | Collaboration | (1) National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (2) Machine Learning Department, School of Computer Science, Carnegie Mellon University; (3) Microsoft Research; (4) Center for Data Science, Peking University; (5) Zhejiang Lab
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. Methods are described using mathematical equations and textual explanations.
Open Source Code | Yes | The code will be made publicly available at https://github.com/lsj2408/URPE.
Open Datasets | Yes | We conduct experiments on the WikiText-103 dataset [47]... ZINC from Benchmarking-GNNs [16] and PCQM4M from the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) [29]. (A dataset-loading sketch is given after this table.)
Dataset Splits | Yes | We show the perplexity scores on both the validation and test sets of different models in Table 1. ... Table 3: Results on PCQM4M from OGB-LSC. ... Valid MAE
Hardware Specification | Yes | We run profiling of all the models on a 16GB NVIDIA Tesla V100 GPU. (A rough profiling sketch is given after this table.)
Software Dependencies | No | The paper mentions that the code is implemented on top of Fairseq, Graphormer, and PyTorch, but it does not specify version numbers for these software components.
Experiment Setup | Yes | The number of layers and the number of attention heads are set to 3 and 12, respectively. The hidden dimension is set to 768. ... The number of layers and the number of attention heads are set to 16 and 10, respectively. The dimensions of hidden layers and feed-forward layers are set to 410 and 2100. ... The dimensions of hidden layers and feed-forward layers are set to 80. The number of attention heads is set to 32. ... our Graphormer with URPE-based Attention consists of 6 layers and 32 attention heads. The dimensions of hidden layers and feed-forward layers are set to 512. ... The number of layers and the hidden dimension are set to 12 and 768, respectively. The number of attention heads is set to 12. The batch size is set to 32. (These settings are collected in a short summary after this table.)
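To make the attention variants referred to in the Research Type row concrete, here is a minimal single-head PyTorch sketch of URPE-based attention, assuming the formulation softmax(QK^T / sqrt(d) + B) multiplied element-wise by a learnable Toeplitz matrix C, where B is an arbitrary RPE bias (the matrix B in Eq.(8)). The class and parameter names are illustrative and this is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class URPEAttention(nn.Module):
    """Single-head sketch: softmax(QK^T / sqrt(d) + B) is multiplied element-wise
    by a learnable Toeplitz matrix C with C[i, j] = c[i - j]."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.d = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # one learnable scalar per relative distance in [-(max_len - 1), max_len - 1]
        self.c = nn.Parameter(torch.ones(2 * max_len - 1))
        self.max_len = max_len

    def toeplitz(self, n: int) -> torch.Tensor:
        # gather c[i - j] into an n x n matrix
        idx = torch.arange(n, device=self.c.device)
        rel = idx[:, None] - idx[None, :] + self.max_len - 1
        return self.c[rel]

    def forward(self, x: torch.Tensor, rpe_bias: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d); rpe_bias: (n, n) relative positional bias, i.e. the matrix B
        n = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5 + rpe_bias, dim=-1)
        attn = attn * self.toeplitz(n)  # the extra multiplicative URPE term
        return attn @ v
```

With `self.c` fixed to all ones the multiplicative term is the identity and the module reduces to plain RPE-based attention, which makes for a convenient sanity check when comparing the two variants.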
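The graph datasets listed in the Open Datasets row are publicly downloadable. Below is a minimal loading sketch, assuming the `ogb` and `torch_geometric` packages (whose versions the paper does not pin) and their processing dependencies are installed; WikiText-103 is typically prepared with the standard Fairseq language-modeling preprocessing and is not shown here.

```python
from ogb.lsc import PygPCQM4MDataset       # PCQM4M from OGB-LSC
from torch_geometric.datasets import ZINC   # ZINC as packaged by Benchmarking-GNNs

# Quantum-chemistry regression graphs; the official split indices ship with the dataset.
pcqm4m = PygPCQM4MDataset(root="data/pcqm4m")
split_idx = pcqm4m.get_idx_split()          # dict with train / valid / test indices

# The 12k-graph ZINC subset commonly used for molecular-property regression.
zinc_train = ZINC(root="data/zinc", subset=True, split="train")
zinc_val = ZINC(root="data/zinc", subset=True, split="val")
zinc_test = ZINC(root="data/zinc", subset=True, split="test")
```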
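For the hardware claim (profiling on a 16GB Tesla V100), the following is a rough sketch of how such a measurement can be reproduced with stock PyTorch, assuming a CUDA device and a reasonably recent PyTorch release. The placeholder encoder only mirrors one of the reported model sizes and is not the authors' profiling script.

```python
import time

import torch
import torch.nn as nn

# Placeholder model roughly matching one reported configuration (3 layers, 12 heads, 768 dim).
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=3).cuda().eval()
batch = torch.randn(16, 512, 768, device="cuda")  # (batch, sequence length, hidden dim)

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(batch)
    torch.cuda.synchronize()

latency_ms = (time.time() - start) * 1000 / 100
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"{latency_ms:.1f} ms per forward pass, {peak_gb:.2f} GB peak memory")
```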
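Finally, the hyperparameters quoted in the Experiment Setup row are easier to compare when grouped per setting. The list below simply restates those numbers; the key names are ours, and only the Graphormer entry is explicitly named in the excerpt.

```python
# Restating the quoted hyperparameters; keys are illustrative labels, not the authors' config schema.
reported_settings = [
    {"layers": 3, "heads": 12, "hidden_dim": 768},
    {"layers": 16, "heads": 10, "hidden_dim": 410, "ffn_dim": 2100},
    {"heads": 32, "hidden_dim": 80, "ffn_dim": 80},
    {"layers": 6, "heads": 32, "hidden_dim": 512, "ffn_dim": 512, "model": "Graphormer + URPE"},
    {"layers": 12, "heads": 12, "hidden_dim": 768, "batch_size": 32},
]
```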