Graph Convolutions Enrich the Self-Attention in Transformers!

Authors: Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
Researcher Affiliation | Academia | Jeongwhan Choi (Yonsei University, jeongwhan.choi@yonsei.ac.kr); Hyowon Wi (KAIST, hyowon.wi@kaist.ac.kr); Jayoung Kim (KAIST, jayoung.kim@kaist.ac.kr); Yehjin Shin (KAIST, yehjin.shin@kaist.ac.kr); Kookjin Lee (Arizona State University, kookjin.lee@asu.edu); Nathaniel Trask (University of Pennsylvania, ntrask@seas.upenn.edu); Noseong Park (KAIST, noseong@kaist.ac.kr)
Pseudocode | Yes | For pseudocode, see Appendix I. ... Algorithm 1: PyTorch-style pseudocode for GFSA. (A hedged sketch of the graph-filter idea is given below the table.)
Open Source Code | Yes | The source code of GFSA is available at: https://github.com/jeongwhanchoi/GFSA
Open Datasets | Yes | We evaluate them on the GLUE benchmark... We finetune GPT2 [61] on the following 3 datasets: Penn Treebank (PTB) [47], WikiText-2, and WikiText-103 [50]. We choose DeiT [74], CaiT [75], and Swin [45] as the backbone... We use datasets from the Long-Range Graph Benchmark (LRGB) [21] (e.g., Peptides-func and Peptides-struct), Benchmarking GNNs [22] (e.g., ZINC, MNIST, CIFAR10), the Open Graph Benchmark (OGB) dataset [32] (e.g., Molhiv and MolTox21), and the OGB-LSC dataset (i.e., PCQM4M-LSC) [33]. We conduct automatic speech recognition (ASR) experiments on the LibriSpeech dataset [55]... We use the Devign dataset provided by Zhou et al. [105]. We experiment with the Java data provided by Wang et al. [83]. (A loading example for the public benchmarks follows the table.)
Dataset Splits | Yes | For the MNLI task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) versions. ... The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all parameters and select the best models.
Hardware Specification | Yes | All models are trained on a single NVIDIA RTX A5000 24GB GPU. ... All experiments are conducted on a single NVIDIA RTX 3090 24GB GPU. ... All models are trained on NVIDIA RTX 3090 24GB. ... All models are trained on 4 NVIDIA RTX A6000 48GB GPUs. ... All models are trained on a single NVIDIA RTX A6000 48GB GPU.
Software Dependencies | No | For implementation, we adopt the Hugging Face framework. ... Our code is implemented based on the timm library [86]. ... For implementation, we use the SpeechBrain [65] framework. ... We build our experiments on top of the open-sourced code and recipes provided by Wang et al. [84]. The paper mentions software tools and frameworks but does not provide specific version numbers for them.
Experiment Setup | Yes | We trained all models for 5 epochs with a batch size of 32. Linear learning rate decay is used, and the initial learning rate is set to 2e-5. We use the AdamW [46] optimizer, and weight decay is set to 0. ... We finetune GPT2 with a batch size of 4, a learning rate of 5e-5, and linear learning rate decay, using the AdamW [46] optimizer. We also apply dropout with probability 0.1. ... We set the dropout rate to 0 and 0.2 for the 12-layer and 24-layer DeiT, respectively. ... We train the pure Transformer for 100 epochs and the Branchformer for 120 epochs with a batch size of 16. We apply SpecAugment [56] data augmentation to all models. ... We use the AdamW [46] optimizer with coefficients of 0.9 and 0.999 for the running averages of the gradient and its square, and use Mean Absolute Error (MAE) as the loss function. We use polynomial learning rate decay, with the initial learning rate set to 2e-4 and the end learning rate set to 1e-9. For ZINC, we set the batch size to 256, max epochs to 10k, and warm-up steps to 40k. (A minimal optimizer/scheduler sketch for the GLUE recipe follows the table.)
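
The Appendix I pseudocode describes GFSA as a graph-filter generalization of self-attention: the row-stochastic attention matrix is treated as the adjacency matrix of a graph, and the usual product with V is replaced by a low-order matrix polynomial whose high-order power is approximated by a cheaper first-order term. The PyTorch sketch below only illustrates that idea under those assumptions; the function name gfsa_attention, the fixed coefficients, and the exact approximation are placeholders, not the authors' Algorithm 1 (see the released code for the definitive version).

    import torch
    import torch.nn.functional as F

    def gfsa_attention(q, k, v, w0=0.3, w1=1.0, wK=0.3, K=3):
        """Sketch of a graph-filter-style self-attention (hypothetical coefficients)."""
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5          # (..., n, n)
        A = F.softmax(scores, dim=-1)                         # standard attention matrix
        I = torch.eye(A.size(-1), device=A.device, dtype=A.dtype).expand_as(A)

        # Approximate the K-th power as A + (K - 1) * (A @ A - A)
        # instead of multiplying A by itself K times.
        A_K_approx = A + (K - 1) * (A @ A - A)

        # Polynomial graph filter applied to the values: w0*I + w1*A + wK*A^K.
        H = w0 * I + w1 * A + wK * A_K_approx
        return H @ v

In the method itself the filter coefficients are learned rather than fixed constants; they are hard-coded scalars here purely for illustration.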
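
All of the benchmarks named in the Open Datasets row are publicly downloadable. As a quick illustration (not taken from the paper's scripts), the GLUE/MNLI and WikiText-2 corpora can be pulled with the Hugging Face datasets library and the graph-level Molhiv benchmark with the ogb package:

    from datasets import load_dataset
    from ogb.graphproppred import GraphPropPredDataset

    # GLUE task used for the matched/mismatched MNLI evaluation.
    mnli = load_dataset("glue", "mnli")

    # Language-modeling corpus used to finetune GPT-2.
    wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")

    # Graph-level OGB benchmark (downloads on first use).
    molhiv = GraphPropPredDataset(name="ogbg-molhiv")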
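
As a concrete illustration of the quoted GLUE fine-tuning recipe (5 epochs, batch size 32, AdamW with zero weight decay, initial learning rate 2e-5, linear decay), a minimal optimizer/scheduler setup in the Hugging Face/PyTorch stack might look like the following; build_optimizer_and_scheduler and steps_per_epoch are illustrative names, not taken from the released code.

    import torch
    from transformers import get_linear_schedule_with_warmup

    # Values copied from the GLUE setup quoted above; batch size 32 is
    # applied when building the DataLoader (not shown here).
    EPOCHS, INIT_LR, WEIGHT_DECAY = 5, 2e-5, 0.0

    def build_optimizer_and_scheduler(model, steps_per_epoch):
        optimizer = torch.optim.AdamW(
            model.parameters(), lr=INIT_LR, weight_decay=WEIGHT_DECAY
        )
        # Linear decay from the initial learning rate to zero, no warmup.
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=0, num_training_steps=EPOCHS * steps_per_epoch
        )
        return optimizer, scheduler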