Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Graph Convolutions Enrich the Self-Attention in Transformers!

Authors: Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
Researcher Affiliation | Academia | Jeongwhan Choi (Yonsei University), Hyowon Wi (KAIST), Jayoung Kim (KAIST), Yehjin Shin (KAIST), Kookjin Lee (Arizona State University), Nathaniel Trask (University of Pennsylvania), Noseong Park (KAIST)
Pseudocode | Yes | For pseudocode, see Appendix I. ... Algorithm 1: PyTorch-style pseudocode for GFSA
Open Source Code | Yes | The source code of GFSA is available at: https://github.com/jeongwhanchoi/GFSA
Open Datasets | Yes | We evaluate them on the GLUE benchmark... We finetune GPT2 [61] on the following 3 datasets: Penn Treebank (PTB) [47], WikiText-2, and WikiText-103 [50]. We choose DeiT [74], CaiT [75], and Swin [45] as the backbone... We use datasets from the Long-Range Graph Benchmark (LRGB) [21] (e.g., Peptides-func and Peptides-struct), Benchmarking GNNs [22] (e.g., ZINC, MNIST, CIFAR10), the Open Graph Benchmark (OGB) dataset [32] (e.g., MolHIV and MolTox21), and the OGB-LSC dataset (i.e., PCQM4M-LSC) [33]. We conduct automatic speech recognition (ASR) experiments on the LibriSpeech dataset [55]... We use the Devign dataset provided by Zhou et al. [105]. We experiment with the Java data provided by Wang et al. [83].
Dataset Splits | Yes | For the MNLI task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) versions. ... The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all parameters and select the best models.
Hardware Specification | Yes | All models are trained on 1 NVIDIA RTX A5000 24GB GPU. ... All experiments are conducted on 1 NVIDIA RTX 3090 24GB GPU. ... All models are trained on an NVIDIA RTX 3090 24GB. ... All models are trained on 4 NVIDIA RTX A6000 48GB GPUs. ... All models are trained on 1 NVIDIA RTX A6000 48GB GPU.
Software Dependencies | No | For implementation, we adopt the Hugging Face framework. ... Our code is implemented based on the timm library [86]. ... For implementation, we use the SpeechBrain [65] framework. ... We build our experiments on top of the open-sourced code and recipes provided by Wang et al. [84]. The paper mentions software tools and frameworks but does not provide specific version numbers for them.
Experiment Setup | Yes | We train all models for 5 epochs with a batch size of 32. Linear learning rate decay is used, and the initial learning rate is set to 2e-5. We use the AdamW [46] optimizer, and weight decay is set to 0. ... We finetune GPT2 with a batch size of 4, a learning rate of 5e-5, and linear learning rate decay using the AdamW [46] optimizer. We also apply dropout with probability 0.1. ... We set the dropout rate to 0 and 0.2 for the 12-layer and 24-layer DeiT, respectively. ... We train the pure Transformer for 100 epochs and the Branchformer for 120 epochs with a batch size of 16. We apply SpecAugment [56] data augmentation to all models. ... We use the AdamW [46] optimizer with coefficients 0.9 and 0.999 for the running averages of the gradient and its square, and use Mean Absolute Error (MAE) as the loss function. We use polynomial learning rate decay, with the initial learning rate set to 2e-4 and the end learning rate set to 1e-9. For ZINC, we set the batch size to 256, max epochs to 10k, and warm-up steps to 40k.
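The paper's actual GFSA pseudocode is in its Appendix I and repository; as a rough illustration only, the core idea — treating the row-stochastic attention matrix as a graph adjacency and enriching it with an identity term plus a higher-order polynomial (graph filter) term — might be sketched as below. The function name, the scalar coefficients `w0`, `w1`, `wK` (learnable in the paper; fixed scalars here), and the first-order Taylor-style approximation of the K-th matrix power are assumptions of this sketch, not the authors' exact implementation.

```python
import torch


def gfsa_attention(attn: torch.Tensor, K: int = 3,
                   w0: float = 0.5, w1: float = 0.5, wK: float = 0.1) -> torch.Tensor:
    """Graph-filter-style enrichment of a self-attention matrix (sketch).

    attn: attention weights of shape (..., n, n), viewed as a graph
    adjacency. Returns w0*I + w1*attn + wK*attn_K, where attn_K is a
    cheap first-order approximation of attn^K.
    """
    n = attn.size(-1)
    I = torch.eye(n, dtype=attn.dtype, device=attn.device)
    # Approximate attn^K without K matrix multiplications:
    # attn^K ~ attn + (K - 1) * (attn @ attn - attn)
    attn_K = attn + (K - 1) * (attn @ attn - attn)
    return w0 * I + w1 * attn + wK * attn_K
```

In use, this filter would replace the softmax output before it is multiplied by the value matrix V; the rest of the Transformer block is unchanged.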
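The GLUE finetuning recipe quoted in the Experiment Setup row (AdamW, initial learning rate 2e-5, zero weight decay, linear decay) can be expressed compactly in PyTorch. This is a minimal sketch of that configuration, not the authors' training script: the model and `total_iters` are placeholders (in practice, total_iters would be epochs times steps per epoch).

```python
import torch

# Placeholder model; the paper finetunes pretrained Transformer backbones.
model = torch.nn.Linear(768, 2)

# AdamW with lr 2e-5 and weight decay 0, matching the quoted GLUE setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

# Linear decay from the initial lr to 0 over total_iters scheduler steps.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=100
)
```

Calling `scheduler.step()` once per training step walks the learning rate linearly from 2e-5 down to 0.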