Are GATs Out of Balance?
Authors: Nimrah Mustafa, Aleksandar Bojchevski, Rebekka Burkholz
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study undertakes the Graph Attention Network (GAT), a popular GNN architecture in which a node's neighborhood aggregation is weighted by parameterized attention coefficients. We derive a conservation law of GAT gradient flow dynamics, which explains why a high portion of parameters in GATs with standard initialization struggle to change during training. This effect is amplified in deeper GATs, which perform significantly worse than their shallow counterparts. To alleviate this problem, we devise an initialization scheme that balances the GAT network. Our approach i) allows more effective propagation of gradients and in turn enables trainability of deeper networks, and ii) attains a considerable speedup in training and convergence time in comparison to the standard initialization. Our main theorem serves as a stepping stone to studying the learning dynamics of positive homogeneous models with attention mechanisms. Experiments on multiple benchmark datasets demonstrate that our proposal is effective in mitigating the highlighted trainability issues, as it leads to considerable training speed-ups and enables significant parameter changes across all layers. The main purpose of our experiments is to verify the validity of our theoretical insights and deduce an explanation for a major trainability issue of GATs that is amplified with increased network depth. |
| Researcher Affiliation | Academia | Nimrah Mustafa nimrah.mustafa@cispa.de Aleksandar Bojchevski a.bojchevski@uni-koeln.de Rebekka Burkholz burkholz@cispa.de CISPA Helmholtz Center for Information Security, 66123 Saarbrücken, Germany University of Cologne, 50923 Köln, Germany |
| Pseudocode | Yes | Procedure 2.6 (Balancing). Based on Eq. (5) from the norm preservation law 2.3, we note that in order to achieve balancedness (i.e. set c = 0 in Eq. (5)), the randomly initialized parameters $W^l$ and $a^l$ must satisfy the following equality for $l \in [L]$: $\|W^l[i,:]\|^2 + \|a^l[i]\|^2 = \|W^{l+1}[:,i]\|^2$. This can be achieved by scaling the randomly initialized weights as follows: 1. Set $a^l = 0$ for $l \in [L]$. 2. Set $W^1[i,:] = \frac{W^1[i,:]}{\|W^1[i,:]\|}\,\beta_i$ for $i \in [n_1]$, where $\beta_i$ is a hyperparameter. 3. Set $W^{l+1}[:,i] = \frac{W^{l+1}[:,i]}{\|W^{l+1}[:,i]\|}\,\|W^l[i,:]\|$ for $i \in [n_l]$ and $l \in [L-1]$. (A code sketch of this procedure is given below the table.) |
| Open Source Code | Yes | Our experimental code is available at https://github.com/RelationalML/GAT_Balanced_Initialization. |
| Open Datasets | Yes | We used nine common benchmark datasets for semi-supervised node classification tasks. We defer dataset details to the supplement. For the Planetoid datasets (Cora, Citeseer, and Pubmed) [56], we use the graphs provided by PyTorch Geometric (PyG)... The WebKB (Cornell, Texas, and Wisconsin), Wikipedia (Squirrel and Chameleon) and Actor datasets [39] are used from the replication package provided by [40], where duplicate edges are removed. (A minimal PyG loading sketch follows the table.) |
| Dataset Splits | Yes | We use the standard provided train/validation/test splits and have removed the isolated nodes from Citeseer. |
| Hardware Specification | Yes | We run our experiments on either Nvidia T4 Tensor Core GPU with 15 GB RAM or Nvidia GeForce RTX 3060 Laptop GPU with 6 GB RAM. |
| Software Dependencies | No | The paper mentions using the 'Pytorch Geometric framework' but does not specify a version number for it or any other software dependencies. Therefore, a reproducible description of ancillary software with specific version numbers is not provided. |
| Experiment Setup | Yes | For SGD, the learning rate is set to 0.1, 0.05 and 0.005 for L = [2, 5], L = [10, 20], and L = 40, respectively, which allows for reasonably stable training on Cora, Citeseer, and Pubmed. For the remaining datasets, we set the learning rate to 0.05, 0.01, 0.005 and 0.0005 for L = [2, 5], L = 10, L = 20, and L = 40, respectively. We allow each network to train, both with SGD and Adam, for 5000 epochs (unless it converges earlier, i.e. achieves training loss $\leq 10^{-4}$) and select the model state with the highest validation accuracy. All reported results use ReLU activation, weight sharing and no biases, unless stated otherwise. GAT width is 64. Adam learning rates: 0.005 for Cora and Citeseer, and 0.01 for Pubmed for the 2- and 5-layer networks. To allow stable training of deeper networks, we reduce the initial learning rate by a factor of 0.1 for the 10- and 20-layer networks on all three datasets. (A sketch of this training and model-selection loop follows the table.) |
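
The balancing procedure quoted in the Pseudocode row can be expressed as a short parameter-rescaling routine. The sketch below is our own reading of Procedure 2.6, not the authors' code: it assumes the weights are kept as a plain list of PyTorch tensors whose rows produce output features, and it uses a single scalar `beta` in place of the per-neuron hyperparameters $\beta_i$.

```python
import torch

@torch.no_grad()
def balanced_init(W, a, beta=1.0):
    """Sketch of the balancing procedure (our reading of Procedure 2.6).

    W    : list of weight tensors; W[l] has shape (out_l, in_l), so row i of
           W[l] produces the feature consumed by column i of W[l + 1].
    a    : list of attention parameter tensors, one per layer.
    beta : target row norm for the first layer (the paper allows per-row
           hyperparameters beta_i; a single scalar is used here for brevity).
    """
    # Step 1: zero all attention parameters.
    for a_l in a:
        a_l.zero_()
    # Step 2: normalize the rows of the first weight matrix to norm beta.
    W[0].div_(W[0].norm(dim=1, keepdim=True)).mul_(beta)
    # Step 3: rescale column i of W[l + 1] to the norm of row i of W[l], so
    # ||W^l[i,:]||^2 + ||a^l[i]||^2 = ||W^{l+1}[:,i]||^2 holds with a^l = 0.
    for l in range(len(W) - 1):
        col_norms = W[l + 1].norm(dim=0, keepdim=True)  # shape (1, out_l)
        row_norms = W[l].norm(dim=1).unsqueeze(0)       # shape (1, out_l)
        W[l + 1].mul_(row_norms / col_norms)
    return W, a
```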
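
The benchmarks named in the Open Datasets row all ship with PyTorch Geometric; a minimal loading sketch is shown below. Note that the paper takes the WebKB, Wikipedia, and Actor graphs from the replication package of [40] (with duplicate edges removed), so the stock PyG loaders here are only an approximation of the exact data used.

```python
from torch_geometric.datasets import Actor, Planetoid, WebKB, WikipediaNetwork

root = "data"  # local download/cache directory (our choice)
cora = Planetoid(root, name="Cora")[0]                 # also: "CiteSeer", "PubMed"
cornell = WebKB(root, name="Cornell")[0]               # also: "Texas", "Wisconsin"
squirrel = WikipediaNetwork(root, name="squirrel")[0]  # also: "chameleon"
actor = Actor(root)[0]
```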
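
The stopping and model-selection rule in the Experiment Setup row (train for up to 5000 epochs, stop early once the training loss reaches $10^{-4}$, keep the state with the highest validation accuracy) corresponds to a loop of roughly the following shape. This is a generic sketch under our own assumptions (cross-entropy loss, a model taking `data.x` and `data.edge_index`), not the authors' script.

```python
import copy

import torch
import torch.nn.functional as F

def fit(model, data, optimizer, max_epochs=5000, loss_tol=1e-4):
    """Train for up to max_epochs, stop early once the training loss drops to
    loss_tol, and return the model state with the best validation accuracy."""
    best_val, best_state = 0.0, copy.deepcopy(model.state_dict())
    for epoch in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        # Track the model state with the best validation accuracy.
        model.eval()
        with torch.no_grad():
            pred = model(data.x, data.edge_index).argmax(dim=-1)
            val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean().item()
        if val_acc > best_val:
            best_val, best_state = val_acc, copy.deepcopy(model.state_dict())

        if loss.item() <= loss_tol:  # early-convergence criterion from the setup
            break
    model.load_state_dict(best_state)
    return model, best_val
```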