Hierarchical Graph Transformer with Adaptive Node Sampling
Authors: Zaixi Zhang, Qi Liu, Qingyong Hu, Chee-Kong Lee
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on real-world datasets to demonstrate the superiority of our method over existing graph transformers and popular GNNs. |
| Researcher Affiliation | Collaboration | Zaixi Zhang (1,2), Qi Liu (1,2), Qingyong Hu (3), Chee-Kong Lee (4). 1: Anhui Province Key Lab of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China; 2: State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China; 3: Hong Kong University of Science and Technology; 4: Tencent America |
| Pseudocode | Yes | Algorithm 1 ANS-GT. Input: total training epochs E; p_min; update period T; the number of sampled nodes N. Output: trained Graph Transformer model, optimized w_t. (A hedged sketch of this training loop is given after the table.) |
| Open Source Code | No | 1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code will be released once the paper is accepted. |
| Open Datasets | Yes | To comprehensively evaluate the effectiveness of ANS-GT, we conduct experiments on the six benchmark datasets including citation graphs Cora, CiteSeer, and PubMed [18]; Wikipedia graphs Chameleon, Squirrel; the Actor co-occurrence graph [5]; and WebKB datasets [28] including Cornell, Texas, and Wisconsin. |
| Dataset Splits | Yes | We set the train-validation-test split as 60%/20%/20%. |
| Hardware Specification | Yes | All models were trained on one NVIDIA Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions 'AdamW as the optimizer' but does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | Implementation Details. We adopt AdamW as the optimizer and set the hyper-parameter ϵ to 1e-8 and (β1, β2) to (0.99, 0.999). The peak learning rate is set to 2e-4 with a 100-epoch warm-up stage followed by a linear decay learning rate scheduler. We adopt the Variational Neighborhoods [26] with a coarsening rate of 0.01 as the default coarsening method... Parameter Settings. In the default setting, the dropout rate is set to 0.5, the end learning rate is set to 1e-9, the hidden dimension d is set to 128, the number of training epochs is set to 1,000, the update period T is set to 100, N is set to 20, M is set to 10, and the number of attention heads H is set to 8. We tune other hyper-parameters on each dataset by grid search. The search spaces of the batch size, number of data augmentations S, number of layers L, number of sampled nodes, number of sampled super-nodes, and number of global nodes are {8, 16, 32}, {4, 8, 16, 32}, {2, 3, 4, 5, 6}, {10, 15, 20, 25}, {0, 3, 6, 9}, {1, 2, 3} respectively. (An optimizer and learning-rate schedule sketch consistent with these settings is given after the table.) |
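The Algorithm 1 excerpt quoted in the Pseudocode row only lists inputs and outputs. The following is a minimal, hypothetical sketch of how such a loop could be organised: a toy linear model stands in for the graph transformer, random node batches stand in for the adaptive sampler, and the reward signal is a dummy placeholder (the paper derives rewards from attention scores). Names such as `train_sketch` and the Exp3-style weight update are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

def train_sketch(E=1000, T=100, N=20, p_min=0.05, n_strategies=4,
                 n_nodes=500, feat_dim=16, n_classes=3):
    # Toy graph data so the sketch runs end to end.
    x = torch.randn(n_nodes, feat_dim)                  # node features
    y = torch.randint(0, n_classes, (n_nodes,))         # node labels
    model = nn.Linear(feat_dim, n_classes)              # stand-in for the graph transformer
    opt = torch.optim.AdamW(model.parameters(), lr=2e-4,
                            betas=(0.99, 0.999), eps=1e-8)
    w = torch.ones(n_strategies)                        # bandit weights over sampling heuristics

    for epoch in range(E):
        # Mix the heuristics with a probability floor p_min, as listed in the algorithm inputs.
        probs = p_min + (1.0 - n_strategies * p_min) * w / w.sum()
        strategy = torch.multinomial(probs, 1).item()   # chosen heuristic would drive sampling;
        batch = torch.randperm(n_nodes)[:N]             # random nodes stand in for its N samples
        loss = nn.functional.cross_entropy(model(x[batch]), y[batch])
        opt.zero_grad()
        loss.backward()
        opt.step()

        if (epoch + 1) % T == 0:
            # Every T epochs the heuristics are rewarded; a dummy reward keeps this runnable.
            rewards = torch.rand(n_strategies)
            w = w * torch.exp(p_min * rewards / probs)  # Exp3-style multiplicative update

    return model, w                                     # trained model and optimized weights w_t
```

The section below sketches an optimizer and learning-rate schedule matching the hyper-parameters quoted in the Experiment Setup row (AdamW with eps=1e-8 and betas=(0.99, 0.999), peak learning rate 2e-4, 100-epoch linear warm-up, linear decay towards the 1e-9 end learning rate over 1,000 epochs). The scheduler shape is an assumption consistent with those numbers, not the authors' exact implementation, and the model is a toy stand-in.

```python
import torch

def build_optimizer(model, peak_lr=2e-4, end_lr=1e-9, warmup=100, total_epochs=1000):
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                            betas=(0.99, 0.999), eps=1e-8)

    def lr_lambda(epoch):
        if epoch < warmup:
            # Linear warm-up to the peak learning rate over the first `warmup` epochs.
            return (epoch + 1) / warmup
        # Linear decay from the peak towards end_lr for the remaining epochs.
        progress = (epoch - warmup) / max(1, total_epochs - warmup - 1)
        return max(end_lr / peak_lr, 1.0 - progress)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Usage: step the scheduler once per epoch alongside the optimizer.
model = torch.nn.Linear(128, 7)          # toy stand-in for the ANS-GT network
opt, sched = build_optimizer(model)
for epoch in range(1000):
    opt.step()                           # placeholder for the real forward/backward/step
    sched.step()
```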
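In both sketches the training data, model, and reward signals are synthetic placeholders; only the hyper-parameter values (E, T, N, learning rates, betas, eps) are taken from the quotes above.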