Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
Authors: Haotao Wang, Ziyu Jiang, Yuning You, Yan Han, Gaowen Liu, Jayanth Srinivasa, Ramana Kompella, Zhangyang "Atlas" Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of GMoE is validated through a series of experiments on a diverse set of tasks, including graph, node, and link prediction, using the OGB benchmark. |
| Researcher Affiliation | Collaboration | Haotao Wang¹, Ziyu Jiang², Yuning You², Yan Han¹, Gaowen Liu³, Jayanth Srinivasa³, Ramana Rao Kompella³, Zhangyang Wang¹. ¹University of Texas at Austin, ²Texas A&M University, ³Cisco Systems. {htwang, yh9442, atlaswang}@utexas.edu, {jiangziyu, yuning.you}@tamu.edu, {gaoliu, jasriniv, rkompell}@cisco.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/VITA-Group/Graph-Mixture-of-Experts. |
| Open Datasets | Yes | We conduct experiments on ten graph datasets in the OGB benchmark [48], including graph-level (i.e., ogbg-bbbp, ogbg-hiv, ogbg-moltoxcast, ogbg-moltox21, ogbg-molesol, and ogbg-freesolv), node-level (i.e., ogbn-protein, ogbn-arxiv), and link-level prediction (i.e., ogbl-ddi, ogbl-ppa) tasks. |
| Dataset Splits | Yes | The hyper-parameter values achieving the best performance on validation sets are selected to report results on test sets, following the routine in [48]. |
| Hardware Specification | Yes | In practice, on our NVIDIA A6000 GPU, the inference times for 10,000 samples are 30.2 ± 10.6 ms for GCN-MoE and 36.3 ± 17.2 ms for GCN. |
| Software Dependencies | No | The paper refers to using GCN and GIN models and other methodologies (e.g., GraphMAE [10]), but it does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | All model settings (e.g., number of layers, hidden feature dimensions, etc.) and training hyper-parameters (e.g., learning rates, training epochs, batch size, etc.) are identical as those in [48]. All three hyper-parameters n, m, k, together with the loss trade-off weight λ in Eq. (9), are tuned by grid searching: n ∈ {4, 8}, m ∈ {0, n/2, n}, k ∈ {1, 2, 4}, and λ ∈ {0.1, 1}. For training hyperparameters, we employ a batch size of 1024 to accelerate the training on the large pre-train dataset for both baselines and the proposed method. We follow [10] employing GIN [13] as the backbone, 0.001 as the learning rate, Adam as the optimizer, 0 as the weight decay, 100 as the training epochs number, and 0.25 as the masking ratio. |
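
For reference, the grid search quoted in the Experiment Setup row can be written as a short sweep over n, m, k, and λ. The sketch below is illustrative only: `train_and_evaluate(config)` is a hypothetical callback standing in for the training loop in the authors' repository, and the semantics of n, m, and k follow the paper rather than anything shown here.

```python
from itertools import product

def grid_search(train_and_evaluate):
    """Sweep the hyper-parameter grid quoted above and keep the
    configuration with the best validation score.

    `train_and_evaluate` is a hypothetical callback that trains a GMoE
    model for one configuration and returns its validation metric.
    """
    n_choices = [4, 8]            # candidate values for n
    k_choices = [1, 2, 4]         # candidate values for k
    lambda_choices = [0.1, 1.0]   # loss trade-off weight λ in Eq. (9)

    best_config, best_val = None, float("-inf")
    for n, k, lam in product(n_choices, k_choices, lambda_choices):
        for m in (0, n // 2, n):  # m is defined relative to n, so expand it per n
            config = {"n": n, "m": m, "k": k, "lambda": lam}
            val_metric = train_and_evaluate(config)  # score on the validation set
            if val_metric > best_val:
                best_config, best_val = config, val_metric
    return best_config, best_val
```

In practice, `train_and_evaluate` would wrap the training script from the linked repository and report the validation metric used to select the configuration reported on the test set, following the routine in [48].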