Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling

Authors: Haotao Wang, Ziyu Jiang, Yuning You, Yan Han, Gaowen Liu, Jayanth Srinivasa, Ramana Kompella, Zhangyang "Atlas" Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of GMoE is validated through a series of experiments on a diverse set of tasks, including graph, node, and link prediction, using the OGB benchmark.
Researcher Affiliation | Collaboration | Haotao Wang¹, Ziyu Jiang², Yuning You², Yan Han¹, Gaowen Liu³, Jayanth Srinivasa³, Ramana Rao Kompella³, Zhangyang Wang¹ (¹University of Texas at Austin, ²Texas A&M University, ³Cisco Systems); {htwang, yh9442, atlaswang}@utexas.edu, {jiangziyu, yuning.you}@tamu.edu, {gaoliu, jasriniv, rkompell}@cisco.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available at https://github.com/VITA-Group/Graph-Mixture-of-Experts.
Open Datasets | Yes | We conduct experiments on ten graph datasets in the OGB benchmark [48], including graph-level (i.e., ogbg-bbbp, ogbg-hiv, ogbg-moltoxcast, ogbg-moltox21, ogbg-molesol, and ogbg-freesolv), node-level (i.e., ogbn-proteins, ogbn-arxiv), and link-level prediction (i.e., ogbl-ddi, ogbl-ppa) tasks.
Dataset Splits | Yes | The hyper-parameter values achieving the best performance on validation sets are selected to report results on test sets, following the routine in [48].
Hardware Specification | Yes | In practice, on our NVIDIA A6000 GPU, the inference times for 10,000 samples are 30.2 ± 10.6 ms for GCN-MoE and 36.3 ± 17.2 ms for GCN.
Software Dependencies | No | The paper refers to using GCN and GIN models and other methodologies (e.g., GraphMAE [10]), but it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | All model settings (e.g., number of layers, hidden feature dimensions) and training hyper-parameters (e.g., learning rates, training epochs, batch size) are identical to those in [48]. The three hyper-parameters n, m, and k, together with the loss trade-off weight λ in Eq. (9), are tuned by grid search: n ∈ {4, 8}, m ∈ {0, n/2, n}, k ∈ {1, 2, 4}, and λ ∈ {0.1, 1}. For training hyper-parameters, we employ a batch size of 1024 to accelerate training on the large pre-training dataset for both baselines and the proposed method. Following [10], we employ GIN [13] as the backbone, a learning rate of 0.001, Adam as the optimizer, a weight decay of 0, 100 training epochs, and a masking ratio of 0.25.
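
For readers reproducing the data pipeline quoted in the Open Datasets and Dataset Splits rows, the OGB package exposes each dataset together with its official split indices. Below is a minimal sketch, assuming the ogb and torch_geometric packages; it uses ogbg-molhiv (the full OGB identifier for the "ogbg-hiv" task listed above), and the batch size is arbitrary for illustration rather than taken from the paper.

```python
# Minimal sketch: load an OGB graph-property-prediction dataset with its
# official train/valid/test split, as provided by the ogb package.
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch_geometric.loader import DataLoader

dataset = PygGraphPropPredDataset(name="ogbg-molhiv")   # "ogbg-hiv" in the paper's shorthand
split_idx = dataset.get_idx_split()                     # official OGB split indices

train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32)

evaluator = Evaluator(name="ogbg-molhiv")               # computes the metric OGB prescribes for this dataset
```

Selecting hyper-parameters on the validation split and reporting the chosen configuration on the test split matches the routine quoted in the Dataset Splits row.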
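
The inference times quoted in the Hardware Specification row can be collected with a routine along these lines; this is an illustrative sketch rather than the authors' benchmarking code, and `model` and `loader` are placeholders for a trained network and a data loader.

```python
import time
import torch

@torch.no_grad()
def time_inference_ms(model, loader, device="cuda"):
    """Rough wall-clock time (in ms) of a full forward pass over a loader on a GPU."""
    model = model.to(device).eval()
    torch.cuda.synchronize()    # flush pending GPU work before starting the clock
    start = time.perf_counter()
    for batch in loader:
        model(batch.to(device))
    torch.cuda.synchronize()    # wait for all kernels to finish before stopping the clock
    return (time.perf_counter() - start) * 1000.0
```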
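
The grid search described in the Experiment Setup row amounts to a loop over the stated value ranges for n, m, k, and λ. The sketch below assumes a `train_and_evaluate` helper that trains one configuration and returns its validation score; the helper shown here is a hypothetical stand-in, not part of the released code.

```python
import itertools
import random

def train_and_evaluate(n, m, k, lam):
    """Hypothetical stand-in: train one configuration and return its validation score."""
    return random.random()  # placeholder; replace with the actual training/evaluation routine

best_score, best_cfg = float("-inf"), None
for n, k, lam in itertools.product([4, 8], [1, 2, 4], [0.1, 1.0]):  # n, k, and lambda grids from the setup
    for m in [0, n // 2, n]:                                        # the m grid depends on n
        score = train_and_evaluate(n=n, m=m, k=k, lam=lam)
        if score > best_score:
            best_score, best_cfg = score, {"n": n, "m": m, "k": k, "lambda": lam}

print("Best configuration on validation:", best_cfg)
```

The configuration with the best validation score is the one reported on the test set, consistent with the Dataset Splits row.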