Online GNN Evaluation Under Test-time Graph Distribution Shifts

Authors: Xin Zheng, Dongjin Song, Qingsong Wen, Bo Du, Shirui Pan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on real-world test graphs under diverse graph distribution shifts could verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.
Researcher Affiliation | Collaboration | Xin Zheng, Monash University, Melbourne, Australia (xin.zheng@monash.edu); Dongjin Song, University of Connecticut, Storrs, USA (dongjin.song@uconn.edu); Qingsong Wen, Squirrel AI, Bellevue, USA (qingsongedu@gmail.com); Bo Du, Wuhan University, Wuhan, China (dubo@whu.edu); Shirui Pan, Griffith University, Queensland, Australia (s.pan@griffith.edu.au)
Pseudocode | Yes | Algorithm 1: Learning Behavior Discrepancy (LEBED) Score Computation.
Open Source Code | Yes | Code is available at https://github.com/Amanda-Zheng/LEBED
Open Datasets | Yes | We perform experiments on six real-world graph datasets with diverse graph data distribution shifts, including node feature shifts (Wu et al., 2022; Jin et al., 2023b), domain shifts (Wu et al., 2020), and temporal shifts (Wu et al., 2022). Detailed statistics of all these datasets are listed in Table A1 in Appendix B.
Dataset Splits | Yes | For all training graphs and validation graphs, we follow the processing procedures and splits of Wu et al. (2022) and Wu et al. (2020).
Hardware Specification | Yes | The running time comparison on Citationv2 in seconds is shown in Fig. 3 with a single GeForce RTX 3080 GPU and 200 iterations for w/ D_stru.
Software Dependencies | No | In our experiments, we use the PyTorch Geometric library (Fey & Lenssen, 2019) and four GeForce RTX 3080 GPUs for all implementations. However, specific version numbers for the software dependencies are not provided.
Experiment Setup | Yes | More details of these well-trained GNN models, including architectures, training hyper-parameters, and ground-truth test error distributions, are provided in Appendix D. We report the correlation between the proposed LEBED and the ground-truth test errors under unseen and unlabeled test graphs with distribution shifts, using R² and the rank correlation Spearman's ρ. R² ranges over [0, 1] and represents the degree of linear fit between two variables: the closer it is to 1, the higher the linear correlation. Spearman's ρ ranges over [-1, 1] and represents the monotonic correlation between two variables, with 1 indicating positive correlation and -1 indicating negative correlation.
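
The two correlation measures named in the Experiment Setup row can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' evaluation code. It assumes that R² refers to a simple linear fit with an intercept, in which case R² equals the squared Pearson correlation, and the helper names (pearson, average_ranks, r_squared, spearman_rho) and the sample numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def r_squared(x, y):
    """R^2 of a simple linear fit of y on x (assumed: fit with intercept)."""
    return pearson(x, y) ** 2


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not results from the paper.
    scores = [0.10, 0.25, 0.40, 0.55, 0.90]       # e.g. one score per test graph
    test_errors = [12.0, 15.5, 21.0, 24.0, 40.0]  # e.g. ground-truth error (%)
    print("R^2:", r_squared(scores, test_errors))
    print("Spearman rho:", spearman_rho(scores, test_errors))
```

In this sketch a score that rises monotonically with test error gives a Spearman's ρ of 1 even when the relationship is not linear, whereas R² reaches 1 only when the points lie on a straight line, which is why the two measures are reported together.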