Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Out-of-Distribution Generalized Graph Anomaly Detection with Homophily-aware Environment Mixup

Authors: Sibo Tian, Xin Wang, Zeyang Zhang, Haibo Chen, Wenwu Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we conduct various experiments to verify that our proposed method can handle structural distribution shifts in graph anomaly detection tasks by utilizing the invariant patterns in the disentangled subgraph representations. 4.1 Datasets 4.2 Baselines 4.3 Settings 4.4 Results 4.5 Ablation Study
Researcher Affiliation	Academia	Sibo Tian¹, Xin Wang¹, Zeyang Zhang¹, Haibo Chen¹, Wenwu Zhu¹; ¹Tsinghua University
Pseudocode	Yes	Algorithm 1: Training pipeline for HEM. Require: training epochs L, edge preservation ratio ρ. 1: for l = 1, ..., L do; 2: obtain H_ego, H_ne for each node as described in Section 3.1; 3: calculate classification loss L as Eq. 8; 4: generate new environment and calculate classification loss in generated environment L_AUG as Eq. 11; 5: calculate diversity loss L_DIV and PER loss L_PER as Eq. 12 and Eq. 13; 6: calculate inner loss L_inner and outer loss L_outer as Eq. 15 and Eq. 14; 7: update the homophily-aware environment mixup by minimizing inner loss; 8: update the disentangled encoder by minimizing outer loss; 9: end for.
Open Source Code	No	Answer: [NA] Justification: The data used in this paper are all public datasets, and it requires some time to get the code fully prepared to be released.
Open Datasets	Yes	We choose 3 commonly used GAD datasets, including Amazon, Yelp, and T-finance. We provide the details of the datasets in Appendix A.
Dataset Splits	Yes	The dataset is divided by the homophily ratio of nodes, split into 50%/10%/20%/20% for training, validation, in-distribution test, and out-of-distribution (OOD) test sets, respectively.
Hardware Specification	Yes	All experiments are conducted with: Operating System: Ubuntu 20.04.6 LTS; CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz; GPU: NVIDIA GeForce RTX 4090 with 24 GB of memory
Software Dependencies	Yes	Software: Python 3.12.4; CUDA 12.2; PyTorch 2.4.1
Experiment Setup	Yes	We adopt the above three real-world datasets as node-level graph anomaly detection tasks. We transform Amazon and Yelp into homogeneous graphs by simply merging all types of edges into one type to make a comparison between homogeneous GNNs and heterogeneous GNNs. We divide each dataset into 2 domains, with/without structural distribution shift. For brevity, we denote domains with/without distribution shift as w/ DS and w/o DS separately. Specifically, we sample nodes by their homophily ratio, which means nodes with low homophily are more likely to be divided into test data (w/ DS). Furthermore, we divide the w/o DS domain into training, validation, and test (w/o DS) data. Since graph anomaly detection datasets are usually with severe data imbalance, we use AUPRC (Area Under the Precision-Recall Curve) and Recall@K (the recall among the top-K highest-confidence predictions) as the evaluation metrics. ... By default we choose inner loop steps = 1 and outer loop steps = 500.
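
The Dataset Splits and Experiment Setup rows describe a homophily-based split and a Recall@K metric. The sketch below illustrates one way these could look; it is not the authors' code, which was not released at the time of this report. All function names (`node_homophily`, `split_by_homophily`, `recall_at_k`) are hypothetical. Two simplifying assumptions are made: the paper samples nodes so that low-homophily nodes are more likely to land in the OOD test set, whereas this sketch deterministically assigns the lowest-homophily nodes to it; and Recall@K is taken here to be the share of all true anomalies that appear among the K highest-scoring nodes.

```python
import random


def node_homophily(edges, labels):
    """Fraction of each node's neighbors that share its label (undirected graph)."""
    same = [0] * len(labels)
    degree = [0] * len(labels)
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            degree[a] += 1
            same[a] += int(labels[a] == labels[b])
    # Assumption: isolated nodes are given homophily 1.0 (an arbitrary convention).
    return [s / d if d else 1.0 for s, d in zip(same, degree)]


def split_by_homophily(homophily, fractions=(0.5, 0.1, 0.2, 0.2), seed=0):
    """Split node indices into train / val / in-distribution test / OOD test.

    Simplification: the lowest-homophily nodes form the OOD test set outright,
    rather than being sampled with higher probability as the paper describes.
    """
    n = len(homophily)
    order = sorted(range(n), key=lambda i: homophily[i])
    n_ood = round(fractions[3] * n)
    ood_test, rest = order[:n_ood], order[n_ood:]
    # Shuffle the remaining (w/o DS) nodes before carving out the other sets.
    random.Random(seed).shuffle(rest)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return {
        "train": rest[:n_train],
        "val": rest[n_train:n_train + n_val],
        "id_test": rest[n_train + n_val:],
        "ood_test": ood_test,
    }


def recall_at_k(scores, labels, k):
    """Share of all true anomalies (label 1) found among the k highest scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(labels)
    return sum(labels[i] for i in top) / total if total else 0.0
```

With the default fractions, a 10-node graph yields 5 training, 1 validation, 2 in-distribution test, and 2 OOD test nodes, matching the 50%/10%/20%/20% proportions reported above.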