Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees

Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically validate our theoretical insights, we introduce Graph Generality Identifier on Task-Trees (GIT), a graph foundation model that demonstrates strong performance on over 30 graphs across five domains via finetuning, in-context learning, and zero-shot generalization. Extensive experiments across 32 graphs and five domains validate the effectiveness of GIT under finetuning, in-context, and zero-shot settings.
Researcher Affiliation | Academia | (1) University of Notre Dame, (2) University of Connecticut. Correspondence to: Zehong Wang <EMAIL>, Yanfang Ye <EMAIL>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks; methods are described in prose and mathematical formulations.
Open Source Code | Yes | Code and data are available at https://github.com/Zehong-Wang/GIT.
Open Datasets | Yes | Datasets. We conduct experiments on over 30 text-attributed graphs spanning five domains: academic networks, e-commerce networks, knowledge graphs, molecular graphs, and temporal graphs. Pretraining is performed on a diverse subset including Arxiv (academic), Products (e-commerce), WN18RR and FB15K237 (knowledge), and Chemblpre and PCBA (molecular). Specialization is evaluated on representative datasets for each domain: Arxiv, Products, FB15K237, and PCBA. Since the temporal graphs are e-commerce temporal graphs, we also use Products for SFT to assess robustness under temporal distribution shifts. We provide the full dataset details in Appendix E.1. In total, we utilize 32 datasets spanning five domains. Since these datasets are text-attributed graphs, we follow Liu et al. (2024a) and use Sentence-BERT (Reimers & Gurevych, 2019) to align all node textual features into a shared 768-dimensional embedding space. The dataset statistics are presented in Table 8. For the temporal graphs, we split each graph into 10 snapshots, with the statistics shown in Figure 8.
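The 10-snapshot temporal split quoted above can be sketched in a few lines. The paper does not state how snapshot boundaries are chosen, so the equal-width time bins below (and the `split_into_snapshots` helper itself) are an illustrative assumption, not code from the authors' repository:

```python
from collections import defaultdict

def split_into_snapshots(edges, num_snapshots=10):
    """Partition timestamped edges (u, v, t) into time-ordered snapshots.

    ASSUMPTION: equal-width time bins over [t_min, t_max]; the paper
    only says each temporal graph is split into 10 snapshots.
    """
    times = [t for _, _, t in edges]
    t_min, t_max = min(times), max(times)
    width = (t_max - t_min) / num_snapshots or 1  # guard against a single timestamp
    snapshots = defaultdict(list)
    for u, v, t in edges:
        # Clamp the last bin so t_max falls into snapshot num_snapshots - 1.
        idx = min(int((t - t_min) / width), num_snapshots - 1)
        snapshots[idx].append((u, v, t))
    return [snapshots[i] for i in range(num_snapshots)]
```

Every edge lands in exactly one snapshot, so downstream per-snapshot statistics (as in the paper's Figure 8) can be computed by iterating over the returned list.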
Dataset Splits | Yes | Splitter. For each dataset, we use the same splitting strategy as provided in the original paper (Chen et al., 2024b; Galkin et al., 2024; Feng et al., 2024; Zhang et al., 2024b). If multiple splits are provided, we evaluate model performance on each split using different random seeds. For datasets with a single split, we repeat the experiments five times with different random seeds. For GDELT and ICEWS1819, which are originally temporal knowledge graphs, we apply an 80%/10%/10% split based on timestamps for training, validation, and testing. For the temporal graphs Enron and Googlemap CT used for edge classification, we split each snapshot by timestamps, using the first 70% for training, the next 15% for validation, and the remaining 15% for testing.
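The timestamp-based splits quoted above (80/10/10 for GDELT and ICEWS1819; 70/15/15 per snapshot for Enron and Googlemap CT) are chronological cuts, which can be sketched as below. `split_by_time` is a hypothetical helper written for illustration, not code from the authors' repository:

```python
def split_by_time(edges, train_frac=0.8, val_frac=0.1):
    """Chronological train/val/test split of timestamped edges (u, v, t).

    Earlier edges go to training, later ones to validation and test,
    mirroring the 80/10/10 and 70/15/15 splits described above.
    """
    edges = sorted(edges, key=lambda e: e[2])  # sort by timestamp
    n = len(edges)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = edges[:n_train]
    val = edges[n_train:n_train + n_val]
    test = edges[n_train + n_val:]
    return train, val, test
```

Because the cut points follow the sorted timestamps, every training edge precedes every validation edge in time, which is what makes the split a test of temporal generalization rather than a random hold-out.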
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor speeds, or memory amounts) for its experiments. While an efficiency analysis with time-consumption metrics is included, the hardware on which these measurements, or the main experiments, were run is not specified.
Software Dependencies | No | The paper mentions using Sentence-BERT (Reimers & Gurevych, 2019) for encoding node features and GraphSAGE (Hamilton et al., 2017) as the backbone encoder. However, it does not provide version numbers for these libraries or for any other software dependencies such as Python, PyTorch, or CUDA, which would be needed for exact replication.
Experiment Setup | Yes | The model architecture and pretraining parameters of our GIT are presented in Table 9. The specific fine-tuning hyperparameters, categorized by domain, are shown in Tables 10, 11, 12, 13, and 14. We observe that increasing the number of hidden dimensions from 128 to 2,048 significantly improves model performance across all domains. For the baseline methods, we follow the hyperparameters reported in (Liu et al., 2024a; Chen et al., 2024b). If the hyperparameters are not provided, we set the number of epochs to 1,000, the batch size to 4,096, early stopping at 200, and the hidden dimension to 768, using a 2-layer GraphSAGE as the backbone with batch normalization and ReLU activation. For optimization, we use AdamW with a weight decay of 1e-6 and tune the learning rate over {1e-3, 1e-4, 1e-5}, reporting the best performance. For methods with attention mechanisms, we use 4 attention heads.
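The learning-rate tuning quoted above (try each of {1e-3, 1e-4, 1e-5}, report the best) amounts to a one-dimensional grid search. A minimal sketch follows; `train_and_evaluate` is a hypothetical callable standing in for a full training run with the fixed settings above (1,000 epochs, batch size 4,096, early stopping at 200, AdamW with weight decay 1e-6), and is not part of the authors' code:

```python
def select_learning_rate(train_and_evaluate, lrs=(1e-3, 1e-4, 1e-5)):
    """Run one training job per candidate learning rate and keep the best.

    `train_and_evaluate` maps a learning rate to a validation score
    (higher is better); this mirrors the paper's "tune the learning
    rate and report the best performance" procedure.
    """
    results = {lr: train_and_evaluate(lr) for lr in lrs}
    best_lr = max(results, key=results.get)
    return best_lr, results[best_lr]
```

Selecting on a validation score (rather than the test split) keeps the reported test numbers honest; the paper does not spell out which split drives the choice, so that detail is an assumption here.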