Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments

Authors: Enjun Du, Xunkai Li, Tian Jin, Zhihan Zhang, Rong-Hua Li, Guoren Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate GraphMaster comprehensively, we formulate four research questions: (RQ1): Can GraphMaster generate high-quality text-attributed graph data in data-limited environment? (RQ2): Can the graph data synthesized by GraphMaster retain the original graph features well? (RQ3): Can GraphMaster maintain interpretability well? (RQ4): What is the relative contribution of each component in GraphMaster to the overall synthesis quality? ... We evaluate GraphMaster's ability to synthesize high-quality graph data by applying it to enhance the data-limited datasets we created and assessing whether the enhanced datasets improve downstream model performance. We employ standard metrics including Accuracy and F1 Score as evaluation criteria, with higher values indicating superior performance.
Researcher Affiliation	Academia	Enjun Du¹, Xunkai Li¹, Tian Jin², Zhihan Zhang¹, Rong-Hua Li¹, Guoren Wang¹; ¹Beijing Institute of Technology; ²The Hong Kong University of Science and Technology (Guangzhou)
Pseudocode	Yes	Algorithm 1 M-Preserving Graph Sampling
Open Source Code	Yes	Code is available on https://github.com/EnjunDu/GraphMaster.
Open Datasets	Yes	Our experiments utilize six widely recognized text-attributed graph datasets: Cora [32], Citeseer [13], Wikics [10], Arxiv2023 [36], and History and Children [49]. It is worth noting that in order to better simulate the data-limited environment to test the effect of data synthesis, we created 6 data-limited datasets, namely SubCora, SubCiteseer, SubWikics, SubHistory, SubArxiv2023, and SubChildren (details are given in Appendix C). ... After the paper is accepted, we will open source the complete data-limited dataset and its creation code...
Dataset Splits	Yes	Table 3: Dataset Statistics. SubCora: 1354 nodes, 2486 edges, 7 classes, 99 Louvain communities, 815 training nodes, 267 validation nodes, 272 test nodes.
Hardware Specification	Yes	We ran the entire experiment on eight 80G A100 GPUs... We selected QwQ-32B [37] as the large language model for these two baselines, and used two A6000 GPUs with 48G memory for the experiments.
Software Dependencies	No	In training the GNN model, we first initialized the text attributes with Sentence-BERT [35] to generate the initial features before proceeding with training.
Experiment Setup	Yes	For the background knowledge nodes, we set N = 30, and for the newly generated nodes, we configured M% = 15% (the hyperparameter selection analysis is given in Appendix E). In training the GNN model, we first initialized the text attributes with Sentence-BERT [35] to generate the initial features before proceeding with training. To ensure the robustness of our experiments, we repeated each experiment 50 times and reported the mean and standard deviation of the results. ... Appendix E: Knowledge extraction: Sample size N = 30 nodes provides sufficient context without introducing noise; Node generation: Setting M% = 15% of knowledge nodes balances quantity and quality; Community detection: Parameters µ = 0.5 and γ = 0.5 effectively balance semantic and structural factors; Stochastic sampling: β = 2.0 maintains appropriate exploration-exploitation balance; Edge formation: For semantic mode, (θ1, θ2, θ3) = (0.6, 0.3, 0.1); for topological mode, (0.2, 0.5, 0.3); Quality assessment: Initial threshold τ0 = 7.0 with adaptive update rate ζ = 0.1; Convergence criteria: ϵ = 0.05 provides sufficient refinement iterations; Objective weights: Initialize λsem = λstruct = λbal = 0.33 with learning rate η = 0.05.
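
The settings quoted in the Dataset Splits and Experiment Setup rows can be gathered into a single configuration object for anyone attempting a rerun. The following is a minimal sketch and is not taken from the GraphMaster repository: the class name, field names, and helper functions are hypothetical, and only the numeric values come from the rows above. The paper excerpt does not say whether the reported standard deviation is the sample or the population form, so the sample form is assumed here.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as quoted in the table; all field names are hypothetical."""
    knowledge_sample_size: int = 30            # N
    new_node_fraction: float = 0.15            # M% = 15%
    community_mu: float = 0.5                  # µ
    community_gamma: float = 0.5               # γ
    sampling_beta: float = 2.0                 # β
    edge_weights_semantic: Tuple[float, float, float] = (0.6, 0.3, 0.1)     # (θ1, θ2, θ3)
    edge_weights_topological: Tuple[float, float, float] = (0.2, 0.5, 0.3)  # (θ1, θ2, θ3)
    quality_threshold_init: float = 7.0        # τ0
    quality_update_rate: float = 0.1           # ζ
    convergence_eps: float = 0.05              # ϵ
    objective_weight_init: float = 0.33        # λsem = λstruct = λbal
    objective_learning_rate: float = 0.05      # η
    num_repeats: int = 50                      # runs per experiment


def split_fractions(train: int, val: int, test: int) -> Tuple[int, Tuple[float, float, float]]:
    """Return the total node count and the train/val/test fractions."""
    total = train + val + test
    return total, (train / total, val / total, test / total)


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (sample form is an assumption)."""
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    setup = ReportedSetup()
    # SubCora split sizes from the Dataset Splits row.
    total, fractions = split_fractions(815, 267, 272)
    print(total, [round(f, 3) for f in fractions])
    print(setup)
```

The SubCora split sizes sum to the reported node count of 1354, which corresponds to roughly a 60/20/20 train/validation/test partition.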