Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Flatten Graphs as Sequences: Transformers are Scalable Graph Generators

Authors: Dexiong Chen, Markus Krimmel, Karsten Borgwardt

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the performance of AUTOGRAPH on several graph generation benchmarks, including both small and large graphs, and synthetic and real-world molecular datasets. Our experiments compare its performance to several SOTA methods and particularly focus on evaluating the following aspects: (1) We show its ability to generate relatively small graphs with a 100-fold inference speedup compared to diffusion-based models while maintaining or even improving structural validity.
Researcher Affiliation | Academia | Dexiong Chen, Max Planck Institute of Biochemistry, Martinsried, Germany, EMAIL
Pseudocode | Yes | Algorithm 1 Causal and Hamiltonian SENT Sampling
Open Source Code | Yes | Our code is available at https://github.com/BorgwardtLab/AutoGraph.
Open Datasets | Yes | Small synthetic graphs: Planar and SBM. Both of these datasets are from Martinkus et al. [45]. ... Large graphs: Proteins and Point Clouds. The Proteins dataset includes graph representations (contact maps) of proteins from Dobson and Doig [20]. ... QM9. The QM9 dataset, from Wu et al. [68]. ... MOSES and GuacaMol. The MOSES and GuacaMol datasets are obtained from the respective benchmark tools of Polykovskiy et al. [53] and Brown et al. [6]. ... PubChem-10M. PubChem-10M is a subset of about 10M molecules from PubChem curated by Chithrananda et al. [14].
Dataset Splits | Yes | We adopt the standard train/validation/test splits provided in the original sources. The statistics about the datasets are summarized in Table 8.
Hardware Specification | Yes | Experiments were conducted on a shared computing cluster with various CPU and GPU configurations, including 16 NVIDIA H100 (80GB) GPUs. Each experiment was allocated resources on a single GPU, along with 8 CPUs and up to 48GB of system RAM. The run-time of each model was measured on a single NVIDIA H100 GPU.
Software Dependencies | No | Our implementation leverages the Hugging Face framework [31], providing users with a flexible interface to experiment with SOTA language models for graph generation. ... We employ the AdamW optimizer with a gradient clipping threshold of 1.0, a weight decay of 0.1, and a learning rate schedule with a linear warmup followed by cosine decay, peaking at 6e-4.
Experiment Setup | Yes | We maintain a consistent model architecture and size throughout all experiments, specifically using the small GPT configuration (768 hidden dimensions, 12 layers, 12 attention heads). ... We fix the context length to 2048 and use a batch size of 128 if possible, otherwise 64 for larger graphs. In particular, we employ the AdamW optimizer with a gradient clipping threshold of 1.0, a weight decay of 0.1, and a learning rate schedule with a linear warmup followed by cosine decay, peaking at 6e-4. The AdamW hyperparameters are set to β = (0.9, 0.95). ... Each model was trained for 200000, 400000, or 800000 iterations, depending on the dataset size.
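
The learning rate schedule quoted under Experiment Setup (linear warmup followed by cosine decay, peaking at 6e-4) can be illustrated with a minimal sketch. This is not the authors' implementation. The peak value and the 200000-iteration training length come from the quoted text. The warmup length (`WARMUP_STEPS`) and the final learning rate (`MIN_LR`) are not stated in this summary, so the values below are assumptions chosen only for illustration, and the function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 6e-4         # peak learning rate, as quoted in the Experiment Setup row
TOTAL_STEPS = 200_000  # shortest of the three quoted training lengths
WARMUP_STEPS = 2_000   # ASSUMPTION: the warmup length is not given in this summary
MIN_LR = 0.0           # ASSUMPTION: the final learning rate is not given in this summary


def lr_at(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
          total_steps=TOTAL_STEPS, min_lr=MIN_LR):
    """Return the learning rate at a given training step.

    Linear warmup from 0 to peak_lr over warmup_steps, then cosine
    decay from peak_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine decay phase.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1_000, 2_000, 101_000, 200_000):
        print(s, lr_at(s))
```

Under these assumptions the rate rises linearly to 6e-4 at step 2000, falls to half the peak midway through the decay phase, and reaches the assumed floor at the final step. In a training loop, the returned value would be assigned to the optimizer's learning rate before each update.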