Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Flatten Graphs as Sequences: Transformers are Scalable Graph Generators

Authors: Dexiong Chen, Markus Krimmel, Karsten Borgwardt

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we evaluate the performance of AUTOGRAPH on several graph generation benchmarks, including both small and large graphs, and synthetic and real-world molecular datasets. Our experiments compare its performance to several SOTA methods and particularly focus on evaluating the following aspects: (1) We show its ability to generate relatively small graphs with a 100-fold inference speedup compared to diffusion-based models while maintaining or even improving structural validity.
Researcher Affiliation	Academia	Dexiong Chen Max Planck Institute of Biochemistry Martinsried, Germany EMAIL
Pseudocode	Yes	Algorithm 1 Causal and Hamiltonian SENT Sampling
Open Source Code	Yes	Our code is available at https://github.com/BorgwardtLab/AutoGraph.
Open Datasets	Yes	Small synthetic graphs: Planar and SBM. Both of these datasets are from Martinkus et al. [45]. ... Large graphs: Proteins and Point Clouds. The Proteins dataset includes graph representations (contact maps) of proteins from Dobson and Doig [20]. ... QM9. The QM9 dataset, from Wu et al. [68]. ... MOSES and GuacaMol. The MOSES and GuacaMol datasets are obtained from the respective benchmark tools of Polykovskiy et al. [53] and Brown et al. [6]. ... PubChem-10M. PubChem-10M is a subset of about 10M molecules from PubChem curated by Chithrananda et al. [14].
Dataset Splits	Yes	We adopt the standard train/validation/test splits provided in the original sources. The statistics about the datasets are summarized in Table 8.
Hardware Specification	Yes	Experiments were conducted on a shared computing cluster with various CPU and GPU configurations, including 16 NVIDIA H100 (80GB) GPUs. Each experiment was allocated resources on a single GPU, along with 8 CPUs and up to 48GB of system RAM. The run-time of each model was measured on a single NVIDIA H100 GPU.
Software Dependencies	No	Our implementation leverages the Hugging Face framework [31], providing users with a flexible interface to experiment with SOTA language models for graph generation. ... We employ the AdamW optimizer with a gradient clipping threshold of 1.0, a weight decay of 0.1, and a learning rate schedule with a linear warmup followed by cosine decay, peaking at 6e-4.
Experiment Setup	Yes	We maintain a consistent model architecture and size throughout all experiments, specifically using the small GPT configuration (768 hidden dimensions, 12 layers, 12 attention heads). ... We fix the context length to 2048 and use a batch size of 128 if possible, otherwise 64 for larger graphs. In particular, we employ the AdamW optimizer with a gradient clipping threshold of 1.0, a weight decay of 0.1, and a learning rate schedule with a linear warmup followed by cosine decay, peaking at 6e-4. The AdamW hyperparameters are set to β = (0.9, 0.95). ... Each model was trained for 200000, 400000, or 800000 iterations, depending on the dataset size.
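
The learning rate schedule quoted in the Experiment Setup row (linear warmup followed by cosine decay, peaking at 6e-4) can be sketched as below. This is a minimal illustration, not the authors' implementation. The peak value and the 200000-iteration budget come from the quoted text. The warmup length (`WARMUP_STEPS`) and the final learning rate (`MIN_LR`) are not reported in the excerpt and are assumptions chosen only to make the example concrete. The function name `lr_at` is likewise hypothetical.

```python
import math

PEAK_LR = 6e-4          # reported: schedule peaks at 6e-4
TOTAL_STEPS = 200_000   # reported: smallest of the three training budgets
WARMUP_STEPS = 2_000    # ASSUMPTION: warmup length is not given in the excerpt
MIN_LR = 0.0            # ASSUMPTION: final learning rate is not given in the excerpt


def lr_at(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
          total_steps=TOTAL_STEPS, min_lr=MIN_LR):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Half-cosine from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1_000, 2_000, 101_000, 200_000):
        print(s, lr_at(s))
```

Under these assumptions the rate rises linearly to 6e-4 at step 2000, falls to half the peak midway through the decay phase, and reaches `MIN_LR` at the final step. A different warmup length or floor changes only the constants, not the shape.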