Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

The Underappreciated Power of Vision Models for Graph Structural Understanding

Authors: Xinjian Zhao, Wei Pang, Zhongkai Xue, Xiangru Jian, Lei Zhang, Yaoyao Xu, Xiaozhuang Song, Shu Wu, Tianshu Yu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our evaluations demonstrate that pure vision encoders perform comparably to specialized GNNs on established graph benchmarks... Our experiments yield several key findings: On tasks requiring the abstraction of global graph properties, vision models demonstrate significant advantages and superior generalization capabilities... Table 1: Performance comparison on different datasets. Results show the accuracy (%) of different models, reported as mean ± std over 5 runs.
Researcher Affiliation	Academia	School of Data Science, The Chinese University of Hong Kong, Shenzhen; Institute of Automation, Chinese Academy of Sciences; Cheriton School of Computer Science, University of Waterloo
Pseudocode	Yes	E.4 Algorithmic Implementation The implementation of graph coverings in our code precisely follows the mathematical constructions in the above definitions: Algorithm 1 Generate Bipartite Double Cover ... Algorithm 2 Generate k-fold Cyclic Cover from Real-world Network
Open Source Code	Yes	The code is available at https://github.com/LOGO-CUHKSZ/GraphAbstract
Open Datasets	Yes	Traditional benchmarks in domains like molecular prediction, citation networks, and protein interaction graphs inadvertently couple domain-specific node features with topology... To enhance diversity and realism in our generated graphs, we extracted a collection of base graphs from real-world datasets. For MUTAG, we directly utilized the molecular graphs. For Cora, which is a large citation network... [47] TUDataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663, 2020.
Dataset Splits	Yes	Our evaluation includes three test settings of increasing difficulty: ID (In-Distribution) setting uses test graphs containing 20-50 nodes, matching the training distribution. Near-OOD (Near Out-of-Distribution) setting contains graphs with 40-100 nodes, representing a moderate scale shift. Far-OOD (Far Out-of-Distribution) setting features graphs with 60-150 nodes, constituting a significant scale shift. Table 4: Dataset statistics across our four benchmark tasks. Each cell shows the number of graphs followed by the node count range in parentheses. Columns: Topology | Symmetry | Spectral Gap | Bridge Count. Train: 3000 (20-50) | 2000 (30-60) | 3000 (20-50) | 2500 (20-50). Val: 300 (20-50) | 200 (30-60) | 300 (20-50) | 250 (20-50). Test (ID): 300 (20-50) | 600 (30-60) | 300 (20-50) | 250 (20-50). Test (Near-OOD): 300 (40-100) | 600 (50-100) | 300 (40-100) | 250 (40-100). Test (Far-OOD): 300 (60-150) | 600 (70-150) | 300 (60-150) | 250 (60-150).
Hardware Specification	Yes	All experiments are conducted on 4 NVIDIA A800 GPUs.
Software Dependencies	No	All datasets are implemented using PyTorch Geometric. We use the Adam optimizer. We use the pynauty library to verify the symmetry. The specific version numbers for these software components are not provided.
Experiment Setup	Yes	All models are trained with a batch size of 128 for a maximum of 200 epochs, employing early stopping with a patience of 30 epochs to prevent overfitting. We use the Adam optimizer with different learning rates: 1e-5 for vision backbone parameters, 1e-3 for GNN models and classifier heads. Weight decay is set to 1e-4 for vision models. For classification tasks (Topology, Symmetry), we use cross-entropy loss, while for regression tasks (Spectral Gap, Bridge Counting), we employ mean squared error loss. All experiments are conducted on 4 NVIDIA A800 GPUs. For consistent evaluation, we measure accuracy for classification tasks, while regression tasks use Mean Absolute Error (MAE). To ensure reproducibility, we set fixed random seeds [0, 1, 2, 3, 4] for all experiments, controlling the initialization of model parameters and data splitting. For our graph neural network models, we experiment with varying numbers of layers ranging from 2 to 4, with a consistent hidden dimension size of 128 across all architectures. Dropout with a rate of 0.5 is applied throughout the networks to prevent overfitting. For vision-based models, we use standard architectures: ResNet-50, ViT-B/16, Swin Transformer-Tiny, and ConvNeXtV2-Tiny. All models resize graph images to 224 x 224 resolution as input.
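
The Experiment Setup row quotes enough hyperparameters to write them down as a configuration and to illustrate the stopping rule. The following is a minimal sketch in plain Python, not the authors' code: the names ExperimentConfig, EarlyStopping, and epochs_run are hypothetical, and it assumes that early stopping monitors a validation metric (accuracy for classification, MAE for regression), which the quoted text does not state.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    batch_size: int = 128
    max_epochs: int = 200
    patience: int = 30
    lr_vision_backbone: float = 1e-5
    lr_gnn_and_heads: float = 1e-3
    weight_decay_vision: float = 1e-4
    gnn_hidden_dim: int = 128
    gnn_dropout: float = 0.5
    image_size: int = 224
    seeds: Tuple[int, ...] = (0, 1, 2, 3, 4)


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int, higher_is_better: bool) -> None:
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best: Optional[float] = None
        self.epochs_without_improvement = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric; return True if training should stop."""
        if self.best is None:
            improved = True
        elif self.higher_is_better:
            improved = metric > self.best
        else:
            improved = metric < self.best
        if improved:
            self.best = metric
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_run(val_metrics: Sequence[float], config: ExperimentConfig,
               higher_is_better: bool) -> int:
    """Return how many epochs would run for a given validation curve."""
    stopper = EarlyStopping(config.patience, higher_is_better)
    capped = val_metrics[:config.max_epochs]
    for epoch, metric in enumerate(capped, start=1):
        if stopper.step(metric):
            return epoch
    return len(capped)
```

Under these assumptions, a regression run would pass higher_is_better=False (MAE) and a classification run higher_is_better=True (accuracy), repeating the whole procedure once per seed in config.seeds.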