Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A CLT for Polynomial GNNs on Community-Based Graphs

Authors: Luciano Vinas, Arash Amini

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	We consider the empirical distribution of the embeddings of a k-layer polynomial GNN on a semi-supervised node classification task and prove a central limit theorem for them. Assuming a community-based model for the underlying graph, with growing average degree ν_n, we show that the empirical distribution of the centered features, when scaled by ν_n^{k−1/2}, converges in 1-Wasserstein distance to a centered stable mixture of multivariate normal distributions. ... Our results provide a precise and nuanced lens on how oversmoothing presents itself in the large graph limit, in the sparse regime. ... No benchmarking was done for this paper. Figures provided are for visual aid. No tables or statistical tests were provided.
Researcher Affiliation	Academia	Luciano Vinas, University of California, Los Angeles; Arash A. Amini, University of California, Los Angeles
Pseudocode	No	The paper primarily presents theoretical derivations, proofs, and analyses of GNN embeddings. It does not include any explicitly labeled pseudocode or algorithm blocks. Methods are described in prose.
Open Source Code	No	Code is not instrumental to understanding our result. Plots are supplementary to the theoretical results shown in this paper.
Open Datasets	No	We assume the graph and its node features are generated from a community-based model. Let z = (z_i)_{i=1}^n ∈ [L]^n be a vector of latent node labels, assigning each node i to one of L communities or classes. ... Specifically, we adopt the Contextual Stochastic Block Model (CSBM) [10].
Dataset Splits	No	The paper uses synthetic data generated from models like the Contextual Stochastic Block Model (CSBM) and Erdős-Rényi graphs for its simulations. It describes the parameters for generating these graphs (e.g., number of nodes, class proportions, average degree) for the figures (e.g., G.2 Details for Figure 2: "3-class CSBM with n = 8192 nodes. Class proportions were π_1 = 0.25, π_2 = 0.45, π_3 = 0.30"), but it does not involve splitting an existing, pre-collected dataset into training, validation, and test sets. Therefore, dataset split information is not applicable in the context of this theoretical work with simulations.
Hardware Specification	Yes	Computer resources included one local machine with 64 GB of RAM and an NVIDIA 4090 GPU.
Software Dependencies	No	The paper does not explicitly mention any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific solvers).
Experiment Setup	Yes	G.2 Details for Figure 2: The plots in Figure 2 were generated using a 3-class CSBM with n = 8192 nodes. Class proportions were π_1 = 0.25, π_2 = 0.45, π_3 = 0.30, average degree parameter was ν_n = 8192, and the inter-community probability scaling matrix was B = (ν_n/n) · [[0.4, 1, 1], [1, 0.4, 1], [1, 1, 0.4]]. Initial features X_i were d = 2 dimensional and generated as X_i ∼ N(M_{z_i,*}, σ^2 I_2) with σ^2 = 0.25 and M_{1,*} = [2, 2]^T, M_{2,*} = [−1, 3]^T, and M_{3,*} = [−1, 0]^T. Cross entropy training was run for a single linear classifier layer for 10 epochs with learning rate 10 on the SGD optimization.
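
The Figure 2 setup quoted under Experiment Setup can be sketched in code. The following is a minimal NumPy sketch, not the authors' implementation: the function name sample_csbm, the i.i.d. sampling of labels from the class proportions, and the fixed seed are assumptions made for illustration, and n is reduced from 8192 to 512 (keeping the average degree parameter equal to n) so the dense n x n arrays stay small.

```python
import numpy as np


def sample_csbm(n, pi, B, M, sigma2, seed=0):
    """Sample labels z, adjacency A, and features X from a CSBM.

    Hypothetical helper: labels are drawn i.i.d. from pi (an assumption),
    edges are independent Bernoulli(B[z_i, z_j]) for i < j, and features
    are Gaussian with class mean M[z_i] and covariance sigma2 * I.
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)          # latent community labels
    P = B[z[:, None], z[None, :]]                  # n x n edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)   # sample each pair once
    A = (upper | upper.T).astype(np.float64)       # symmetric, no self-loops
    noise = rng.standard_normal((n, M.shape[1]))
    X = M[z] + np.sqrt(sigma2) * noise             # class mean plus noise
    return z, A, X


# Parameters quoted for Figure 2, with n reduced from 8192 for speed.
n = 512
nu_n = 512.0  # average degree parameter; equals n in the quoted setup
pi = np.array([0.25, 0.45, 0.30])
B = (nu_n / n) * np.array([[0.4, 1.0, 1.0],
                           [1.0, 0.4, 1.0],
                           [1.0, 1.0, 0.4]])
M = np.array([[2.0, 2.0], [-1.0, 3.0], [-1.0, 0.0]])

z, A, X = sample_csbm(n, pi, B, M, sigma2=0.25)
print(A.shape, X.shape, A.sum(axis=1).mean())
```

Because the average degree parameter equals n in this configuration, the factor ν_n/n is 1, so nodes in different communities are always connected and nodes in the same community are connected with probability 0.4.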