Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Asymptotics of Network Embeddings Learned via Subsampling
Authors: Andrew Davison, Morgane Austern
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments We perform experiments on both simulated and real data, illustrating the validity of our theoretical results. We also highlight that the use of a Krein inner product ⟨ω, diag(I_p, −I_q) ω′⟩ between embedding vectors can lead to improved performance when using the learned embeddings for downstream tasks. To do so, we will consider a semi-supervised multi-label node classification task on two different data sets: a protein-protein interaction network (Grover and Leskovec, 2016; Breitkreutz et al., 2008) with 3,890 vertices, 76,583 edges and 50 classes; and the Blog Catalog data set (Tang and Liu, 2009) with 10,312 vertices, 333,983 edges and 39 classes. |
| Researcher Affiliation | Academia | Andrew Davison, Department of Statistics, Columbia University, New York, NY 10027-5927, USA; Morgane Austern, Department of Statistics, Harvard University, Cambridge, MA 02138-2901, USA |
| Pseudocode | Yes | Algorithm 1 (Uniform vertex sampling) Given a graph Gn and number of samples k, we select k vertices from Gn uniformly and without replacement, and then return S(Gn) as the induced subgraph using these sampled vertices. ... Algorithm 4 (Random walk sampling with unigram negative sampling) Given a graph Gn, a walk length k, number of negative samples l per positively sampled vertex, unigram parameter α and an initial distribution π0(· \| Gn), we ... |
| Open Source Code | Yes | 1. Code is available at https://github.com/aday651/embed-asym-experiments. |
| Open Datasets | Yes | a protein-protein interaction network (Grover and Leskovec, 2016; Breitkreutz et al., 2008) with 3,890 vertices, 76,583 edges and 50 classes; and the Blog Catalog data set (Tang and Liu, 2009) with 10,312 vertices, 333,983 edges and 39 classes. |
| Dataset Splits | Yes | We simultaneously train a multinomial logistic regression classifier from the embedding vectors to the vertex classes, with half of the labels censored during training (to be predicted afterwards), and the normalized label loss kept at a ratio of 0.01 to that of the normalized edge logit loss. |
| Hardware Specification | No | We acknowledge computing resources from Columbia University's Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20RR030893-01, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010. |
| Software Dependencies | No | We then train each network using a constant step-size SGD method with a uniform vertex sampler for 40 epochs... We simultaneously train a multinomial logistic regression classifier from the embedding vectors to the vertex classes... |
| Experiment Setup | Yes | We then train each network using a constant step-size SGD method with a uniform vertex sampler for 40 epochs... We learn 128 dimensional embeddings of the networks using two sampling schemes... We simultaneously train a multinomial logistic regression classifier from the embedding vectors to the vertex classes, with half of the labels censored during training (to be predicted afterwards), and the normalized label loss kept at a ratio of 0.01 to that of the normalized edge logit loss. |
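The Krein inner product quoted in the Research Type row replaces the usual Euclidean inner product with an indefinite one of signature (p, q). A minimal sketch, assuming the embedding vectors are NumPy arrays of dimension p + q; the function name and array representation are illustrative, not from the paper:

```python
import numpy as np

def krein_inner(u, v, p, q):
    """Krein inner product <u, diag(I_p, -I_q) v> between two
    embedding vectors u, v in R^(p+q): the first p coordinates
    contribute positively, the last q negatively."""
    signature = np.concatenate([np.ones(p), -np.ones(q)])
    return float(np.sum(u * signature * v))
```

With p = q = 0 omitted appropriately this reduces to the ordinary dot product; the indefinite part lets the learned embeddings model heterophilous (dissimilarity-driven) structure that a positive-definite inner product cannot.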
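Algorithm 1 (uniform vertex sampling), quoted in the Pseudocode row, can be sketched as follows. The paper states the algorithm abstractly; the adjacency-dict graph representation and function name here are assumptions for illustration:

```python
import random

def uniform_vertex_sample(adj, k, rng=random):
    """Select k vertices uniformly without replacement from the graph
    `adj` (a dict mapping vertex -> set of neighbours) and return the
    induced subgraph on the sampled vertices."""
    sampled = set(rng.sample(sorted(adj), k))
    # Keep only edges whose both endpoints were sampled.
    return {v: adj[v] & sampled for v in sampled}
```

Sampling the induced subgraph (rather than, say, sampled edges alone) is what makes the subsampled loss an unbiased estimate of the full-graph loss under this scheme.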