Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Asymptotics of Network Embeddings Learned via Subsampling
Authors: Andrew Davison, Morgane Austern
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments We perform experiments on both simulated and real data, illustrating the validity of our theoretical results. We also highlight that the use of a Krein inner product ⟨ω, diag(I_p, −I_q) ω′⟩ between embedding vectors can lead to improved performance when using the learned embeddings for downstream tasks. To do so, we will consider a semi-supervised multi-label node classification task on two different data sets: a protein-protein interaction network (Grover and Leskovec, 2016; Breitkreutz et al., 2008) with 3,890 vertices, 76,583 edges and 50 classes; and the Blog Catalog data set (Tang and Liu, 2009) with 10,312 vertices, 333,983 edges and 39 classes. |
| Researcher Affiliation | Academia | Andrew Davison, Department of Statistics, Columbia University, New York, NY 10027-5927, USA; Morgane Austern, Department of Statistics, Harvard University, Cambridge, MA 02138-2901, USA |
| Pseudocode | Yes | Algorithm 1 (Uniform vertex sampling) Given a graph Gn and number of samples k, we select k vertices from Gn uniformly and without replacement, and then return S(Gn) as the induced subgraph using these sampled vertices. ... Algorithm 4 (Random walk sampling with unigram negative sampling) Given a graph Gn, a walk length k, number of negative samples l per positively sampled vertex, unigram parameter α and an initial distribution π0(· \| Gn), we ... |
| Open Source Code | Yes | 1. Code is available at https://github.com/aday651/embed-asym-experiments. |
| Open Datasets | Yes | a protein-protein interaction network (Grover and Leskovec, 2016; Breitkreutz et al., 2008) with 3,890 vertices, 76,583 edges and 50 classes; and the Blog Catalog data set (Tang and Liu, 2009) with 10,312 vertices, 333,983 edges and 39 classes. |
| Dataset Splits | Yes | We simultaneously train a multinomial logistic regression classifier from the embedding vectors to the vertex classes, with half of the labels censored during training (to be predicted afterwards), and the normalized label loss kept at a ratio of 0.01 to that of the normalized edge logit loss. |
| Hardware Specification | No | We acknowledge computing resources from Columbia University's Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20RR030893-01, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010. |
| Software Dependencies | No | We then train each network using a constant step-size SGD method with a uniform vertex sampler for 40 epochs... We simultaneously train a multinomial logistic regression classifier from the embedding vectors to the vertex classes... |
| Experiment Setup | Yes | We then train each network using a constant step-size SGD method with a uniform vertex sampler for 40 epochs... We learn 128 dimensional embeddings of the networks using two sampling schemes... We simultaneously train a multinomial logistic regression classifier from the embedding vectors to the vertex classes, with half of the labels censored during training (to be predicted afterwards), and the normalized label loss kept at a ratio of 0.01 to that of the normalized edge logit loss. |
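The Krein inner product quoted in the Research Type row replaces the usual Euclidean inner product with an indefinite one of signature (p, q). A minimal sketch, assuming the embedding vectors are NumPy arrays of dimension p + q; the function name and array representation are illustrative, not from the paper:

```python
import numpy as np

def krein_inner(u, v, p, q):
    """Krein inner product <u, diag(I_p, -I_q) v> between two
    embedding vectors u, v in R^(p+q): the first p coordinates
    contribute positively, the last q negatively."""
    signature = np.concatenate([np.ones(p), -np.ones(q)])
    return float(np.sum(u * signature * v))
```

With p = q = 0 omitted appropriately this reduces to the ordinary dot product; the indefinite part lets the learned embeddings model heterophilous (dissimilarity-driven) structure that a positive-definite inner product cannot.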
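Algorithm 1 (uniform vertex sampling), quoted in the Pseudocode row, can be sketched as follows. The paper states the algorithm abstractly; the adjacency-dict graph representation and function name here are assumptions for illustration:

```python
import random

def uniform_vertex_sample(adj, k, rng=random):
    """Select k vertices uniformly without replacement from the graph
    `adj` (a dict mapping vertex -> set of neighbours) and return the
    induced subgraph on the sampled vertices."""
    sampled = set(rng.sample(sorted(adj), k))
    # Keep only edges whose both endpoints were sampled.
    return {v: adj[v] & sampled for v in sampled}
```

Sampling the induced subgraph (rather than, say, sampled edges alone) is what makes the subsampled loss an unbiased estimate of the full-graph loss under this scheme.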