Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Statistical Inference on Random Dot Product Graphs: a Survey

Authors: Avanti Athreya, Donniell E. Fishkind, Minh Tang, Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, Keith Levin, Vince Lyzinski, Yichen Qin, Daniel L Sussman

JMLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. ... In Section 6, we discuss applications of these results to real data. ... Figure 1: Plot of the estimated latent positions in a two-block stochastic blockmodel graph on n vertices. ... Figure 2: Comparison of classification error for Gaussian mixture model (red curve), K-means (green curve), and Bayes classifier (cyan curve). ... Figure 5: Mean squared error (MSE) in recovery of latent positions (up to rotation) in a 2-graph joint RDPG model as a function of the number of vertices. ... Figure 6: Power of the ASE-based (blue) and omnibus-based (green) tests to detect when the two graphs being tested differ in (a) one, (b) five, and (c) ten of their latent positions. ... Figure 7: Matrix of p-values (uncorrected) for testing the hypothesis H0: X = WY for the (42 × 41)/2 pairs of graphs generated from the KKI test-retest dataset (Landman et al., 2011). ... Figure 9: Heat map depiction of the level-one Friendster estimated dissimilarity matrix S ∈ R^(15×15). ... Figure 11: Illustration of the larval Drosophila mushroom body connectome as a directed graph on four neuron types. ... Figure 12: Observed data for our MB connectome as a directed adjacency matrix on four neuron types with 213 vertices. ... Figure 13: Plot of the clustered embedding of our MB connectome in the Out1 vs. Out2 dimensions. ... Figure 14: Model selection: embedding dimension d̂ = 6. ... Figure 15: Model selection: mixture complexity K̂ = 6 is chosen by BIC. ... Figure 16: The multiple clusters for the KC neurons are capturing neuron age. ... Figure 19: Relationship between number of claws and distance δ_i (a proxy for age) for the KC neurons, from Eichler et al. (2017). ... Figure 20: Projection of KC neurons onto the quadratic curve C_KC, yielding projection point t_i for each neuron. ... Figure 21: The correlation between the projection points t_i on the quadratic curve C_KC and distance δ_i (a proxy for age) for the KC neurons is highly significant...
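The quoted figures all revolve around adjacency spectral embedding (ASE) of a random dot product graph, as in the paper's Figure 1 (estimated latent positions of a two-block SBM). As a minimal sketch of that idea, not the authors' code, here is an ASE of a simulated two-block stochastic blockmodel:

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: rows of U_d |S_d|^(1/2),
    using the d eigenpairs of largest magnitude."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

rng = np.random.default_rng(0)
n = 200
z = np.repeat([0, 1], n // 2)            # two equal-size blocks
B = np.array([[0.5, 0.2],
              [0.2, 0.5]])               # block connection probabilities
P = B[z][:, z]                           # edge-probability matrix
A = rng.binomial(1, P)
A = np.triu(A, 1)
A = A + A.T                              # symmetric, hollow adjacency
X_hat = ase(A, d=2)                      # estimated latent positions
```

Up to an orthogonal transformation, the rows of `X_hat` concentrate around the two true latent positions; this concentration is what makes the downstream Gaussian-mixture clustering of the embedding (compare the paper's Figures 1 and 2) work.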
Researcher Affiliation Academia Avanti Athreya EMAIL, Donniell E. Fishkind EMAIL, Minh Tang EMAIL, Carey E. Priebe EMAIL (Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA); Youngser Park EMAIL (Center for Imaging Science, Johns Hopkins University, Baltimore, MD, 21218, USA); Joshua T. Vogelstein EMAIL (Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA); Keith Levin EMAIL (Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, USA); Vince Lyzinski EMAIL (Department of Mathematics and Statistics, University of Massachusetts, Amherst, MA, 01003-9305, USA); Yichen Qin EMAIL (Department of Operations, Business Analytics, and Information Systems, College of Business, University of Cincinnati, Cincinnati, OH, 45221-0211, USA); Daniel L. Sussman EMAIL (Department of Mathematics and Statistics, Boston University, Boston, MA, 02215, USA)
Pseudocode Yes Algorithm 1 Bootstrapping procedure for the test H0: X = WY. ... Algorithm 2 Detecting hierarchical structure for graphs
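The bootstrapping procedure for the two-graph test H0: X = WY (equal latent positions up to an orthogonal transformation W) can be sketched roughly as below. This is an illustrative reconstruction, not the authors' Algorithm 1: the embedding and the Procrustes-aligned statistic are simplified, and the names are my own.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding via the top-|magnitude| eigenpairs."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def procrustes_stat(X, Y):
    """min_W ||XW - Y||_F over orthogonal W (orthogonal Procrustes via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return np.linalg.norm(X @ (U @ Vt) - Y)

def sample_graph(P, rng):
    """Draw a symmetric, hollow adjacency matrix with edge probabilities P."""
    A = rng.binomial(1, P)
    A = np.triu(A, 1)
    return A + A.T

def bootstrap_test(A1, A2, d, n_boot=50, seed=0):
    """Approximate p-value for H0: X = WY by resampling graph pairs
    from the plug-in edge-probability estimate of graph 1."""
    rng = np.random.default_rng(seed)
    X1, X2 = ase(A1, d), ase(A2, d)
    observed = procrustes_stat(X1, X2)
    P_hat = np.clip(X1 @ X1.T, 0.0, 1.0)
    null = [procrustes_stat(ase(sample_graph(P_hat, rng), d),
                            ase(sample_graph(P_hat, rng), d))
            for _ in range(n_boot)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_boot)
```

For two graphs drawn from the same latent positions, the observed statistic falls inside the bootstrap null distribution and the p-value is not small; when the latent positions differ, the statistic tends to exceed the null draws.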
Open Source Code Yes Data and code for all our analyses are available at http://www.cis.jhu.edu/~parky/MBstructure.html.
Open Datasets Yes We consider neural imaging graphs obtained from the test-retest diffusion MRI and magnetization-prepared rapid acquisition gradient echo (MPRAGE) data of Landman et al. (2011). ... The Friendster social network contains roughly 60 million users and 2 billion connections/edges. ... Our MB connectome was obtained via serial section transmission electron microscopy of an entire larval Drosophila nervous system (Ohyama et al., 2015; Schneider-Mizell et al., 2016). This connectome contains the entirety of MB intrinsic neurons, called Kenyon cells, and all of their pre- and post-synaptic partners (Eichler et al., 2017). ... (This data matrix is available at http://www.cis.jhu.edu/~parky/MBstructure.html.)
Dataset Splits No The paper mentions analyzing the 'test-retest' data, 'Friendster social network', and 'Drosophila connectome'. While these are datasets, the paper does not specify any training/test/validation splits, percentages, or methodologies for partitioning the data for experimental reproduction.
Hardware Specification No The paper does not provide specific hardware details such as CPU or GPU models, or memory amounts used for running its experiments. It mentions using 'Flash Graph' for scalability but does not specify the underlying hardware.
Software Dependencies No The paper mentions using 'MCLUST implementation of (Fraley and Raftery, 1999)', the 'Flash Graph (Zheng et al., 2015)', and 'MCLUST algorithm of Fraley and Raftery (2002), as implemented in R'. However, specific version numbers for these software packages or R itself are not provided.
Experiment Setup Yes There are two model selection problems inherent in spectral clustering in general, and in obtaining our clustered embedding (Figure 13) in particular: choice of embedding dimension (d̂), and choice of mixture complexity (K̂). ... Using the profile likelihood SVT method of Zhu and Ghodsi (2006) yields a cut-off at three singular values, as depicted in Figure 14. ... results in d̂ = 6. Similarly, a ubiquitous and principled method for choosing the number of clusters in, for example, Gaussian mixture models, is to maximize a fitness criterion penalized by model complexity. Common approaches include the Akaike Information Criterion (AIC) (Akaike, 1974), Bayesian Information Criterion (BIC) (Schwarz, 1978), and Minimum Description Length (MDL) (Rissanen, 1978), to name a few. ... The MCLUST algorithm of Fraley and Raftery (2002), as implemented in R, and its associated BIC applied to our MB connectome embedded via ASE into R^(d̂=6), is maximized at six clusters, as depicted in Figure 15, and hence K̂ = 6. ... Figure 16 and Table 6 use this additional neuronal information to show that the multiple clusters for the KC neurons are capturing neuron age and in a seemingly coherent geometry. ... we fit a continuous curve to (the KC subset of) the data in latent space and show that traversal of this curve corresponds monotonically to neuron age. To make this precise, we begin with a directed stochastic block model: ... testing the null hypothesis of linear against the alternative of quadratic yields clear rejection (p < 0.001), while there is insufficient evidence to favor the cubic alternative H_A over the quadratic H0 (p ≈ 0.1).
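The Zhu and Ghodsi (2006) profile-likelihood method used to pick the embedding dimension can be sketched as follows. This is an illustrative reimplementation under the usual simplifying assumption (the ordered singular values split into two Gaussian groups sharing a common variance), not the authors' code:

```python
import numpy as np

def profile_likelihood_elbow(svals):
    """Zhu-Ghodsi elbow: split the ordered singular values into two groups
    modeled as Gaussians with a shared variance, and return the split point
    (candidate embedding dimension) maximizing the profile log-likelihood."""
    s = np.sort(np.asarray(svals, dtype=float))[::-1]
    n = len(s)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        s1, s2 = s[:q], s[q:]
        # pooled maximum-likelihood variance across the two groups
        var = (np.sum((s1 - s1.mean()) ** 2) +
               np.sum((s2 - s2.mean()) ** 2)) / n
        if var <= 0:
            continue
        ll = -0.5 * n * (np.log(2 * np.pi * var) + 1.0)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q
```

On a spectrum with three dominant singular values followed by a flat tail, this returns 3, matching the cut-off the paper reports for the MB connectome; mixture complexity K̂ is then chosen separately by maximizing BIC over Gaussian-mixture fits (MCLUST in R in the paper).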