Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Approximate Information Tests on Statistical Submanifolds

Authors: Michael W. Trosset, Carey E. Priebe

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Examples illustrate the efficacy of the proposed methodology. Keywords: Restricted Inference, Dimension Reduction, Information Geometry, Minimum Distance Test. Section 7 reports a small simulation study designed to explore the effect of sampling density on performance.
Researcher Affiliation | Academia | Michael W. Trosset, Department of Statistics, Indiana University, Bloomington, IN 47408, USA; Carey E. Priebe, Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218-2682, USA
Pseudocode | Yes | Figure 1: An approximate information test for the case of an unknown submodel that can be sampled. Steps 2–4 are essentially isomap (Tenenbaum et al., 2000), used here to represent the Riemannian structure of a statistical manifold rather than a data manifold. Details are provided in Section 6.
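The isomap-style construction described above (pairwise Hellinger distances, a symmetric K-nearest-neighbor graph, shortest-path distances) can be sketched as follows. This is a minimal illustration, not the authors' code: the `hellinger` helper, the vectorized Floyd–Warshall loop, and the choice K = 2 in the usage example are assumptions for demonstration (the paper's Example 4 uses K = 10).

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def geodesic_distances(points, K=3):
    """Isomap-style shortest-path distances over a symmetric K-NN graph.

    points: list of probability vectors sampled from the submanifold.
    Returns an (m, m) matrix of graph shortest-path distances, which
    approximate geodesic (Riemannian) distances on the submanifold.
    """
    m = len(points)
    D = np.array([[hellinger(p, q) for q in points] for p in points])
    # Keep edge i-j if j is among i's K nearest neighbors or vice versa.
    G = np.full((m, m), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(m):
        nbrs = np.argsort(D[i])[1:K + 1]  # skip self at index 0
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[nbrs, i]
    # Floyd-Warshall shortest paths (vectorized over rows/columns).
    for k in range(m):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    return G
```

For instance, binomial probability vectors (t, 1 - t) traced along a curve yield a connected chain graph whose shortest-path distances approximate arc length along the curve, whereas the raw Hellinger distances measure chords.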
Open Source Code | No | No explicit statement about the release of source code for the methodology described in this paper is found.
Open Datasets | No | The paper describes experiments based on statistical models (multinomial and trinomial distributions) and simulated data (e.g., 'o = (3, 5, 4, 6, 9, 2, 1)' in the Motivating Example, and generating 'τ_1, ..., τ_100 ∼ Uniform([0, π/2]²)' in Example 4), but does not use or provide concrete access information for a publicly available or open dataset.
Dataset Splits | No | The paper describes generating simulated random samples from hypothesized distributions for significance-probability estimation and power analysis (e.g., 'Estimate a significance probability by generating simulated random samples from the hypothesized distribution p.' in Figure 1, and 'Repeating this procedure on 10000 simulated samples of size n = 30 drawn from the null distribution...' in Example 4). It does not involve the predefined training/validation/test splits typical of machine learning contexts.
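The significance-probability step quoted from Figure 1 is a standard Monte Carlo test: simulate samples from the hypothesized distribution and count how often the simulated statistic is at least as extreme as the observed one. A minimal sketch, assuming a multinomial null and a generic test statistic `stat` (both illustrative, not the paper's exact statistic):

```python
import numpy as np

def mc_significance(observed_stat, p, n, stat, B=1000, rng=None):
    """Estimate P(T >= observed_stat) under the hypothesized multinomial
    distribution p by simulating B samples of size n."""
    rng = np.random.default_rng(rng)
    sims = rng.multinomial(n, p, size=B)           # B simulated count vectors
    stats = np.array([stat(o / n) for o in sims])  # statistic per sample
    return np.mean(stats >= observed_stat)
```

A natural choice of `stat` in this setting is a Hellinger distance from the empirical proportions o/n to the hypothesized model, but any scalar test statistic works.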
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for running the experiments.
Software Dependencies | No | The paper discusses various algorithms and methods (e.g., isomap, classical multidimensional scaling, majorization, Newton's method, the Floyd-Warshall algorithm) but does not provide specific software names with version numbers (e.g., Python 3.8, PyTorch 1.9) used for implementation.
Experiment Setup | Yes | Fix σ = ψ((π/4, arctan 2)), n = 30, and α = 0.05. For m = 25, 100, 400 and a = 1, ..., 5, generate τ_1, ..., τ_m ∼ Uniform([0, π/2]²). Compute σ_i = ψ(τ_i). Set B = 1000. 1. Construct a representation of the submanifold in ℝ². (a) Compute the pairwise Hellinger distances between σ, σ_1, ..., σ_m. Construct G by connecting vertices i and j if either vertex i is one of vertex j's K = 10 nearest neighbors or vice versa. (b) Compute the pairwise shortest-path distances in G. Embed the shortest-path distances in ℝ² by minimizing the raw stress criterion, obtaining z, z_1, ..., z_m. 2. Estimate the critical value. For b = 1, ..., B, draw o from a multinomial distribution with n trials and probability vector σ. (a) Compute the Hellinger distances between o/n and σ, σ_1, ..., σ_m and determine the ℓ = 3 nearest neighbors of o/n.
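Step 2 of the quoted setup, drawing B = 1000 multinomial samples and using the ℓ = 3 nearest neighbors in Hellinger distance, can be illustrated with a simplified critical-value estimator. This is a hedged sketch: the statistic below (mean Hellinger distance from o/n to its ℓ nearest reference points) is a stand-in for the paper's embedding-based statistic, and `sigma` and `ref_points` in the usage are invented values.

```python
import numpy as np

def hellinger_to_refs(q, refs):
    """Hellinger distances from probability vector q to each row of refs."""
    return np.sqrt(0.5 * np.sum((np.sqrt(q) - np.sqrt(refs)) ** 2, axis=1))

def estimate_critical_value(sigma, ref_points, n, alpha=0.05, B=1000,
                            ell=3, rng=None):
    """Simulate B multinomial samples of size n from sigma; for each sample,
    record the mean Hellinger distance from o/n to its ell nearest reference
    points, and return the empirical (1 - alpha) quantile of that statistic."""
    rng = np.random.default_rng(rng)
    refs = np.asarray(ref_points, dtype=float)
    stats = np.empty(B)
    for b in range(B):
        o = rng.multinomial(n, sigma)                   # simulated counts
        d = hellinger_to_refs(o / n, refs)              # distances to refs
        stats[b] = np.sort(d)[:ell].mean()              # ell nearest neighbors
    return np.quantile(stats, 1 - alpha)
```

Rejecting when the observed statistic exceeds this critical value gives an approximate level-α test under the simulated null, mirroring the Monte Carlo calibration in the quoted setup.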