Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Approximate Information Tests on Statistical Submanifolds

Authors: Michael W. Trosset, Carey E. Priebe

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Examples illustrate the efficacy of the proposed methodology. Keywords: Restricted Inference, Dimension Reduction, Information Geometry, Minimum Distance Test. Section 7 reports a small simulation study designed to explore the effect of sampling density on performance.
Researcher Affiliation | Academia | Michael W. Trosset, Department of Statistics, Indiana University, Bloomington, IN 47408, USA; Carey E. Priebe, Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218-2682, USA
Pseudocode | Yes | Figure 1: An approximate information test for the case of an unknown submodel that can be sampled. Steps 2–4 are essentially isomap (Tenenbaum et al., 2000), used here to represent the Riemannian structure of a statistical manifold rather than a data manifold. Details are provided in Section 6.
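The isomap-style construction described above (pairwise Hellinger distances, a symmetric K-nearest-neighbor graph, shortest-path distances) can be sketched as follows. This is a minimal illustration, not the authors' code: the `hellinger` helper, the vectorized Floyd–Warshall loop, and the choice K = 2 in the usage example are assumptions for demonstration (the paper's Example 4 uses K = 10).

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def geodesic_distances(points, K=3):
    """Isomap-style shortest-path distances over a symmetric K-NN graph.

    points: list of probability vectors sampled from the submanifold.
    Returns an (m, m) matrix of graph shortest-path distances, which
    approximate geodesic (Riemannian) distances on the submanifold.
    """
    m = len(points)
    D = np.array([[hellinger(p, q) for q in points] for p in points])
    # Keep edge i-j if j is among i's K nearest neighbors or vice versa.
    G = np.full((m, m), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(m):
        nbrs = np.argsort(D[i])[1:K + 1]  # skip self at index 0
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[nbrs, i]
    # Floyd-Warshall shortest paths (vectorized over rows/columns).
    for k in range(m):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    return G
```

For instance, binomial probability vectors (t, 1 - t) traced along a curve yield a connected chain graph whose shortest-path distances approximate arc length along the curve, whereas the raw Hellinger distances measure chords.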
Open Source Code | No | No explicit statement about the release of source code for the methodology described in this paper is found.
Open Datasets | No | The paper describes experiments based on statistical models (multinomial and trinomial distributions) and simulated data (e.g., 'o = (3, 5, 4, 6, 9, 2, 1)' in the Motivating Example, and generating 'τ_1, ..., τ_100 ∼ Uniform([0, π/2]²)' in Example 4), but does not use or provide concrete access information for a publicly available or open dataset.
Dataset Splits | No | The paper describes generating simulated random samples from hypothesized distributions for significance-probability estimation and power analysis (e.g., 'Estimate a significance probability by generating simulated random samples from the hypothesized distribution p.' in Figure 1, and 'Repeating this procedure on 10000 simulated samples of size n = 30 drawn from the null distribution...' in Example 4). It does not involve the predefined training/validation/test splits typical of machine learning contexts.
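The significance-probability step quoted from Figure 1 is a standard Monte Carlo test: simulate samples from the hypothesized distribution and count how often the simulated statistic is at least as extreme as the observed one. A minimal sketch, assuming a multinomial null and a generic test statistic `stat` (both illustrative, not the paper's exact statistic):

```python
import numpy as np

def mc_significance(observed_stat, p, n, stat, B=1000, rng=None):
    """Estimate P(T >= observed_stat) under the hypothesized multinomial
    distribution p by simulating B samples of size n."""
    rng = np.random.default_rng(rng)
    sims = rng.multinomial(n, p, size=B)           # B simulated count vectors
    stats = np.array([stat(o / n) for o in sims])  # statistic per sample
    return np.mean(stats >= observed_stat)
```

A natural choice of `stat` in this setting is a Hellinger distance from the empirical proportions o/n to the hypothesized model, but any scalar test statistic works.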
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for running the experiments.
Software Dependencies | No | The paper discusses various algorithms and methods (e.g., isomap, classical multidimensional scaling, majorization, Newton's method, the Floyd-Warshall algorithm) but does not provide specific software names with version numbers (e.g., Python 3.8, PyTorch 1.9) used for implementation.
Experiment Setup | Yes | Fix σ = ψ((π/4, arctan 2)), n = 30, and α = 0.05. For m = 25, 100, 400 and a = 1, ..., 5, generate τ_1, ..., τ_m ∼ Uniform([0, π/2]²). Compute σ_i = ψ(τ_i). Set B = 1000. 1. Construct a representation of the submanifold in ℝ². (a) Compute the pairwise Hellinger distances between σ, σ_1, ..., σ_m. Construct G by connecting vertices i and j if either vertex i is one of vertex j's K = 10 nearest neighbors or vice versa. (b) Compute the pairwise shortest-path distances in G. Embed the shortest-path distances in ℝ² by minimizing the raw stress criterion, obtaining z, z_1, ..., z_m. 2. Estimate the critical value. For b = 1, ..., B, draw o from a multinomial distribution with n trials and probability vector σ. (a) Compute the Hellinger distances between o/n and σ, σ_1, ..., σ_m and determine the ℓ = 3 nearest neighbors of o/n.
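Step 2 of the quoted setup, drawing B = 1000 multinomial samples and using the ℓ = 3 nearest neighbors in Hellinger distance, can be illustrated with a simplified critical-value estimator. This is a hedged sketch: the statistic below (mean Hellinger distance from o/n to its ℓ nearest reference points) is a stand-in for the paper's embedding-based statistic, and `sigma` and `ref_points` in the usage are invented values.

```python
import numpy as np

def hellinger_to_refs(q, refs):
    """Hellinger distances from probability vector q to each row of refs."""
    return np.sqrt(0.5 * np.sum((np.sqrt(q) - np.sqrt(refs)) ** 2, axis=1))

def estimate_critical_value(sigma, ref_points, n, alpha=0.05, B=1000,
                            ell=3, rng=None):
    """Simulate B multinomial samples of size n from sigma; for each sample,
    record the mean Hellinger distance from o/n to its ell nearest reference
    points, and return the empirical (1 - alpha) quantile of that statistic."""
    rng = np.random.default_rng(rng)
    refs = np.asarray(ref_points, dtype=float)
    stats = np.empty(B)
    for b in range(B):
        o = rng.multinomial(n, sigma)                   # simulated counts
        d = hellinger_to_refs(o / n, refs)              # distances to refs
        stats[b] = np.sort(d)[:ell].mean()              # ell nearest neighbors
    return np.quantile(stats, 1 - alpha)
```

Rejecting when the observed statistic exceeds this critical value gives an approximate level-α test under the simulated null, mirroring the Monte Carlo calibration in the quoted setup.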