Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Token Embeddings Violate the Manifold Hypothesis

Authors: Michael Robinson, Sourya Dey, Tony Chiang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present a novel statistical test assuming that the neighborhood around each token has a relatively flat and smooth structure as the null hypothesis. Failing to reject the null is uninformative, but rejecting it at a specific token ψ implies an irregularity in the token subspace in a ψ-neighborhood, B(ψ). ... By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the evidence suggests that the token subspace is not a fiber bundle and hence also not a manifold.
Researcher Affiliation	Collaboration	Michael Robinson Mathematics and Statistics American University Washington, DC, USA EMAIL Sourya Dey Galois, Inc. Arlington, VA, USA EMAIL Tony Chiang Department of Mathematics, University of Washington, Seattle, WA, USA EMAIL
Pseudocode	Yes	Algorithm 1: Manifold and fiber bundle tests. Require: x1, ..., xn ∈ Rℓ: coordinates for each point. Require: vmin and vmax: minimum and maximum number of tokens in neighborhood. Require: W: sliding window size. Require: α: significance level. Ensure: p1: set of p values for manifold hypothesis. Ensure: p2: set of p values for fiber bundle hypothesis. Ensure: Set of dimension estimates.
Open Source Code	Yes	Source code is available at Robinson [2025].
Open Datasets	Yes	To demonstrate our method, Figure 3(a)-(c) shows the application of the manifold test to several synthetic examples. In Section 5.2, we then provide empirical evidence that the token embedding function obtained from each of four different open source LLMs of moderate size, GPT2 (Radford et al. [2019]), Llemma7B (Azerbayev et al. [2024]), Mistral7B (Jiang et al. [2023]), and Pythia6.9B (Biderman et al. [2023]), cannot be a manifold due to non-constant local dimension as well as tokens (points) without an intrinsic dimension.
Dataset Splits	No	The paper analyzes pre-existing token embeddings from open-source LLMs (GPT2, Llemma7B, Mistral7B, and Pythia6.9B) but does not describe any specific training, test, or validation splits for its own experimental methodology. The method is applied directly to the full set of token embeddings.
Hardware Specification	Yes	Applying our entire method (starting with the token embedding matrix and ending with the p-values for each of the three tests at every token) to each model took approximately 12 hours of wall clock time (per model) on an Intel Core i7-3820 with 32 GB of CPU RAM and no GPU running at 3.60GHz.
Software Dependencies	No	The paper mentions using numpy.gradient() and scipy.spatial.distance_matrix() for parts of its algorithm, but it does not specify version numbers for these libraries or for the core programming language (e.g., Python).
Experiment Setup	Yes	For each LLM we studied, we chose two pairs of vmin and vmax parameters in Algorithm 1: one yielding a small radius neighborhood and one yielding a larger radius neighborhood. These parameters were chosen by inspecting a small simple random sample of tokens, with the aim of identifying where changes in dimension were likely to occur. See Section A.6 for details.
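
The Software Dependencies row names numpy.gradient() and scipy.spatial.distance_matrix(), and Algorithm 1 takes vmin and vmax as neighborhood sizes. The sketch below shows one way those two calls could combine into a local dimension estimate: count neighbors inside growing balls around a point, then take the slope of log(count) against log(radius). It is an illustration under stated assumptions, not the authors' implementation. The function name local_dimension_estimate, the stride subsampling, the median summary, and the synthetic plane data are all hypothetical, and no statistical test (p-values, significance level α) is performed here.

```python
import numpy as np
from scipy.spatial import distance_matrix


def local_dimension_estimate(points, index, v_min, v_max, stride):
    """Hypothetical helper: log-log slope of neighbor count vs. radius."""
    # Distances from the chosen point to every point in the cloud.
    dists = distance_matrix(points[index:index + 1], points)[0]
    # Sorted radii; position 0 is the zero self-distance, so drop it.
    radii = np.sort(dists)[1:]
    # The ball whose radius reaches the k-th neighbor contains k neighbors.
    counts = np.arange(1, radii.size + 1)
    # Keep neighbor counts from v_min to v_max, subsampled to reduce noise
    # (the subsampling is an assumption of this sketch).
    keep = slice(v_min - 1, v_max, stride)
    log_r = np.log(radii[keep])
    log_v = np.log(counts[keep])
    # Slope of log(count) against log(radius) on the non-uniform radius grid.
    slopes = np.gradient(log_v, log_r)
    return float(np.median(slopes)), slopes


# Synthetic check (not data from the paper): a 2-D plane embedded in 5-D space.
rng = np.random.default_rng(0)
plane = np.zeros((5000, 5))
plane[:, :2] = rng.uniform(-1.0, 1.0, size=(5000, 2))
plane[0] = 0.0  # query point at the center, away from the boundary

dimension_estimate, slopes = local_dimension_estimate(
    plane, index=0, v_min=50, v_max=1000, stride=50
)
print(f"estimated local dimension: {dimension_estimate:.2f}")
```

On a flat two-dimensional sheet the slopes should cluster near 2. A slope that changes markedly with radius is the kind of irregularity the Research Type row describes, although deciding whether such a change is significant requires the paper's test rather than this sketch. Duplicate points would give a zero radius and break the logarithm, so real embeddings may need deduplication first.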