Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Correlation Dimension of Autoregressive Large Language Models

Authors: Xin Du, Kumiko Tanaka-Ishii

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text.
Researcher Affiliation	Academia	Xin Du Waseda University EMAIL Kumiko Tanaka-Ishii Waseda University EMAIL
Pseudocode	Yes	Algorithm 1 Fused Blockwise Distance-and-Count for Correlation Integral
Open Source Code	Yes	We use primarily open-source datasets and models, and we provide sufficient information in the appendices for reproducing the experiments.
Open Datasets	Yes	Figure 2 illustrates correlation dimension for various pre-trained LLMs across the Stanford Encyclopedia of Philosophy (SEP) [55] dataset which is summarized in Appendix C. For each language, we selected 10 books from Project Gutenberg. We used an unlimited context length and measured the correlation dimension on the first 10,000 tokens of each book.
Dataset Splits	Yes	When measuring the correlation dimension, we truncated each article to the first 20,000 tokens. ... For each language, we selected 10 books from Project Gutenberg. ... measured the correlation dimension on the first 10,000 tokens of each book.
Hardware Specification	No	Resources required are presented in Appendix A. Appendix A talks about computational cost, GPU kernel fusion, and Table 6 provides runtime comparisons for "torch.pdist" and "torch.cdist" but does not specify the actual hardware (e.g., GPU model, CPU model, memory amount) used for these measurements.
Software Dependencies	No	The paper mentions "PyTorch implementations" in Table 6. It also mentions "CUDA kernel" and "GPTQ [18] and AWQ [34]-quantized models". However, specific version numbers for PyTorch, CUDA, or other libraries are not given.
Experiment Setup	Yes	Varying the context length parameter c in Eq. (3) imposes restrictions on the model's available context, directly influencing its complexity perception. A context length of c = 1 reduces the model effectively to a bigram approximation, while longer contexts progressively enable deeper linguistic comprehension. ... All models were evaluated with a temperature of 1.0.
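
The Pseudocode row names Algorithm 1 (Fused Blockwise Distance-and-Count for Correlation Integral) but does not reproduce it. As a rough illustration of the general idea only, the sketch below counts point pairs closer than each radius one block at a time, so the full pairwise distance matrix is never stored, and then fits the slope of log C(r) against log r. It is a CPU sketch in NumPy, not the paper's GPU kernel. The Euclidean metric, the least-squares slope fit, the block size, and the names `correlation_integral` and `correlation_dimension` are all assumptions made for illustration.

```python
import numpy as np


def correlation_integral(points, radii, block_size=256):
    """Fraction of distinct pairs (i < j) with Euclidean distance below each radius.

    Assumption: Euclidean distance; the paper's metric may differ.
    """
    points = np.asarray(points, dtype=np.float64)
    radii = np.asarray(radii, dtype=np.float64)
    n = points.shape[0]
    counts = np.zeros(radii.shape[0], dtype=np.int64)
    sq_norms = np.einsum("ij,ij->i", points, points)

    for i0 in range(0, n, block_size):
        i1 = min(i0 + block_size, n)
        a = points[i0:i1]
        # Only visit blocks on or above the diagonal so each pair is counted once.
        for j0 in range(i0, n, block_size):
            j1 = min(j0 + block_size, n)
            b = points[j0:j1]
            # Squared distances for this block only: |a|^2 + |b|^2 - 2 a.b
            d2 = sq_norms[i0:i1, None] + sq_norms[None, j0:j1] - 2.0 * (a @ b.T)
            np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives from rounding
            dist = np.sqrt(d2)
            if i0 == j0:
                # Diagonal block: keep the strict upper triangle (i < j).
                dist = dist[np.triu_indices(i1 - i0, k=1)]
            else:
                dist = dist.ravel()
            # Count immediately and discard the block's distances.
            counts += (dist[:, None] < radii[None, :]).sum(axis=0)

    total_pairs = n * (n - 1) // 2
    return counts / total_pairs


def correlation_dimension(points, radii, block_size=256):
    """Least-squares slope of log C(r) versus log r over radii with C(r) > 0."""
    radii = np.asarray(radii, dtype=np.float64)
    c = correlation_integral(points, radii, block_size)
    mask = c > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope
```

Calling `correlation_dimension(vectors, radii)` on an array of shape (n_points, n_features) returns a single slope estimate. The result depends on the chosen radius range, so the radii should lie in a region where log C(r) is roughly linear in log r.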