Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Correlation Dimension of Autoregressive Large Language Models
Authors: Xin Du, Kumiko Tanaka-Ishii
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. |
| Researcher Affiliation | Academia | Xin Du Waseda University EMAIL Kumiko Tanaka-Ishii Waseda University EMAIL |
| Pseudocode | Yes | Algorithm 1 Fused Blockwise Distance-and-Count for Correlation Integral |
| Open Source Code | Yes | We use primarily open-source datasets and models, and we provide sufficient information in the appendices for reproducing the experiments. |
| Open Datasets | Yes | Figure 2 illustrates correlation dimension for various pre-trained LLMs across the Stanford Encyclopedia of Philosophy (SEP) [55] dataset which is summarized in Appendix C. For each language, we selected 10 books from Project Gutenberg. We used an unlimited context length and measured the correlation dimension on the first 10,000 tokens of each book. |
| Dataset Splits | Yes | When measuring the correlation dimension, we truncated each article to the first 20,000 tokens. ... For each language, we selected 10 books from Project Gutenberg. ... measured the correlation dimension on the first 10,000 tokens of each book. |
| Hardware Specification | No | Resources required are presented in Appendix A. Appendix A talks about computational cost, GPU kernel fusion, and Table 6 provides runtime comparisons for "torch.pdist" and "torch.cdist" but does not specify the actual hardware (e.g., GPU model, CPU model, memory amount) used for these measurements. |
| Software Dependencies | No | The paper mentions "PyTorch implementations" in Table 6. It also mentions "CUDA kernel" and "GPTQ [18] and AWQ [34]-quantized models". However, specific version numbers for PyTorch, CUDA, or other libraries are not given. |
| Experiment Setup | Yes | Varying the context length parameter c in Eq. (3) imposes restrictions on the model's available context, directly influencing its complexity perception. A context length of c = 1 reduces the model effectively to a bigram approximation, while longer contexts progressively enable deeper linguistic comprehension. ... All models were evaluated with a temperature of 1.0. |
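
The Pseudocode row refers to Algorithm 1, a fused blockwise distance-and-count routine for the correlation integral. As a rough illustration of what that kind of computation involves, the pure-Python sketch below counts point pairs closer than a threshold, one block of pairs at a time, and then fits a log-log slope in the style of the standard Grassberger-Procaccia estimate. It is not the paper's fused GPU kernel: the function names, the `block` parameter, the use of Euclidean distance, and the toy input are all assumptions made for illustration.

```python
import math


def correlation_integral(points, eps, block=256):
    """Fraction of distinct pairs (i < j) whose Euclidean distance is below eps.

    Pairs are visited block by block, so no full distance matrix is stored.
    (Hypothetical helper; Euclidean distance is an assumption.)
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    count = 0
    for i0 in range(0, n, block):
        rows = points[i0:i0 + block]
        # Only blocks on or above the diagonal are needed (distance is symmetric).
        for j0 in range(i0, n, block):
            cols = points[j0:j0 + block]
            for a, p in enumerate(rows):
                # Inside a diagonal block, skip j <= i to avoid double counting.
                start = a + 1 if j0 == i0 else 0
                for q in cols[start:]:
                    if math.dist(p, q) < eps:
                        count += 1
    return 2.0 * count / (n * (n - 1))


def correlation_dimension(points, eps_values, block=256):
    """Least-squares slope of log C(eps) against log eps.

    (Hypothetical helper; thresholds with C(eps) == 0 are skipped.)
    """
    xs, ys = [], []
    for eps in eps_values:
        c = correlation_integral(points, eps, block=block)
        if c > 0:
            xs.append(math.log(eps))
            ys.append(math.log(c))
    if len(xs) < 2:
        raise ValueError("need at least two thresholds with nonzero counts")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Toy input (an assumption, not data from the paper): evenly spaced points
# on a line segment, which should give a dimension close to 1.
line_points = [(i / 399, 0.0) for i in range(400)]
estimate = correlation_dimension(line_points, [0.02, 0.04, 0.08, 0.16])
```

Processing pairs in blocks keeps memory bounded by the block size rather than by the number of points, which is the general motivation for blockwise counting; the runtime figures reported in the paper's Table 6 concern its own implementations, not this sketch.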