Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Dynamics of Spontaneous Topic Changes in Next Token Prediction with Self-Attention
Authors: Mumin Jia, Jairo Diaz-Rodriguez
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we establish theoretical results under a simplified, single-layer self-attention model... Second, we empirically validate that the effect of input length or topic ambiguity persists in modern, state-of-the-art LLMs, underscoring a fundamental disparity between human cognition and AI behavior in the context of spontaneous topic changes. ... In Section 6 we empirically extend Theorem 4 to modern, deeper LLMs. |
| Researcher Affiliation | Academia | Mumin Jia Department of Mathematics and Statistics York University Toronto, Ontario M3J 1P3 EMAIL Jairo Diaz-Rodriguez Department of Mathematics and Statistics York University Toronto, Ontario M3J 1P3 EMAIL |
| Pseudocode | No | The paper describes algorithms and procedures using mathematical notation and prose, for instance, "$W(\tau+1) = W(\tau) - \eta \nabla \mathcal{L}(W(\tau))$. (Algo-GD)". However, it does not contain any explicitly labeled or structured pseudocode blocks or algorithm boxes. |
| Open Source Code | Yes | Code. The source code can be found on GitHub: https://github.com/muminjia/Dynamics-of Spontaneous-Topic-Changes |
| Open Datasets | No | Real dataset. We randomly select 100 arXiv papers published in March 2025 since the publicly disclosed knowledge cutoff dates for our study LLMs fall at the end of 2024 or earlier. This ensures that these models have not been trained on these data. |
| Dataset Splits | No | Generate training datasets $\text{DSET}_a$ and $\text{DSET}_b$ based on $\{G^{(k)}_{a,\text{theor}}\}_{k=1}^{K}$ and $\{G^{(k)}_{b,\text{theor}}\}_{k=1}^{K}$, respectively. For each input sequence in DSET, the sequence length $T_{\text{train}}$ is 4, which means $X = [x_1\ x_2\ \dots\ x_{T_{\text{train}}}]^\top \in \mathbb{R}^{T_{\text{train}} \times d}$ with $x_i$ from $E = [e_1, e_2, \dots, e_K]^\top$. ... To differentiate the input sequence length of the testing data from that of the training data, we introduce $T_{\text{test}}$. TPGs based on the training dataset $\text{DSET}_a$ are utilized to generate test datasets consisting of 100 sequences from $T_a$ per epoch. |
| Hardware Specification | No | In our experiments on LLMs, we query GPT-4o, Llama-3.3, Claude-3.7, and DeepSeek-V3 through API calls. All experiments were conducted on a standard laptop without specialized hardware. |
| Software Dependencies | No | We employ a single-layer attention mechanism implemented in PyTorch. The model is trained using the SGD optimizer with a learning rate η = 0.01 for 8000 iterations. ... prior to using the CVXPY package to get Wsvm, SCCs are identified for each TPG derived from the [dataset] using Tarjan's algorithm. |
| Experiment Setup | Yes | The model is trained using the SGD optimizer with a learning rate η = 0.01 for 8000 iterations. ... All LLMs are set with a temperature of 0 to match the greedy decoding in our theoretical framework. The maximum completion length was set to 1000 tokens to ensure that the generated continuations could complete the abstract. |
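
The Software Dependencies row quotes the paper as identifying strongly connected components (SCCs) in each TPG with Tarjan's algorithm. As an illustration only, the following is a minimal sketch of that textbook algorithm in plain Python. The function name `tarjan_scc`, the adjacency-dictionary input format, and the toy graph are assumptions made for this example; they are not taken from the paper or its repository, and the authors' implementation may differ.

```python
from typing import Dict, Hashable, List


def tarjan_scc(graph: Dict[Hashable, List[Hashable]]) -> List[List[Hashable]]:
    """Return the strongly connected components of a directed graph.

    `graph` maps each node to the list of its successors (hypothetical format).
    """
    index = {}        # discovery order of each visited node
    lowlink = {}      # smallest discovery index known to be reachable
    stack = []        # nodes of the components still being explored
    on_stack = set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, []):
            if succ not in index:
                # Tree edge: explore the successor, then inherit its lowlink.
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                # Edge back into the current DFS stack: part of a cycle.
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # `node` is the root of a component: pop its members.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


# Toy graph with hypothetical token indices: 1 and 2 point at each other,
# while 3 and 4 are not on any cycle.
toy_graph = {1: [2], 2: [1, 3], 3: [4], 4: []}
sccs = tarjan_scc(toy_graph)
```

In the toy graph, nodes 1 and 2 form one component and nodes 3 and 4 are singleton components. The sketch is recursive, so very deep graphs could exceed Python's default recursion limit; an iterative variant would avoid that.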