Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Dynamics of Spontaneous Topic Changes in Next Token Prediction with Self-Attention

Authors: Mumin Jia, Jairo Diaz-Rodriguez

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	First, we establish theoretical results under a simplified, single-layer self-attention model... Second, we empirically validate that the effect of input length or topic ambiguity persists in modern, state-of-the-art LLMs, underscoring a fundamental disparity between human cognition and AI behavior in the context of spontaneous topic changes. ... In Section 6 we empirically extend Theorem 4 to modern, deeper LLMs.
Researcher Affiliation	Academia	Mumin Jia Department of Mathematics and Statistics York University Toronto, Ontario M3J 1P3 EMAIL Jairo Diaz-Rodriguez Department of Mathematics and Statistics York University Toronto, Ontario M3J 1P3 EMAIL
Pseudocode	No	The paper describes algorithms and procedures using mathematical notation and prose, for instance, "W(τ+1) = W(τ) − η∇L(W(τ)). (Algo-GD)". However, it does not contain any explicitly labeled or structured pseudocode blocks or algorithm boxes.
Open Source Code	Yes	Code. The source code can be found on GitHub: https://github.com/muminjia/Dynamics-of Spontaneous-Topic-Changes
Open Datasets	No	Real dataset. We randomly select 100 arXiv papers published in March 2025 since the publicly disclosed knowledge cutoff dates for our study LLMs fall at the end of 2024 or earlier. This ensures that these models have not been trained on these data.
Dataset Splits	No	Generate training datasets $\mathrm{DSET}_a$ and $\mathrm{DSET}_b$ based on $\{G^{(k)}_{a,\mathrm{theor}}\}_{k=1}^{K}$ and $\{G^{(k)}_{b,\mathrm{theor}}\}_{k=1}^{K}$, respectively. For each input sequence in DSET, the sequence length $T_{\mathrm{train}}$ is 4, which means $X = [x_1 \ x_2 \ \cdots \ x_{T_{\mathrm{train}}}]^\top \in \mathbb{R}^{T_{\mathrm{train}} \times d}$ with $x_i$ from $E = [e_1, e_2, \ldots, e_K]$. ... To differentiate the input sequence length of the testing data from that of the training data, we introduce $T_{\mathrm{test}}$. TPGs based on the training dataset $\mathrm{DSET}_a$ are utilized to generate test datasets consisting of 100 sequences from $T_a$ per epoch.
Hardware Specification	No	In our experiments on LLMs, we query GPT-4o, Llama-3.3, Claude-3.7, and DeepSeek-V3 through API calls. All experiments were conducted on a standard laptop without specialized hardware.
Software Dependencies	No	We employ a single-layer attention mechanism implemented in PyTorch. The model is trained using the SGD optimizer with a learning rate η = 0.01 for 8000 iterations. ... prior to using the CVXPY package to get Wsvm, SCCs are identified for each TPG derived from the using Tarjan's algorithm.
Experiment Setup	Yes	The model is trained using the SGD optimizer with a learning rate η = 0.01 for 8000 iterations. ... All LLMs are set with a temperature of 0 to match the greedy decoding in our theoretical framework. The maximum completion length was set to 1000 tokens to ensure that the generated continuations could complete the abstract.
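
Illustration (not from the paper): the update rule quoted under Pseudocode, W(τ+1) = W(τ) − η∇L(W(τ)), combined with the reported η = 0.01 and 8000 iterations, can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions: the quadratic loss, the `TARGET` vector, and the function names are hypothetical stand-ins, and the loop is deterministic full-batch gradient descent rather than the SGD-trained PyTorch attention model the authors describe.

```python
# Illustrative only: a toy quadratic loss stands in for the paper's objective.

def grad_descent(w0, grad, lr=0.01, steps=8000):
    """Apply W(t+1) = W(t) - lr * grad(W(t)) for a fixed number of steps."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        # One gradient-descent step, applied coordinate-wise.
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical loss L(w) = 0.5 * sum((w_i - target_i)^2); its gradient is w - target.
TARGET = [1.0, -2.0]

def toy_grad(w):
    return [wi - ti for wi, ti in zip(w, TARGET)]

# Defaults mirror the reported learning rate and iteration count.
w_final = grad_descent([0.0, 0.0], toy_grad)
```

On this toy loss each step shrinks the distance to `TARGET` by a factor of 0.99, so after 8000 steps `w_final` agrees with `TARGET` to floating-point precision. Nothing here says anything about convergence of the paper's actual model.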