Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding

Authors: Duy-Tung Pham, An Nguyen The, Viet-Hoang Tran, Nhan-Phu Chung, Xin Tong, Tan M. Nguyen, Thieu N. Vo

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. To verify our findings, we conducted language modeling experiments on WikiText-103 [34] and EnWik8 [20], and object recognition on ImageNet-1K [11]. |
| Researcher Affiliation | Collaboration | Duy-Tung Pham (FPT Software AI Center, Hanoi, Vietnam); An Nguyen The (FPT Software AI Center, Hanoi, Vietnam); Viet-Hoang Tran (Department of Mathematics, National University of Singapore); Nhan-Phu Chung (University of Economics Ho Chi Minh City, Ho Chi Minh City, Vietnam); Xin T. Tong (Department of Mathematics, National University of Singapore); Tan M. Nguyen (Department of Mathematics, National University of Singapore); Thieu N. Vo (Department of Mathematics, National University of Singapore) |
| Pseudocode | No | The paper discusses various dynamical systems and provides mathematical equations (e.g., equations 2, 4, 5, 8, 9, 14, 15, 16, 17, 18, 19) and proofs in the appendix. However, it does not contain structured pseudocode or algorithm blocks with step-by-step instructions typically found in pseudocode. |
| Open Source Code | Yes | We also provide the code to reproduce the results in the paper, which can be found in the supplemental material. |
| Open Datasets | Yes | To verify our findings, we conducted language modeling experiments on WikiText-103 [34] and EnWik8 [20], and object recognition on ImageNet-1K [11]. |
| Dataset Splits | Yes | WikiText-103. [34] ... Its training set includes around 28,000 articles, totaling 103 million tokens. ... The validation and test sets each consist of 60 articles, containing 218,000 and 246,000 tokens, respectively. EnWik8. [20] ... The standard split provides 90 million bytes for training and 5 million for testing. ImageNet-1K. [11] This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images. |
| Hardware Specification | Yes | Training utilizes 2 NVIDIA A100 SXM4 80GB GPUs with a total batch size of 60. ... Training was performed on a NVIDIA A100 SXM4 80GB GPU. ... The model was trained with a batch size of 64 across 4 NVIDIA A100 SXM4 80GB GPUs using mixed precision. |
| Software Dependencies | No | We utilize the Transformer-XL [9] (https://github.com/kimiyoung/transformer-xl) architecture... We trained an autoregressive Transformer model on the Enwik8 dataset using the xtransformers (https://github.com/lucidrains/x-transformers) library. ... We adopted the DeiT [50] (https://github.com/facebookresearch/deit) architecture... While specific software packages (Transformer-XL, xtransformers, DeiT) and their GitHub links are provided, explicit version numbers for these or other crucial software dependencies (like Python, PyTorch, CUDA versions) are not mentioned. |
| Experiment Setup | Yes | Training is conducted using the Adam optimizer with a learning rate of 0.00025. A linear warmup is applied for the first 1,000 steps, followed by a cosine annealing schedule over a total of 200,000 training steps. The model is trained with a target sequence length of 150 tokens and no memory length, effectively disabling the segment-level recurrence mechanism. ... We used the Adam optimizer with a learning rate of 10^-4, and applied gradient clipping with a maximum norm of 0.5. ... The model was trained on the full ImageNet-1k training set for 300 epochs using the AdamW optimizer with a base learning rate of 5 × 10^-4, and weight decay of 0.05. ... A stochastic depth rate (drop path) of 0.1 were used for regularization. |
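
The Experiment Setup entry describes the first learning-rate schedule only in words: linear warmup for 1,000 steps, then cosine annealing, 200,000 steps in total, base rate 0.00025. A minimal sketch of one way to read that description follows. It is not the authors' code. The function name `learning_rate`, the warmup starting near zero, the decay floor `min_lr=0.0`, and the choice to spread the cosine phase over the steps remaining after warmup are all assumptions, since the quoted text does not specify them.

```python
import math


def learning_rate(step, base_lr=2.5e-4, warmup_steps=1_000,
                  total_steps=200_000, min_lr=0.0):
    """Hypothetical schedule: linear warmup, then cosine annealing.

    base_lr, warmup_steps and total_steps come from the quoted setup;
    min_lr and the exact warmup shape are assumptions.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 500, 1_000, 100_500, 200_000):
        print(s, learning_rate(s))
```

Under these assumptions the rate peaks at the end of warmup and falls to half the base rate midway through the cosine phase. A different floor or warmup start would change the values at both ends.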