Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Recurrent Self-Attention Dynamics: An Energy-Agnostic Perspective from Jacobians

Authors: Akiyoshi Tomihari, Ryo Karakida

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type Experimental In this work, we deepen the understanding of SA by extending energy-based analysis and employing a more general stability analysis from a dynamical systems perspective. First, we revisit the energy-based formulation and partially relax traditional architectural constraints, such as symmetric weights and single-head assumptions, to better approximate realistic SA settings (Section 4). These relaxed constraints provide insights into designing regularization methods, which we experimentally explore later in Section 6.2.
Researcher Affiliation Academia Akiyoshi Tomihari (1,2), Ryo Karakida (1,3). (1) Artificial Intelligence Research Center, AIST, Japan; (2) Department of Computer Science, The University of Tokyo, Japan; (3) RIKEN Center for Advanced Intelligence Project
Pseudocode No The paper describes mathematical formulations and updates (e.g., equations 1-9) but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' block with structured steps for a method.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code and data used in the experiments are not publicly available, and we do not plan to release them. As a result, the supplemental material does not contain instructions for reproducing the main experimental results.
Open Datasets Yes In our experiments, we used two Sudoku datasets: the SATNet [Wang et al., 2019] and RRN dataset [Palm et al., 2018]. The key differences between the two are that the RRN dataset is more difficult (with only 17–34 given digits compared to 31–42 in SATNet) and larger in size (198k samples vs. 10k samples). We also conducted experiments on the CIFAR-10 dataset [Krizhevsky et al.]. For the computation of the SA's Jacobian in Figure 1, we used the CCDV arXiv summarization dataset [Cohan et al., 2018]. To evaluate in a more realistic scenario, we conducted language modeling experiments on the BabyLM Challenge dataset (2023, 10M version) [Diehl Martinez et al., 2023].
Dataset Splits Yes Following Miyato et al. [2025], we used the SATNet dataset for training as in-distribution (ID) data and the RRN dataset as out-of-distribution (OOD) data. This setup allows us to evaluate the ability of models to generalize to more challenging settings.
Hardware Specification Yes All experiments were conducted on NVIDIA H200 GPUs, and we run experiments with 5 different random seeds.
Software Dependencies No We used the Adam optimizer [Kingma and Ba, 2015] and trained for 100 epochs with batch size 100. For all settings, we tuned the learning rate over $\{1 \times 10^{-6}, 5 \times 10^{-6}, \ldots, 1 \times 10^{-3}\}$ and, for regularization methods in Figure 5, the parameter λ over $\{1 \times 10^{-8}, 1 \times 10^{-7}, \ldots, 1 \times 10^{-1}\}$, selecting values based on OOD accuracy at iteration T = 16. The paper mentions the Adam optimizer and refers to an implementation from a previous work, but does not provide specific version numbers for any software libraries or programming languages used.
Experiment Setup Yes We used the Adam optimizer [Kingma and Ba, 2015] and trained for 100 epochs with batch size 100. For all settings, we tuned the learning rate over $\{1 \times 10^{-6}, 5 \times 10^{-6}, \ldots, 1 \times 10^{-3}\}$ and, for regularization methods in Figure 5, the parameter λ over $\{1 \times 10^{-8}, 1 \times 10^{-7}, \ldots, 1 \times 10^{-1}\}$, selecting values based on OOD accuracy at iteration T = 16. All experiments were conducted on NVIDIA H200 GPUs, and we run experiments with 5 different random seeds. Table S.1: Training and model configurations (values given as Sudoku / CIFAR-10). Hidden dimension D: 512 / 384. Number of heads H: 8 / 8. Initial value of η: 1.0 / 1.0. Batch size: 100 / 128. Number of epochs: 100 / 200.
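Since the paper's code is not released, the reported setup can only be approximated. The following is a minimal Python sketch of how the Table S.1 values and the tuning procedure might be written down as a configuration. All names (`TrainConfig`, `CONFIGS`, `one_five_grid`, `select_best`) are hypothetical. The interior points of both grids are an assumption, because the excerpt lists only the first entries and the endpoint. Averaging OOD accuracy over seeds before selecting is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TrainConfig:
    hidden_dim: int    # D in Table S.1
    num_heads: int     # H in Table S.1
    eta_init: float    # initial value of eta
    batch_size: int
    num_epochs: int


# Values copied from Table S.1 as quoted above.
CONFIGS = {
    "sudoku": TrainConfig(512, 8, 1.0, 100, 100),
    "cifar10": TrainConfig(384, 8, 1.0, 128, 200),
}

NUM_SEEDS = 5        # "5 different random seeds"
EVAL_ITERATION = 16  # hyperparameters selected at iteration T = 16


def one_five_grid(lo_exp, hi_exp):
    """Return 1e{lo}, 5e{lo}, 1e{lo+1}, ..., 1e{hi}.

    ASSUMPTION: the source shows only 1e-6, 5e-6, ..., 1e-3; the
    1-and-5-per-decade pattern for the elided entries is a guess.
    """
    grid = []
    for exp in range(lo_exp, hi_exp):
        grid.append(float(f"1e{exp}"))
        grid.append(float(f"5e{exp}"))
    grid.append(float(f"1e{hi_exp}"))
    return grid


LEARNING_RATES = one_five_grid(-6, -3)

# ASSUMPTION: one value per decade between the listed 1e-8 and 1e-1.
LAMBDAS = [float(f"1e{exp}") for exp in range(-8, 0)]


def select_best(candidates, ood_accuracy, num_seeds=NUM_SEEDS):
    """Pick the candidate with the highest mean OOD accuracy over seeds.

    `ood_accuracy(value, seed)` is a hypothetical user-supplied callback
    that trains a model and returns its OOD accuracy at EVAL_ITERATION.
    """
    def score(value):
        return mean([ood_accuracy(value, seed) for seed in range(num_seeds)])

    return max(candidates, key=score)
```

With a real training loop plugged in as `ood_accuracy`, `select_best(LEARNING_RATES, ood_accuracy)` would return the learning rate to report. The same helper works for `LAMBDAS` when tuning the regularization strength.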