Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Authors: Gabriel Mongaras, Eric C. Larson
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our hypothetical softmax alternative, we train multiple Llama 2 Touvron et al. (2023) models for next token language modeling. Keeping the rest of the architecture constant, we replace the attention mechanism with the proposed variations. We show log loss to emphasize the differences between each model. Section 4.1 shows our proposed replacement is empirically equivalent to softmax. Section 4.2 examines the scalability of the proposed replacement. Section 4.3 evaluates linear attention against normal softmax and our proposed methods. As our method uses a Taylor expansion, Section 4.4 looks at the performance of different linear attention variations via progressive additions of higher order powers. Section 4.5 ablates various elements of the recurrent softmax attention. |
| Researcher Affiliation | Academia | Gabriel Mongaras EMAIL Department of Computer Science Southern Methodist University Eric C. Larson EMAIL Institute for Computational Biosciences Southern Methodist University |
| Pseudocode | No | The paper includes mathematical derivations and figures (e.g., Figure 1 for visual representation of Softmax attention as an RNN), but it does not contain explicit pseudocode or algorithm blocks with structured, step-by-step procedures. |
| Open Source Code | Yes | Code found at: https://github.com/gmongaras/On-the-Expressiveness-of-Softmax-Attention-A-Recurrent-Neural-Network-Perspective |
| Open Datasets | Yes | we retrain on three datasets: The Pile Gao et al. (2021), SlimPajama Shen et al. (2023), and FineWeb Penedo et al. (2024). |
| Dataset Splits | Yes | Test percentage 0.001 |
| Hardware Specification | Yes | For most experiments, we use distributed data parallel processing to train on two 80 GB A100 GPUs, with the exceptions of the large model, trained on 4 GPUs, and the 4096 sequence length run, trained on 6 GPUs. |
| Software Dependencies | No | Unless otherwise mentioned, the below are the parameters we used in our models. Our base model is Llama 2 Touvron et al. (2023); RoPE Su et al. (2024) is used on the attention matrix and the MLPs follow SwiGLU Shazeer (2020). The paper names specific architectures (Llama 2, RoPE, SwiGLU) and general techniques (distributed data parallel processing), but it does not provide version numbers for software libraries or environments such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Batch size 36; learning rate 1e-4; warmup steps 10,000 (linear warmup from 0, then linear decay); num steps 100,000; float32 and bfloat16 mixed precision; weight decay 0.01; max sequence length 1024 for general experiments, 4096 for the length scaling experiment; test percentage 0.001; optimizer AdamW with betas 0.9 and 0.999; hidden size 1024 (3072 for the large model); MLP intermediate size 2048 (6144 for the large model); num attention heads 16; num hidden layers 20; tokenizer llama2-7b-hf; gradient clipping 1.0 for gated models, no clipping for all other experiments |
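The reported hyperparameters can be collected into a single configuration sketch. This is illustrative only: the key names below are our own assumptions, and the authors' released code may organize these settings differently.

```python
# Hedged sketch: the hyperparameters reported in the paper's experiment
# setup, gathered into a plain dict. Key names are assumptions, not taken
# from the authors' repository.
config = {
    "batch_size": 36,
    "learning_rate": 1e-4,
    "warmup_steps": 10_000,
    "lr_schedule": "linear warmup from 0, then linear decay",
    "num_steps": 100_000,
    "precision": "float32/bfloat16 mixed",
    "weight_decay": 0.01,
    "max_seq_len": 1024,            # 4096 for the length scaling experiment
    "test_fraction": 0.001,
    "optimizer": "AdamW",
    "adam_betas": (0.9, 0.999),
    "hidden_size": 1024,            # 3072 for the large model
    "mlp_intermediate_size": 2048,  # 6144 for the large model
    "num_attention_heads": 16,
    "num_hidden_layers": 20,
    "tokenizer": "llama2-7b-hf",
    "grad_clip": 1.0,               # gated models only; no clipping otherwise
}

# Consistency check: the MLP intermediate size is 2x the hidden size at
# both reported model scales (2048/1024 and 6144/3072).
assert config["mlp_intermediate_size"] == 2 * config["hidden_size"]
```

A structure like this makes it straightforward to verify internal consistency (e.g., the 2x MLP expansion ratio holds for both the base and large models) before attempting a reproduction.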