Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Authors: Gabriel Mongaras, Eric C. Larson
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our hypothetical softmax alternative, we train multiple Llama 2 Touvron et al. (2023) models for next token language modeling. Keeping the rest of the architecture constant, we replace the attention mechanism with the proposed variations. We show log loss to emphasize the differences between each model. Section 4.1 shows our proposed replacement is empirically equivalent to softmax. Section 4.2 examines the scalability of the proposed replacement. Section 4.3 evaluates linear attention against normal softmax and our proposed methods. As our method uses a Taylor expansion, Section 4.4 looks at the performance of different linear attention variations via progressive additions of higher order powers. Section 4.5 ablates various elements of the recurrent softmax attention. |
| Researcher Affiliation | Academia | Gabriel Mongaras EMAIL Department of Computer Science Southern Methodist University Eric C. Larson EMAIL Institute for Computational Biosciences Southern Methodist University |
| Pseudocode | No | The paper includes mathematical derivations and figures (e.g., Figure 1 for visual representation of Softmax attention as an RNN), but it does not contain explicit pseudocode or algorithm blocks with structured, step-by-step procedures. |
| Open Source Code | Yes | Code found at: https://github.com/gmongaras/On-the-Expressiveness-of-Softmax-Attention-A-Recurrent-Neural-Network-Perspective |
| Open Datasets | Yes | we retrain on three datasets: The Pile Gao et al. (2021), SlimPajama Shen et al. (2023), and FineWeb Penedo et al. (2024). |
| Dataset Splits | Yes | Test percentage 0.001 |
| Hardware Specification | Yes | For most experiments, we use distributed data parallel processing to train on two 80 GB A100 GPUs, with the exceptions of the large model, trained on 4 GPUs, and the 4096 sequence length run, trained on 6 GPUs. |
| Software Dependencies | No | Unless otherwise mentioned, the below are the parameters we used in our models. Our base model is Llama 2 Touvron et al. (2023); RoPE Su et al. (2024) is used on the attention matrix and the MLPs follow SwiGLU Shazeer (2020). The paper names specific architectures (Llama 2, RoPE, SwiGLU) and general techniques (distributed data parallel processing), but it does not provide version numbers for software libraries or environments such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Batch size 36; learning rate 1e-4; warmup steps 10,000 (linear warmup from 0, then linear decay); num steps 100,000; float32 and bfloat16 mixed precision; weight decay 0.01; max sequence length 1024 for general experiments, 4096 for the length scaling experiment; test percentage 0.001; optimizer AdamW with betas 0.9 and 0.999; hidden size 1024 (3072 for the large model); MLP intermediate size 2048 (6144 for the large model); num attention heads 16; num hidden layers 20; tokenizer llama2-7b-hf; gradient clipping 1.0 for gated models, no clipping for all other experiments |
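The reported hyperparameters can be collected into a single configuration sketch. This is illustrative only: the key names below are our own assumptions, and the authors' released code may organize these settings differently.

```python
# Hedged sketch: the hyperparameters reported in the paper's experiment
# setup, gathered into a plain dict. Key names are assumptions, not taken
# from the authors' repository.
config = {
    "batch_size": 36,
    "learning_rate": 1e-4,
    "warmup_steps": 10_000,
    "lr_schedule": "linear warmup from 0, then linear decay",
    "num_steps": 100_000,
    "precision": "float32/bfloat16 mixed",
    "weight_decay": 0.01,
    "max_seq_len": 1024,            # 4096 for the length scaling experiment
    "test_fraction": 0.001,
    "optimizer": "AdamW",
    "adam_betas": (0.9, 0.999),
    "hidden_size": 1024,            # 3072 for the large model
    "mlp_intermediate_size": 2048,  # 6144 for the large model
    "num_attention_heads": 16,
    "num_hidden_layers": 20,
    "tokenizer": "llama2-7b-hf",
    "grad_clip": 1.0,               # gated models only; no clipping otherwise
}

# Consistency check: the MLP intermediate size is 2x the hidden size at
# both reported model scales (2048/1024 and 6144/3072).
assert config["mlp_intermediate_size"] == 2 * config["hidden_size"]
```

A structure like this makes it straightforward to verify internal consistency (e.g., the 2x MLP expansion ratio holds for both the base and large models) before attempting a reproduction.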