Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scale-invariant attention
Authors: Ben Anson, Xi Wang, Laurence Aitchison
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval. One key challenge in modern LLMs is scaling up context length at inference time, while maintaining model performance. We approach this question of length generalisation by considering scale invariance. In this section, we compare scale-invariant attention with other dense attention methods including Dynamic NTK interpolation (RoPE+NTK) (bloc97, 2023), LogN scaling/SSMax (Nakanishi, 2025), p-RoPE (Barbero et al., 2024b), and ALiBi (Press et al., 2021). Our results show that our method, scale-invariant p-RoPE, has uniformly lower validation loss at a variety of training lengths (4k, 16k, 64k). |
| Researcher Affiliation | Academia | Ben Anson School of Mathematics University of Bristol EMAIL Xi Wang Department of Computer Science Johns Hopkins University EMAIL Laurence Aitchison School of Computer Science University of Bristol EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present. The paper describes methods through mathematical formulations and text. |
| Open Source Code | Yes | We provide code in the supplementary materials along with instructions to reproduce the main results. |
| Open Datasets | Yes | We pretrained GPT-2-style models (Radford et al., 2019) (with QK-norm, ReLU² activations, etc. (Jordan et al., 2024a)) from scratch on the FineWeb dataset (Penedo et al., 2024), using a fixed training data ordering and a 10M token validation set. Our task constructs prompts by concatenating text samples from the C4 dataset (Roberts et al., 2019) and embedding needles of the form `The special magic <city> number is <7_digit_number>.` This project uses a 10B subset of the FineWeb dataset, https://huggingface.co/datasets/kjj0/fineweb10B-gpt2, which is MIT licensed. This project uses a 100B subset of the FineWeb dataset, https://huggingface.co/datasets/kjj0/fineweb100B-gpt2, which is MIT licensed. This project uses the C4 dataset, https://huggingface.co/datasets/kjj0/fineweb100B-gpt2, which is licensed under the Open Civic Data Attribution License (OCDBY). |
| Dataset Splits | Yes | We pretrained GPT-2-style models (...) from scratch on the FineWeb dataset (...), using a fixed training data ordering and a 10M token validation set. The 162M models were trained for 4578 steps on 2.4B tokens over a range of context lengths (4k, 16k, 64k). The 304M parameter models were trained only on the best length-generalising methods, for 10.9k steps on 10B tokens, at 4k context length. We trained models on sequences of length 4k, and tested at 4k, 16k, and 64k. |
| Hardware Specification | Yes | We trained the smaller (162M) models on single A100 80G GPUs. We trained the 304M models on 4x H100 Grace Hopper nodes using distributed data parallelism. |
| Software Dependencies | Yes | Our base model is a modded-nanogpt (Jordan et al., 2024a) variant, which is similar to GPT2 (Radford et al., 2019)... We implemented scale-invariant attention and ALiBi using FlexAttention (Dong et al., 2024). We trained using the Torchtune (2024) library, using data from FineWeb (Penedo et al., 2024), with AdamW and a learning rate of 2e-5. |
| Experiment Setup | Yes | We optimized embedding parameters using Adam with learning rate γ = 0.3, β = (0.9, 0.95). We optimized linear layers with Muon, with no weight decay, γ = 0.02 and momentum 0.95. For remaining parameters (unembedding and LogN scalings, if LogN trick was active), we optimized with Adam, using γ = 0.002, β = (0.9, 0.95). We trained for 4578 steps. The batch size was 8 × 65536 / L_tr, with more gradient accumulation for shorter training lengths. We used a cosine learning rate schedule, with no warmup, and a minimum learning rate of 0 (at the end of training). |
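
The batch-size rule and learning rate schedule quoted in the Experiment Setup row can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the function names are hypothetical, it assumes that "4k", "16k", and "64k" mean 4096, 16384, and 65536 tokens, and it assumes the cosine schedule reaches zero exactly at the final step.

```python
import math

TOTAL_STEPS = 4578  # number of training steps quoted in the setup row


def batch_size(train_len: int) -> int:
    """Sequences per optimizer step: 8 * 65536 / L_tr (hypothetical helper)."""
    return 8 * 65536 // train_len


def tokens_per_step(train_len: int) -> int:
    """Tokens consumed per optimizer step at a given training length."""
    return batch_size(train_len) * train_len


def cosine_lr(step: int, base_lr: float, total_steps: int = TOTAL_STEPS) -> float:
    """Cosine decay from base_lr to 0 with no warmup.

    Assumption: the minimum of 0 is reached at step == total_steps.
    """
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Assumed token counts for the 4k, 16k, and 64k training lengths.
    for train_len in (4096, 16384, 65536):
        print(train_len, batch_size(train_len), tokens_per_step(train_len))
    # Total tokens seen over the whole run (constant across training lengths).
    print(TOTAL_STEPS * tokens_per_step(4096))
    # Muon learning rate (0.02) at the start, midpoint, and end of training.
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, cosine_lr(step, base_lr=0.02))
```

Under these assumptions every training length consumes the same 524,288 tokens per step, so 4578 steps amount to roughly 2.4B tokens, which agrees with the figure quoted in the Dataset Splits row.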