Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Learning in Compact Spaces with Approximately Normalized Transformer

Authors: Jörg Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 presents extensive experimental evaluations: hyperparameter scaling trends are derived across multiple model sizes. Results demonstrate a convergence speedup of over 40% compared to GPT models with QK normalization and performance on par with or better than nGPT. Compute-dependent scaling laws reveal scaling behavior matching GPT.
Researcher Affiliation | Collaboration | (1) University of Freiburg; (2) ELLIS Institute Tübingen; (3) Open-Sci Collective; (4) LAION; (5) Jülich Supercomputing Centre (JSC); (6) Prior Labs; (7) Perspix.ai
Pseudocode | No | The paper describes the architecture and modifications textually and via mathematical derivations, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | An open-source implementation of anGPT is available at https://github.com/automl/anGPT.
Open Datasets | Yes | We used two datasets in this paper, SlimPajama (Apache 2.0 license) [14] and OpenWebText (Creative Commons Zero v1.0) [24]. SlimPajama provides a validation set, and for OpenWebText, we used 10k randomly selected documents as a validation set. Footnote 1: Both datasets are accessible on Huggingface: https://huggingface.co/datasets/cerebras/SlimPajama-627B and https://huggingface.co/datasets/Skylion007/openwebtext
Dataset Splits | No | The paper states: "SlimPajama provides a validation set, and for OpenWebText, we used 10k randomly selected documents as a validation set." While this indicates the presence of a validation set, it does not provide specific split percentages or counts for the entire dataset (e.g., train/test/validation ratios or exact sizes of each split for SlimPajama), which is needed to fully reproduce the data partitioning.
Hardware Specification | Yes | Table 2: Runtime comparison with a 0.5B parameter model on a GPU node with 4 A100 (40GB) GPUs with a sequence length of 2048 and a batch size of 8. The experiments use torch.compile with default settings. Columns: Model, Avg. Runtime per Step, Rel. Increase (%). Rows: GPT+, 0.1416; anGPT, 0.1455, 2.75; nGPT, 0.1552, 9.60. We performed all experiments on a research cluster with 4 A100 40GB GPU nodes and used in total about 30k GPU hours.
Software Dependencies | Yes | We implemented our experiments in PyTorch 2.6 [34] and used FlashAttention 0.7.3 [33]. All plots are generated with Matplotlib [35].
Experiment Setup | Yes | Table F.2: If not otherwise specified, we used the following hyperparameters for the GPT+, nGPT, and anGPT training runs in the experiment section. Columns: Parameter, GPT+, nGPT/anGPT (where two values are given, they follow this column order). Gradient Clip Val: 1.0; Precision: bf16-mixed; Optimizer: AdamW; Adam Beta1: 0.9; Beta2: 0.95; Eps: 1.0 × 10^-9; Weight decay: 0.1, 0; Lr Num warm-up Steps: 20%, 0; Lr Decay Factor: 0.01; Lr Schedule: Cosine; Param. Scale Init: 1/ dm / 0.001; Dropout: 0; Rotary Pos Embed: True; Rotary Emb Fraction: 0.5; Use Bias: False; Flash Attention: True; Torch Compile: True; Context size: 2048
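
The relative runtime increases quoted in the Hardware Specification row (Table 2) can be re-derived from the average per-step runtimes. The snippet below is a minimal sketch of that arithmetic, not code from the paper: the dictionary and the function name `relative_increase` are illustrative, and GPT+ is assumed to be the baseline because it is the only row without a reported increase.

```python
# Average runtime per step as reported in Table 2 (unit not stated in the row).
avg_runtime = {"GPT+": 0.1416, "anGPT": 0.1455, "nGPT": 0.1552}


def relative_increase(baseline: float, value: float) -> float:
    """Return the percentage increase of `value` over `baseline`."""
    return (value / baseline - 1.0) * 100.0


# Assumption: GPT+ is the baseline that the other models are compared against.
baseline = avg_runtime["GPT+"]
increases = {
    name: round(relative_increase(baseline, runtime), 2)
    for name, runtime in avg_runtime.items()
    if name != "GPT+"
}

for name, pct in increases.items():
    print(f"{name}: +{pct:.2f}% per step relative to GPT+")
```

Under that assumption, the computed values agree with the 2.75% (anGPT) and 9.60% (nGPT) figures in the table after rounding to two decimals.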