Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning in Compact Spaces with Approximately Normalized Transformer
Authors: Jörg Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 presents extensive experimental evaluations: Hyperparameter scaling trends are derived across multiple model sizes. Results demonstrate over 40% convergence speedup compared to GPT models with QK normalization and outperform or perform on par with nGPT. Compute-dependent scaling laws reveal scaling behavior matching GPT. |
| Researcher Affiliation | Collaboration | ¹University of Freiburg, ²ELLIS Institute Tübingen, ³Open-Sci Collective, ⁴LAION, ⁵Jülich Supercomputing Centre (JSC), ⁶Prior Labs, ⁷Perspix.ai |
| Pseudocode | No | The paper describes the architecture and modifications textually and via mathematical derivations, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | An open-source implementation of anGPT is available at https://github.com/automl/anGPT. |
| Open Datasets | Yes | We used two datasets in this paper, SlimPajama (Apache 2.0 license) [14] and OpenWebText (Creative Commons Zero v1.0) [24]¹. SlimPajama provides a validation set, and for OpenWebText, we used 10k randomly selected documents as a validation set. ¹Both datasets are accessible on Huggingface: https://huggingface.co/datasets/cerebras/SlimPajama-627B and https://huggingface.co/datasets/Skylion007/openwebtext |
| Dataset Splits | No | The paper states: "SlimPajama provides a validation set, and for OpenWebText, we used 10k randomly selected documents as a validation set." While this indicates the presence of a validation set, it does not provide specific split percentages or counts for the entire dataset (e.g., train/test/validation ratios or exact sizes of each split for SlimPajama), which is needed to fully reproduce the data partitioning. |
| Hardware Specification | Yes | Table 2: Runtime comparison with a 0.5B parameter model on a GPU node with 4 A100 (40GB) GPUs with a sequence length of 2048 and a batch size of 8. The experiments use torch.compile with default settings. Avg. runtime per step and relative increase: GPT+ 0.1416 (baseline); anGPT 0.1455 (+2.75%); nGPT 0.1552 (+9.60%). We performed all experiments on a research cluster with 4 A100 40GB GPU nodes and used in total about 30k GPU hours. |
| Software Dependencies | Yes | We implemented our experiments in PyTorch 2.6 [34] and used FlashAttention 0.7.3 [33]. All plots are generated with Matplotlib [35]. |
| Experiment Setup | Yes | Table F.2: If not otherwise specified, we used the following hyperparameters for the GPT+, nGPT, and anGPT training runs in the experiment section. Where two values are given, the first applies to GPT+ and the second to nGPT/anGPT. Gradient Clip Val: 1.0; Precision: bf16-mixed; Optimizer: AdamW; Adam Beta1: 0.9; Beta2: 0.95; Eps: 1.0 × 10^-9; Weight decay: 0.1 / 0; Lr Num warm-up Steps: 20% / 0; Lr Decay Factor: 0.01; Lr Schedule: Cosine; Param. Scale Init: 1/ dm / 0.001; Dropout: 0; Rotary Pos Embed: True; Rotary Emb Fraction: 0.5; Use Bias: False; Flash Attention: True; Torch Compile: True; Context size: 2048 |
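
The learning-rate entries in the Experiment Setup row (cosine schedule, 20% warm-up for GPT+ and none for nGPT/anGPT, decay factor 0.01) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear warm-up and assumes that the decay factor means the final learning rate is 0.01 times the peak; the function name `lr_at_step`, the peak learning rate, and the step count are hypothetical values chosen for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.2, decay_factor=0.01):
    """Linear warm-up, then cosine decay to decay_factor * peak_lr.

    Assumptions (not confirmed by the source): warm-up is linear, and
    "Lr Decay Factor" is the ratio of final to peak learning rate.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = decay_factor * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000   # hypothetical number of training steps
    peak = 1e-3    # hypothetical peak learning rate (not given in the table)
    for s in (0, 100, 199, 600, 1000):
        gpt_plus = lr_at_step(s, total, peak, warmup_frac=0.2)  # GPT+ column
        normalized = lr_at_step(s, total, peak, warmup_frac=0.0)  # nGPT/anGPT column
        print(f"step {s:4d}  GPT+ {gpt_plus:.6f}  nGPT/anGPT {normalized:.6f}")
```

Setting `warmup_frac=0.0` reproduces the no-warm-up setting listed for nGPT/anGPT, so both columns of the table share one function and differ only in that argument.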