Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Authors: Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the normalized Transformer (nGPT) reduces the number of training steps required to achieve the same accuracy by a factor of 4 to 20. [Section 3, Experiments:] We train both the baseline Transformer (GPT) and the normalized Transformer (nGPT) on the OpenWebText dataset (Gokaslan & Cohen, 2019) and evaluate them on a set of standard downstream tasks. We experiment with models containing 0.5B and 1B parameters, including the embeddings. For both GPT and nGPT, we report results using the best initial learning rate settings (see Appendix A.7). |
| Researcher Affiliation | Industry | Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun & Boris Ginsburg, NVIDIA |
| Pseudocode | No | The paper describes the modifications to the Transformer architecture using mathematical equations and textual descriptions (e.g., in Section 2.6 Summary of Modifications), but it does not include a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | In order to illustrate how nGPT works, we reimplemented nGPT using nanoGPT (Karpathy, 2023) and published our re-implementation at https://github.com/NVIDIA/ngpt. |
| Open Datasets | Yes | We train both the baseline Transformer (GPT) and the normalized Transformer (nGPT) on the OpenWebText dataset (Gokaslan & Cohen, 2019) and evaluate them on a set of standard downstream tasks. We investigate the length extrapolation ability of nGPT by evaluating its perplexity on the PG19 dataset, as shown in Figure 14. |
| Dataset Splits | No | The paper mentions 'Validation loss' and 'Training tokens in billions' in figures and accompanying text, indicating the use of training and validation sets. However, it does not provide specific details on how these splits were created (e.g., percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We trained our models using 64 A100 GPUs distributed across 8 nodes (8 GPUs per node). |
| Software Dependencies | No | We use the LLaMA-2 tokenizer with 32k tokens. All experiments described in this paper were performed using an internal library based on Megatron-LM (Shoeybi et al., 2019). In order to illustrate how nGPT works, we reimplemented nGPT using nanoGPT (Karpathy, 2023) and published our re-implementation at https://github.com/NVIDIA/ngpt. While several software tools are mentioned, specific version numbers for these tools or other key libraries (e.g., Python, PyTorch, CUDA) are not provided. |
| Experiment Setup | Yes | Table 2 (Model Parameters for GPT and nGPT; 0.5B / 1B): Number of Layers (nlayers): 24 / 36; Model Dimension (dmodel): 1024 / 1280; Number of Attention Heads (nheads): 16 / 20; Key Dimension (dk): dmodel/nheads for both; MLP Dimension (dMLP): 4·dmodel for both. Table 3 (Optimization Parameters; GPT / nGPT): Optimizer: AdamW / Adam (AdamW with weight decay 0.0); Weight Decay: 0.1 / 0.0; Number of Warmup Steps: 2000 / 0; Learning Rate Schedule: Cosine Annealing for both; Initial Learning Rate: problem-specific; Final Learning Rate: 0. All matrix parameters are initialized by sampling from a zero-mean normal distribution with a standard deviation of 0.02 for GPT and 1/√dmodel for nGPT. The standard deviation for the output matrices was scaled by a factor of 1/√(2·nlayers), as suggested by Radford et al. (2018). The base of RoPE is 10000. The initialization of the additional parameters introduced in nGPT is described in Section 2.6. |
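The paper's central idea, constraining representations to the unit hypersphere, can be illustrated with a toy row-normalization step. This is a minimal sketch under stated assumptions: the function name, matrix shape, and epsilon guard below are illustrative and are not taken from the NVIDIA/ngpt codebase.

```python
import numpy as np

def normalize_rows(w: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project each row of a matrix onto the unit hypersphere (L2 norm = 1)."""
    norms = np.linalg.norm(w, axis=-1, keepdims=True)
    return w / np.maximum(norms, eps)

# Toy example: after normalization every row has unit L2 norm, so the dot
# product of any two rows is their cosine similarity, bounded in [-1, 1].
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8))  # GPT-style init scale, toy shape
w_norm = normalize_rows(w)
print(np.linalg.norm(w_norm, axis=-1))
```

One consequence relevant to the reproducibility entries above: with weights kept on the hypersphere, weight decay becomes redundant, which is consistent with the reported nGPT optimizer setting of AdamW with weight decay 0.0 and no warmup.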