Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Authors: Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that the normalized Transformer (nGPT) reduces the number of training steps required to achieve the same accuracy by a factor of 4 to 20. [Section 3, Experiments:] We train both the baseline Transformer (GPT) and the normalized Transformer (nGPT) on the OpenWebText dataset (Gokaslan & Cohen, 2019) and evaluate them on a set of standard downstream tasks. We experiment with models containing 0.5B and 1B parameters, including the embeddings. For both GPT and nGPT, we report results using the best initial learning rate settings (see Appendix A.7). |
| Researcher Affiliation | Industry | Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun & Boris Ginsburg, NVIDIA |
| Pseudocode | No | The paper describes the modifications to the Transformer architecture using mathematical equations and textual descriptions (e.g., in Section 2.6 Summary of Modifications), but it does not include a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | In order to illustrate how nGPT works, we reimplemented nGPT using nanoGPT (Karpathy, 2023) and published our re-implementation at https://github.com/NVIDIA/ngpt. |
| Open Datasets | Yes | We train both the baseline Transformer (GPT) and the normalized Transformer (nGPT) on the OpenWebText dataset (Gokaslan & Cohen, 2019) and evaluate them on a set of standard downstream tasks. We investigate the length extrapolation ability of nGPT by evaluating its perplexity on the PG19 dataset, as shown in Figure 14. |
| Dataset Splits | No | The paper mentions 'Validation loss' and 'Training tokens in billions' in figures and accompanying text, indicating the use of training and validation sets. However, it does not provide specific details on how these splits were created (e.g., percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We trained our models using 64 A100 GPUs distributed across 8 nodes (8 GPUs per node). |
| Software Dependencies | No | We use the LLaMA-2 tokenizer with 32k tokens. All experiments described in this paper were performed using an internal library based on Megatron-LM (Shoeybi et al., 2019). In order to illustrate how nGPT works, we reimplemented nGPT using nanoGPT (Karpathy, 2023) and published our re-implementation at https://github.com/NVIDIA/ngpt. While several software tools are mentioned, specific version numbers for these tools or other key libraries (e.g., Python, PyTorch, CUDA) are not provided. |
| Experiment Setup | Yes | Table 2 (Model Parameters for GPT and nGPT; 0.5B / 1B): Number of Layers (nlayers): 24 / 36; Model Dimension (dmodel): 1024 / 1280; Number of Attention Heads (nheads): 16 / 20; Key Dimension (dk): dmodel/nheads for both; MLP Dimension (dMLP): 4·dmodel for both. Table 3 (Optimization Parameters; GPT / nGPT): Optimizer: AdamW / Adam (AdamW with weight decay 0.0); Weight Decay: 0.1 / 0.0; Number of Warmup Steps: 2000 / 0; Learning Rate Schedule: Cosine Annealing for both; Initial Learning Rate: problem-specific; Final Learning Rate: 0. All matrix parameters are initialized by sampling from a zero-mean normal distribution with a standard deviation of 0.02 for GPT and 1/√dmodel for nGPT. The standard deviation for the output matrices was scaled by a factor of 1/√(2·nlayers), as suggested by Radford et al. (2018). The base of RoPE is 10000. The initialization of the additional parameters introduced in nGPT is described in Section 2.6. |
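The paper's central idea, constraining representations to the unit hypersphere, can be illustrated with a toy row-normalization step. This is a minimal sketch under stated assumptions: the function name, matrix shape, and epsilon guard below are illustrative and are not taken from the NVIDIA/ngpt codebase.

```python
import numpy as np

def normalize_rows(w: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project each row of a matrix onto the unit hypersphere (L2 norm = 1)."""
    norms = np.linalg.norm(w, axis=-1, keepdims=True)
    return w / np.maximum(norms, eps)

# Toy example: after normalization every row has unit L2 norm, so the dot
# product of any two rows is their cosine similarity, bounded in [-1, 1].
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8))  # GPT-style init scale, toy shape
w_norm = normalize_rows(w)
print(np.linalg.norm(w_norm, axis=-1))
```

One consequence relevant to the reproducibility entries above: with weights kept on the hypersphere, weight decay becomes redundant, which is consistent with the reported nGPT optimizer setting of AdamW with weight decay 0.0 and no warmup.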