Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
The Curse of Depth in Large Language Models
Authors: Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of LayerNorm Scaling, we follow the experimental setup of Li et al. [26], using the identical model configurations and training conditions to compare LNS with widely used normalization techniques... Table 1: Perplexity (↓) comparison of various layer normalization methods... Figure 4: Left: Schematic diagrams of (a) Pre-LN and (b) LayerNorm Scaling. Right: Language modeling loss of scaling up parameter count up to 7B. All models are trained for 20B tokens using OLMo [15]. |
| Researcher Affiliation | Academia | Wenfang Sun¹, Xinyuan Song², Pengxiang Li³, Lu Yin⁴, Yefeng Zheng¹, Shiwei Liu⁵. ¹Westlake University, ²Emory University, ³Dalian University of Technology, ⁴University of Surrey, ⁵University of Oxford. Corresponding author: EMAIL |
| Pseudocode | No | The paper describes methods and formulas textually, for example: 'Formally, for a Transformer model with L layers, the output of Layer Normalization in each layer ℓ is scaled by a factor of 1 / √ℓ. Let h(ℓ) denote the input to Layer Normalization at layer ℓ. The modified output is computed as: h(ℓ) = LayerNorm(h(ℓ)) × (1 / √ℓ)'. Figure 4 shows schematic diagrams of Pre-LN and LayerNorm Scaling but not formal pseudocode. |
| Open Source Code | Yes | Our code is available at LayerNorm-Scaling. |
| Open Datasets | Yes | Specifically, we prune one layer at a time, without any fine-tuning, and directly evaluate the resulting pruned models on the MMLU benchmark [17]... For BERT-Large, we evaluate using the SQuAD v1.1 dataset [36], while for other models, we use MMLU [17]... tokens sampled from the C4 dataset... we perform SFT with the models obtained from Section 5.1 on the Commonsense170K dataset [18] across eight downstream tasks. |
| Dataset Splits | Yes | For BERT-Large, we evaluate using the SQuAD v1.1 dataset [36], while for other models, we use MMLU [17], a standard benchmark for multi-task language understanding. To reduce variance, we report the average distance over 256K tokens sampled from the C4 dataset. We perform SFT with the models obtained from Section 5.1 on the Commonsense170K dataset [18] across eight downstream tasks. We adopt the same fine-tuning configurations as used in Li et al. [26]. |
| Hardware Specification | No | modern LLMs are extremely resource-intensive to train, often requiring thousands of GPUs trained for multiple months... All models are trained on a fixed 20B-token budget to ensure comparability. |
| Software Dependencies | No | The architecture incorporates RMSNorm [37] and SwiGLU activations [56]... For optimization, we use the Adam optimizer [22] and adopt size-specific learning rates: 1e-3 for models up to 350M parameters, and 5e-4 for the 1B parameter model. |
| Experiment Setup | Yes | To evaluate the effectiveness of LayerNorm Scaling, we follow the experimental setup of Li et al. [26], using the identical model configurations and training conditions to compare LNS with widely used normalization techniques, including Post-LN [34], DeepNorm [47], and Pre-LN [8]. In line with Lialin et al. [27] and Zhao et al. [58], we conduct experiments using LLaMA-based architectures with model sizes of 130M, 250M, 350M, and 1B parameters... For optimization, we use the Adam optimizer [22] and adopt size-specific learning rates: 1e-3 for models up to 350M parameters, and 5e-4 for the 1B parameter model. All models share the same architecture, hyperparameters, and training schedule, with the only difference being the choice of normalization method. |
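
The Pseudocode row quotes the scaling rule only as a formula. The snippet below is a minimal sketch of that rule, not the authors' released code: it applies a plain layer normalization (no learned gain or bias) to a list of floats and multiplies the result by 1/√ℓ. The function names `layer_norm` and `layer_norm_scaling`, the `eps` value, and the 1-based `layer_index` argument are illustrative assumptions. The experiments quoted above use RMSNorm in LLaMA-based models, which this sketch does not reproduce.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over a 1-D list (assumed: no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaling(x, layer_index, eps=1e-5):
    """Scale the normalized output at layer `layer_index` (1-based) by 1/sqrt(layer_index)."""
    if layer_index < 1:
        raise ValueError("layer_index is 1-based and must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [v * scale for v in layer_norm(x, eps)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]  # toy hidden state, hypothetical values
    for depth in (1, 4, 16):
        scaled = layer_norm_scaling(hidden, depth)
        print(depth, [round(v, 4) for v in scaled])
```

With this rule the normalized output shrinks as the layer index grows: the output at layer 4 is half the output at layer 1, and at layer 16 it is a quarter.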