Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Authors: Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, AJAY JAISWAL, Fan Zhang, Jishan Hu, Yang Wang, Hao CHEN, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our main contributions are summarized as follows: We validate the effectiveness of GPAS through pretraining experiments across various model sizes, and show that its benefits carry over to downstream supervised finetuning tasks. We conduct a thorough analysis of the training dynamics and model properties of models with and without GPAS. Section 4: Experiments and Results. Section 5: Analysis.
Researcher Affiliation | Collaboration | 1 The Hong Kong University of Science and Technology; 2 International Digital Economy Academy; 3 Dalian University of Technology; 4 Emory University; 5 University of Texas at Austin; 6 NVIDIA; 7 University of Oxford; 8 University of Surrey
Pseudocode | No | The paper describes methods using mathematical equations and architectural diagrams (Figure 1), but does not contain explicitly structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/dandingsky/GPAS.
Open Datasets | Yes | We adopt the Adam [35] optimizer and train on the C4 dataset [36, 37]. We finetune the models on the Commonsense170K dataset [38] and evaluate the models on seven downstream tasks. C4 dataset: This is the dataset we used in our pretraining experiments. It's released under the Open Data Commons License Attribution family License. Commonsense170K dataset: This is the dataset we used for supervised fine-tuning, released under MIT License.
Dataset Splits | No | The paper mentions using the C4 dataset for pretraining and the Commonsense170K dataset for supervised finetuning and evaluation on downstream tasks. It provides general training steps and evaluation steps, but does not specify explicit train/test/validation splits (e.g., percentages, sample counts, or citations to specific splits) for these datasets as applied to their experiments.
Hardware Specification | Yes | All experiments are carried out on 4 NVIDIA H800 GPUs.
Software Dependencies | No | Following [28], we adopt the Adam [35] optimizer and train on the C4 dataset [36, 37]. We tokenize the pretraining corpus with T5 tokenizer [36] since it was trained on the C4 dataset. All models share the same attention and FFN architectures as well as normalization layers except for normalization scheme. Specifically, all baseline architectures (from Post-LN to LNS) utilize RMSNorm [17], LLaMA attention and LLaMA MLP [32] with SwiGLU activation [33]. Following [12], we finetune the models on the Commonsense170K dataset [38] and evaluate the models on seven downstream tasks. The learnable gates α_l are frozen during SFT to avoid disturbing pretrained knowledge. We use a learning rate of 3 × 10⁻⁴ and train for 4 epochs using LISA [39]. As for evaluation, we adopt the widely used LM Evaluation Harness [40].
Experiment Setup | Yes | Pretraining. We perform pretraining experiments across all architectures and at five model scales: 71M, 130M, 250M, 350M, and 1B. For the larger 7B configuration, we only pretrain Pre-LN and Pre + GPAS due to limited computational resources. Following [28], we adopt the Adam [35] optimizer and train on the C4 dataset [36, 37]. We tokenize the pretraining corpus with T5 tokenizer [36] since it was trained on the C4 dataset. For models with GPAS, we initialize all learnable gates α_l = 0. We also use the same gate value for both attention and FFN sub-layers within the same layer. Table 1: Pretraining configurations for models of different sizes. Model Size, Learning Rate, Warmup Steps, Training Steps, Batch Size, Train Tokens, Eval Tokens (with specific values for each model size). Supervised finetuning. Following [12], we finetune the models on the Commonsense170K dataset [38] and evaluate the models on seven downstream tasks. The learnable gates α_l are frozen during SFT to avoid disturbing pretrained knowledge. We use a learning rate of 3 × 10⁻⁴ and train for 4 epochs using LISA [39]. For 7B models: We follow [41] and use a learning rate of 3 × 10⁻⁴ with 10K warmup steps and cosine decay. Batch size is set to 2048 and scheduled to train for 150K steps on 60B tokens. We use gradient clipping of 0.01 on gate parameters α_l and 1.0 on other parameters to stabilize training.
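
The 7B optimization settings quoted in the Experiment Setup row (peak learning rate of 3e-4, 10K warmup steps, cosine decay over 150K steps, and gradient clipping at 0.01 for gate parameters versus 1.0 for all other parameters) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation. The linear warmup shape, the decay floor of zero (`final_ratio`), and the choice of L2-norm clipping applied separately to each parameter group are assumptions, since the quoted text does not specify them. All function and constant names are hypothetical.

```python
import math

# Values taken from the quoted 7B setup.
PEAK_LR = 3e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 150_000
GATE_CLIP = 0.01   # clipping threshold for the learnable gates
OTHER_CLIP = 1.0   # clipping threshold for all other parameters


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS,
          final_ratio=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    ASSUMPTION: warmup is linear from 0 and the schedule decays to
    final_ratio * peak_lr; neither detail is stated in the quoted text.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)


def clip_by_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is <= max_norm.

    ASSUMPTION: clipping is by L2 norm; the quoted text only gives thresholds.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def clip_grouped(gate_grads, other_grads):
    """Clip gate gradients and all other gradients with separate thresholds."""
    return (clip_by_norm(gate_grads, GATE_CLIP),
            clip_by_norm(other_grads, OTHER_CLIP))


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 80_000, 150_000):
        print(step, lr_at(step))
    print(clip_grouped([0.03, 0.04], [0.3, 0.4]))
```

Keeping the two groups separate means a large gradient on the gates cannot consume the clipping budget of the remaining parameters, and vice versa. In a real training loop the same idea would typically be expressed with the framework's own clipping utility applied once per parameter group.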