Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Authors: Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. The experimental setup (Section 4.1) is followed by a comparison between AlphaDecay and several baselines (Section 4.2). Finally, we analyze the impact of weight decay assignment functions, HT-SR module-wise metrics, PL fitting methods, and PL fitting time gaps through ablation studies (Section 4.4).
Researcher Affiliation Academia (1) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; (2) Peng Cheng Laboratory; (3) University of Chinese Academy of Sciences; (4) University of Texas at Austin; (5) Shenzhen Campus of Sun Yat-sen University; (6) Shenzhen University of Advanced Technology; (7) University of Oxford; (8) University of Surrey
Pseudocode Yes Algorithm 1: AlphaDecay
Open Source Code Yes The code is available at https://github.com/heducas/AlphaDecay.
Open Datasets Yes All experiments employ the C4 dataset [31], a rigorously processed subset of Common Crawl widely adopted for language model pretraining.
Dataset Splits No Our experimental design incorporates two key components: (1) a non-repeating data regime with sufficient tokens for convergence, and (2) standardized preprocessing pipelines across all model scales. This multi-scale approach facilitates systematic comparison of model behaviors across different capacity regimes, while minimizing potential confounding factors in the analysis.
Hardware Specification Yes The computation times reflect the NVIDIA A100 hours utilized for completing model training.
Software Dependencies No The paper mentions 'Adam optimizer' and 'AdamW optimizer' for training, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation of the methodology.
Experiment Setup Yes Table 1: Hyperparameters used in pre-training experiments. Model Size LR Tokens Weight Decay (s1, s2) ... Table 6: Hyperparameters of LLaMa models used in this paper. Params Hidden Intermediate Heads Layers Steps Data amount LR Batch Size ... Hyperparameters. The detailed hyperparameter settings for all model sizes are summarized in Table 1. All models are trained with Adam optimizer (gradient clipping at 1.0) and a cosine learning rate schedule, with 10% of the training tokens used for learning rate warmup.
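
The schedule quoted in the Experiment Setup row (cosine decay with 10% warmup) can be illustrated with a minimal sketch. This is not the authors' code. The function name `lr_at_step`, the linear warmup shape, the decay floor `min_lr=0.0`, and counting in optimizer steps rather than tokens are all assumptions made for illustration; the quoted text states only the cosine schedule and the 10% warmup fraction. Gradient clipping at 1.0 is not shown.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a 0-indexed step: warmup, then cosine decay.

    Assumptions (not stated in the source): warmup is linear, the decay
    floor is min_lr, and progress is measured in steps instead of tokens.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values chosen for illustration, not taken from Table 1.
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps=1000, peak_lr=1e-3))
```

With a framework optimizer, a function like this would typically be evaluated once per step and written into each parameter group's learning rate; the values that would reproduce the paper's runs are those in its Table 1.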