Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Authors: Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Ran Jiao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical and experimental analyses to show that foundation model pretraining loss and its descent velocity are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly.
Researcher Affiliation | Industry | Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran. ByteDance Inc. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Adaptive Learning Rate Scheduling (AdaLRS)
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release the proposed benchmark and code after the paper is accepted.
Open Datasets | Yes | We use a total of approximately 64M samples from the SlimPajama [42] dataset for LLM training from scratch, with all samples shuffled randomly. For VLM pretraining, on the other hand, we adopt the model structure of SAIL-VL [16], with InternViT-300M [13] and Qwen2.5-1.5B adopted as backbone models. We use a collection of detail caption and image OCR data to train the vision-to-language projector from scratch. Detail caption datasets are curated via a similar recaption procedure as described in SAIL-VL [16], while OCR data is collected from a series of open-source datasets [6, 45, 20].
Dataset Splits | No | For LLM pretraining, we quantify the performance advantage of models trained with AdaLRS with final training loss and perplexity (PPL) computed on SlimPajama [42] train, validation, and test splits.
Hardware Specification | Yes | Approximately 120B and 160B tokens are used for LLM and VLM pretraining, with roughly 10,000 and 20,000 910B NPU hours consumed for 2B and 7B model pretraining experiments.
Software Dependencies | No | The paper mentions using a 'cosine learning rate scheduler [32]' and 'WSD [22] scheduler', but does not specify software library names with version numbers for implementation.
Experiment Setup | Yes | Table 1: Detailed hyperparameters for the main experiments. Fit, Large, and Small refer to appropriate, too large, and too small learning rates, respectively. BSZ stands for batch size. Hyperparameter: Fit, Large, Small. Learning Rate: 2e-4, 2e-3, 2e-5; 2e-4, 2e-3, 2e-5; 8e-3, 4e-1, 2e-4. BSZ / Micro BSZ: 1024/512; 2048/512; 2048/1024. Window Size k: 2500; 2000; 1000. Data Composition: Detail Caption & OCR; SlimPajama Train Set; SlimPajama Train Set. Search Step Ratio: [0.1, 0.4]; [0.1, 0.35]; [0.1, 0.35]. We set the upscaling factor α, downscaling factor β, and decaying factor λ as 3, 2, and 0.99 in all experiments for LR adjustment effectiveness.
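
The quantities quoted in the Experiment Setup row (window size k, upscaling factor α = 3, downscaling factor β = 2, and the search step ratio) can be made concrete with a short sketch. This is not the paper's Algorithm 1, which is not reproduced on this page. The function names `loss_slope`, `in_search_phase`, and `propose_lr` are hypothetical, the accept/reject rule is an assumption, and reading "Search Step Ratio" as a fraction of total training steps is also an assumption. The decaying factor λ is left out because the quoted text does not say how it is applied.

```python
from typing import Sequence, Tuple


def loss_slope(losses: Sequence[float]) -> float:
    """Least-squares slope of loss against step index over one window.

    A more negative slope means the loss is descending faster.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two loss values")
    mean_x = (n - 1) / 2.0
    mean_y = sum(losses) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(losses))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def in_search_phase(step: int, total_steps: int,
                    ratio: Tuple[float, float] = (0.1, 0.4)) -> bool:
    """Assumed reading: the ratio bounds are fractions of total training steps."""
    lo, hi = ratio
    return lo * total_steps <= step <= hi * total_steps


def propose_lr(lr: float, prev_slope: float, new_slope: float,
               alpha: float = 3.0, beta: float = 2.0) -> float:
    """Hypothetical rule, not taken from the paper.

    If the newest window descends faster than the previous one, scale the
    learning rate up by alpha; otherwise scale it down by beta.
    """
    if new_slope < prev_slope:
        return lr * alpha
    return lr / beta


if __name__ == "__main__":
    window = [4.0, 3.0, 2.0]  # toy window; the table uses k = 1000 to 2500
    print(loss_slope(window))
    print(in_search_phase(step=2000, total_steps=10000))
    print(propose_lr(2e-5, prev_slope=-0.001, new_slope=-0.002))
```

Under these assumptions, a run that starts from the "Small" learning rate would be scaled up while each new window shows faster loss descent, and scaled back down once it does not.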