Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Authors: Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Ran Jiao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical and experimental analyses to show that foundation model pretraining loss and its descent velocity are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly. |
| Researcher Affiliation | Industry | Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran ByteDance Inc. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Adaptive Learning Rate Scheduling (AdaLRS) |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release the proposed benchmark and code after the paper is accepted. |
| Open Datasets | Yes | We use a total of approximately 64M samples from the SlimPajama [42] dataset for LLM training from scratch, with all samples shuffled randomly. For VLM pretraining, on the other hand, we adopt the model structure of SAIL-VL [16], with InternViT-300M [13] and Qwen2.5-1.5B adopted as backbone models. We use a collection of detail caption and image OCR data to train the vision-to-language projector from scratch. Detail caption datasets are curated via a similar recaption procedure as described in SAIL-VL [16], while OCR data is collected from a series of open-source datasets [6, 45, 20]. |
| Dataset Splits | No | For LLM pretraining, we quantify the performance advantage of models trained with AdaLRS with final training loss and perplexity (PPL) computed on SlimPajama [42] train, validation, and test splits. |
| Hardware Specification | Yes | Approximately 120B and 160B tokens are used for LLM and VLM pretraining, with roughly 10,000 and 20,000 910B NPU hours consumed for 2B and 7B model pretraining experiments. |
| Software Dependencies | No | The paper mentions using a 'cosine learning rate scheduler [32]' and 'WSD [22] scheduler', but does not specify software library names with version numbers for implementation. |
| Experiment Setup | Yes | Table 1: Detailed hyperparameters for the main experiments. Fit, Large, and Small refer to appropriate, too large, and too small learning rates, respectively. BSZ stands for batch size. Values are listed per experiment column, in table order. Learning Rate (Fit / Large / Small): 2e-4 / 2e-3 / 2e-5; 2e-4 / 2e-3 / 2e-5; 8e-3 / 4e-1 / 2e-4. BSZ / Micro BSZ: 1024/512; 2048/512; 2048/1024. Window Size k: 2500; 2000; 1000. Data Composition: Detail Caption & OCR; SlimPajama Train Set; SlimPajama Train Set. Search Step Ratio: [0.1, 0.4]; [0.1, 0.35]; [0.1, 0.35]. We set the upscaling factor α, downscaling factor β, and decaying factor λ as 3, 2, and 0.99 in all experiments for LR adjustment effectiveness. |
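
The Research Type and Experiment Setup rows refer to a loss descent velocity measured over a window of k steps, and to an upscaling factor α = 3 and a downscaling factor β = 2. This page does not reproduce Algorithm 1, so the Python sketch below is only an illustration of how such quantities might fit together, not the authors' method. The function names `descent_velocity` and `next_lr` are hypothetical, the velocity estimate (a least-squares slope) and the decision rule are assumptions, and the decaying factor λ is omitted because the page does not say how it is applied.

```python
from statistics import mean


def descent_velocity(losses):
    """Negated least-squares slope of loss against step index over one window.

    A positive value means the loss is falling. This estimator is an
    assumption; the paper may define descent velocity differently.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two loss values")
    x_mean = (n - 1) / 2
    y_mean = mean(losses)
    cov = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(losses))
    var = sum((i - x_mean) ** 2 for i in range(n))
    return -cov / var


def next_lr(lr, velocity_before, velocity_after_trial, alpha=3.0, beta=2.0):
    """Hypothetical rule, not Algorithm 1 from the paper.

    If the loss fell faster in the window after a trial upscaling than in
    the window before it, return the upscaled rate (lr * alpha); otherwise
    return the downscaled rate (lr / beta).
    """
    if velocity_after_trial > velocity_before:
        return lr * alpha
    return lr / beta


if __name__ == "__main__":
    window = [4.0, 3.5, 3.0, 2.5]  # made-up loss values for illustration
    print(descent_velocity(window))
    print(next_lr(2e-5, velocity_before=0.1, velocity_after_trial=0.3))
```

In practice the window would hold the k most recent training losses (1000 to 2500 in the table above), and the adjustment would presumably only run inside the listed search step ratio of training.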