Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Authors: Tongyao Zhu, Qian Liu, Haonan Wang, Shiqi Chen, Xiangming Gu, Tianyu Pang, Min-Yen Kan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we pretrain 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. ... Empirical results on 1B-parameter models (up to 32K context window) and 3B-parameter models (up to 8K context window) on 100B tokens demonstrate that SkyLadder outperforms naive long-context pretraining baselines, in both short- and long-context evaluation tasks.
Researcher Affiliation | Collaboration | Tongyao Zhu (1, 2), Qian Liu (2), Haonan Wang (1), Shiqi Chen (3), Xiangming Gu (1), Tianyu Pang (2), Min-Yen Kan (1). Affiliations: (1) National University of Singapore; (2) Sea AI Lab; (3) City University of Hong Kong.
Pseudocode | Yes | A.5 Implementation: We provide the pseudocode for implementing SkyLadder with FlashAttention 2 [7]. The only change is to apply local causal masking with size w, and combine them with the original document boundaries under the IntraDoc scenario. It can easily be integrated into any model before calculating attention. The rest of the training pipeline remains unchanged.
Open Source Code | Yes | Project code is at https://github.com/sail-sg/SkyLadder ... We include the code in the supplementary materials, and will open source the code upon acceptance.
Open Datasets | Yes | Given the substantial computational cost associated with retrieval in semantic packing, we randomly select around 30B tokens from the Common Crawl (CC) subset of the SlimPajama dataset [46] as the pretraining corpus. ... To address potential concerns that the benefits observed in short contexts may stem from the high level of noise in CC, we conduct additional experiments using the FineWeb-Pro dataset [65], a carefully curated high-quality dataset containing 100B tokens. ... We mainly use the following public datasets or codebases in this paper: SlimPajama [46] following the Common Crawl Foundation Terms of Use, FineWeb-Pro [65] with an ODC-By 1.0 license, and TinyLlama [61] with an Apache 2.0 License.
Dataset Splits | No | Evaluation. For all model sizes, we use perplexity (PPL) on validation documents from the original dataset as a key metric, in line with established practices [10, 24, 17]. Note that when comparing models across different context windows (e.g., a 2K-context model and an 8K-context model), we must ensure the evaluation sequence fits within the shorter model's context window to maintain a fair comparison.
Hardware Specification | Yes | We conducted all of our experiments for models with 1B size on an internal cluster of NVIDIA A100 nodes with 40G memory. Experiments with 3B models were conducted on H100 nodes.
Software Dependencies | No | We pretrain models from scratch using the TinyLlama codebase [61] ... All baseline and SkyLadder models are implemented with FlashAttention 2 [7] (pseudocode in A.5). ... Optimizer: AdamW
Experiment Setup | Yes | We set w_s = 32 and α = 1/8 by default... We fix all other hyperparameters, such as the learning rate schedule, batch size, etc., for fair comparison. ... Table 12: Hyperparameters setup for pretraining the language models. All pretrained models follow the same structure. Parameter: Value. Optimizer: AdamW; AdamW-β1: 0.9; AdamW-β2: 0.95; Learning Rate Schedule: Cosine; Peak Learning Rate: 4e-4; Minimum Learning Rate: 4e-5; Warmup Steps: 2000; Gradient Norm Clipping: 1; Total Steps: 100,000; Global Batch Size: 1,048,576 (2^20) tokens; Weight Decay: 0.1
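
Illustration for the Pseudocode row: the quoted passage describes a local causal mask of size w combined with document boundaries. The Python sketch below shows one possible reading of that description and is not the authors' A.5 pseudocode. It assumes the packed sequence is cut into consecutive chunks of w positions, and that a token may attend only to positions at or before it that lie in the same chunk and the same document. The function name `local_causal_doc_mask`, the `doc_ids` representation, and the chunk alignment are hypothetical; a sliding-window reading of "local" would produce a different mask.

```python
def local_causal_doc_mask(doc_ids, w):
    """Build a boolean attention mask for one packed sequence.

    doc_ids[i] is the (hypothetical) document index of token i.
    mask[i][j] is True when query i may attend to key j, i.e. when
      * j <= i                      (causal),
      * i // w == j // w            (same chunk of w positions; an assumption),
      * doc_ids[i] == doc_ids[j]    (same document).
    """
    if w <= 0:
        raise ValueError("window size w must be positive")
    n = len(doc_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        chunk_start = (i // w) * w  # first position of the chunk containing i
        for j in range(chunk_start, i + 1):
            if doc_ids[j] == doc_ids[i]:
                mask[i][j] = True
    return mask


# Two packed documents of three tokens each, with a local window of 2.
example = local_causal_doc_mask([0, 0, 0, 1, 1, 1], w=2)
for row in example:
    print("".join("x" if allowed else "." for allowed in row))
```

When w is at least the sequence length, every position falls in one chunk and the mask reduces to plain causal masking within each document.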
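
Illustration for the Experiment Setup row: the Table 12 values can be turned into a small Python sketch to check the token budget and visualize the schedules. Several details are assumptions that the excerpt does not state: warmup is taken to be linear, the cosine decay is taken to run from the peak to the minimum learning rate over the remaining steps, w_s is read as the starting context window, and α as the number of window positions added per training step in a linear schedule. The names `learning_rate`, `context_window`, and `FINAL_WINDOW` (set to 8192 here purely as an example) are hypothetical.

```python
import math

# Values quoted from Table 12 and the default setting in the row above.
PEAK_LR = 4e-4
MIN_LR = 4e-5
WARMUP_STEPS = 2000
TOTAL_STEPS = 100_000
GLOBAL_BATCH_TOKENS = 2 ** 20  # 1,048,576 tokens per step
START_WINDOW = 32              # w_s
ALPHA = 1 / 8                  # α
FINAL_WINDOW = 8192            # hypothetical target window, for illustration only


def learning_rate(step):
    """Linear warmup (assumed), then cosine decay from PEAK_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


def context_window(step, final_window=FINAL_WINDOW):
    """Assumed linear window schedule: grow by ALPHA per step, capped at final_window."""
    return min(final_window, START_WINDOW + math.floor(ALPHA * step))


# Token budget implied by Table 12: total steps times tokens per step.
total_tokens = TOTAL_STEPS * GLOBAL_BATCH_TOKENS
print(f"total tokens: {total_tokens:,}")
print(f"lr at step 0: {learning_rate(0):.2e}, at final step: {learning_rate(TOTAL_STEPS):.2e}")
print(f"window at step 0: {context_window(0)}, at final step: {context_window(TOTAL_STEPS)}")
```

The product of total steps and global batch size is about 104.9 billion tokens, which is consistent with the "100B tokens" figure quoted in the Research Type row.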