Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Authors: Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4× with competitive performance.
Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University; (2) SenseTime Research; (3) Central South University; (4) The Chinese University of Hong Kong; (5) Beihang University
Pseudocode | Yes | Algorithm 1: Hierarchical Groups Auto Selection; Algorithm 2: Balance Packing; Algorithm 3: Find Best Sp Ckpt Function; Algorithm 4: Greedy Profile Ckpt; Algorithm 5: Group Data Function; Algorithm 6: Greedy Fill Function; Algorithm 7: Attention Balance Sort Function
Open Source Code | No | Codes will be released at https://github.com/ModelTC/HBP.
Open Datasets | Yes | We use large-scale datasets: Tulu3 (32K)[14] for general tasks and LongCite (128K)[15] for long-context tasks. ... Experiments on various models at different scales like LLama3.1-8B [1], Qwen2.5-32B [2], Qwen2.5-72B [2], and DeepSeek-V2 (236B) demonstrate consistent improvements... To verify the generalization of our method, we conducted experiments on different datasets (OpenHermes[16], LongWriter[23]).
Dataset Splits | No | The paper mentions using 'Tulu3 (32K)[14]' and 'LongCite (128K)[15]' datasets and states 'For the Longsite dataset, approximately 2k samples are uniformly sampled.' However, it does not specify explicit training, validation, or test splits by percentage or absolute counts for any of the datasets used to reproduce the experiments.
Hardware Specification | Yes | Most models are trained on 32x H100 80GB GPUs using the DeepSpeed [28], while DeepSeek-V2 (236B) is trained with the Megatron-LM [29] with 256x H100 80G GPUs.
Software Dependencies | No | The paper mentions software frameworks like 'DeepSpeed [28]' and 'Megatron-LM [29]', and a specific approach 'DeepSpeed-Ulysses [10]', but it does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup | Yes | We use a learning rate of 1e-5, weight decay of 0.01, and adopt AdamW as our optimizer.
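
For readers who want to carry the reported settings into their own training script, the sketch below records the values quoted in the Hardware Specification and Experiment Setup rows as a small Python config object. It is an illustration only: the class name `ReportedSetup`, its field names, and the helper `optimizer_kwargs` are assumptions made here and do not come from the paper or its code release. Settings the paper does not report (batch size, schedule, library versions) are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; class and field names are hypothetical."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    weight_decay: float = 0.01
    gpus: int = 32                # reported for most models
    gpu_type: str = "H100 80GB"
    framework: str = "DeepSpeed"  # reported for most models


def optimizer_kwargs(setup: ReportedSetup) -> dict:
    """Return the two optimizer hyperparameters as keyword arguments.

    The key names follow the common `lr` / `weight_decay` convention;
    that mapping is an assumption, not something the paper specifies.
    """
    return {"lr": setup.learning_rate, "weight_decay": setup.weight_decay}


# Setup reported for most models.
default_setup = ReportedSetup()

# Setup reported for the 236B model: 256 GPUs with Megatron-LM.
deepseek_setup = ReportedSetup(gpus=256, framework="Megatron-LM")

print(asdict(default_setup))
print(optimizer_kwargs(deepseek_setup))
```

Keeping the two hardware variants as separate instances of one frozen record makes it explicit that only the GPU count and framework differ between them, while the optimizer settings are shared.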