Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Authors: Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4× with competitive performance.
Researcher Affiliation	Collaboration	1 Shanghai Jiao Tong University; 2 SenseTime Research; 3 Central South University; 4 The Chinese University of Hong Kong; 5 Beihang University
Pseudocode	Yes	Algorithm 1: Hierarchical Groups Auto Selection; Algorithm 2: Balance Packing; Algorithm 3: Find Best Sp Ckpt Function; Algorithm 4: Greedy Profile Ckpt; Algorithm 5: Group Data Function; Algorithm 6: Greedy Fill Function; Algorithm 7: Attention Balance Sort Function
Open Source Code	No	Codes will be released at https://github.com/ModelTC/HBP.
Open Datasets	Yes	We use large-scale datasets: Tulu3 (32K) [14] for general tasks and LongCite (128K) [15] for long-context tasks. ... Experiments on various models at different scales like LLama3.1-8B [1], Qwen2.5-32B [2], Qwen2.5-72B [2], and DeepSeek-V2 (236B) demonstrate consistent improvements... To verify the generalization of our method, we conducted experiments on different datasets (OpenHermes [16], LongWriter [23]).
Dataset Splits	No	The paper mentions using 'Tulu3 (32K) [14]' and 'LongCite (128K) [15]' datasets and states 'For the Longsite dataset, approximately 2k samples are uniformly sampled.' However, it does not specify explicit training, validation, or test splits by percentage or absolute counts for any of the datasets used to reproduce the experiments.
Hardware Specification	Yes	Most models are trained on 32x H100 80GB GPUs using the DeepSpeed [28], while DeepSeek-V2 (236B) is trained with the Megatron-LM [29] with 256x H100 80G GPUs.
Software Dependencies	No	The paper mentions software frameworks like 'DeepSpeed [28]' and 'Megatron-LM [29]', and a specific approach 'DeepSpeed-Ulysses [10]', but it does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup	Yes	We use a learning rate of 1e-5, weight decay of 0.01, and adopt AdamW as our optimizer.
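
For readers planning a rerun, the settings quoted in the table can be collected into a small record that also flags what is still unknown. The sketch below is illustrative only and is not taken from the paper or its code: the class name `ReportedSetup`, the helper `missing_fields`, and the `batch_size` and `framework_version` fields are hypothetical. Only the optimizer, learning rate, weight decay, GPU counts, and framework names come from the table.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class ReportedSetup:
    """Settings quoted in the table; None marks values the table does not give."""
    optimizer: str = "AdamW"                  # Experiment Setup row
    learning_rate: float = 1e-5               # Experiment Setup row
    weight_decay: float = 0.01                # Experiment Setup row
    gpu_count: int = 32                       # "32x H100 80GB" for most models
    framework: str = "DeepSpeed"              # Hardware Specification row
    framework_version: Optional[str] = None   # versions are not reported
    batch_size: Optional[int] = None          # hypothetical field, not quoted in the table


def missing_fields(setup: ReportedSetup) -> List[str]:
    """Return the names of settings that are still unset (None)."""
    return sorted(name for name, value in asdict(setup).items() if value is None)


# Most models use the defaults; DeepSeek-V2 (236B) overrides hardware and framework.
most_models = ReportedSetup()
deepseek_v2 = ReportedSetup(gpu_count=256, framework="Megatron-LM")

print(missing_fields(most_models))
```

Anything listed by `missing_fields` would have to be taken from the paper itself or from the released code before the experiments could be repeated.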