Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond Fixed Length: Bucket Pre-training is All You Need

Authors: Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu

IJCAI 2025

Reproducibility variables, classification results, and the supporting LLM responses:
Research Type: Experimental. "We conduct extensive experiments and the results demonstrate that our proposed method significantly enhances both the efficiency and effectiveness of LLM pre-training. To comprehensively evaluate our proposed multi-bucket data composition method, we conduct experiments to address the following research questions: RQ1: How does our multi-bucket method improve data composition quality? ... RQ2: Does our proposed method lead to better model performance in LLM pre-training? ... Table 1 presents the comparison of different methods across three data composition quality metrics. ... Table 2 presents the evaluation results across seven standard NLP benchmarks. Our Bucket LLM consistently outperforms all baseline methods across all tasks."
Researcher Affiliation: Collaboration. "¹Harbin Institute of Technology, Harbin, Heilongjiang, China; ²Du Xiaoman Financial Technology, Beijing, China; ³Tianjin University, Tianjin, China"
Pseudocode: Yes. "Algorithm 1 Bucket LLM. Require: all documents DS, preset bucket set B with capacities, padding threshold P, pool size S. Ensure: all training buckets TB. ... Algorithm 2 Generate_a_Bucket. Require: a document pool D, preset bucket set B with capacities, padding threshold P. Ensure: a training bucket T, updated document pool D."
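The two algorithms above can be sketched in code. This is a hypothetical illustration of the bucket-packing idea only: the first-fit selection policy, the truncation of oversized documents, and the pad token id are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Algorithms 1-2 (illustrative, not the paper's code).
PAD_ID = 0  # assumed padding token id

def generate_a_bucket(pool, capacities, pad_threshold):
    """Algorithm 2 analogue: fill one training bucket from the pool.

    Picks the smallest preset capacity that fits the first document,
    then greedily packs further documents first-fit until the unused
    space drops below pad_threshold; the remainder is padded.
    Returns the bucket and the updated pool.
    """
    doc = pool.pop(0)
    # Smallest capacity that holds the document; documents longer than
    # the largest bucket are truncated (a simplification).
    cap = min((c for c in sorted(capacities) if c >= len(doc)),
              default=max(capacities))
    bucket = doc[:cap]
    free = cap - len(bucket)
    i = 0
    while free >= pad_threshold and i < len(pool):
        if len(pool[i]) <= free:
            bucket.extend(pool.pop(i))
            free = cap - len(bucket)
        else:
            i += 1
    bucket.extend([PAD_ID] * free)
    return bucket, pool

def build_buckets(documents, capacities, pad_threshold, pool_size):
    """Algorithm 1 analogue: stream documents through a bounded pool
    and emit fixed-capacity training buckets."""
    docs, pool, buckets = list(documents), [], []
    while docs or pool:
        while docs and len(pool) < pool_size:
            pool.append(docs.pop(0))
        bucket, pool = generate_a_bucket(pool, capacities, pad_threshold)
        buckets.append(bucket)
    return buckets
```

Because every emitted bucket has one of the preset lengths, batches can be formed per bucket size, which is what makes the per-bucket batch sizes and throughputs in Table 3 possible.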
Open Source Code: Yes. "Our proposed method has been adopted in the Du Xiaoman XuanYuan series of financial large language models at https://github.com/Duxiaoman-DI/XuanYuan."
Open Datasets: Yes. "Datasets: We utilize FineWeb-Edu [Penedo et al., 2024], a high-quality pre-training dataset derived from Common Crawl."
Dataset Splits: No. "We utilize FineWeb-Edu [Penedo et al., 2024], a high-quality pre-training dataset derived from Common Crawl. We conduct our experiments on a representative subset consisting of 100B tokens (approximately 98M documents). ... We evaluate each model on a comprehensive set of standard benchmarks under the Lighteval framework, including arc-challenge [Clark et al., 2018], arc-easy [Clark et al., 2018], commonsense-qa [Talmor et al., 2019], hellaswag [Zellers et al., 2019], mmlu-average [Hendrycks et al., 2020], openbook-qa [Mihaylov et al., 2018], and piqa [Bisk et al., 2020]."
Hardware Specification: Yes. "All models are trained using the DeepSpeed framework on 8 nodes, each equipped with 8 NVIDIA A800 GPUs (64 GPUs in total)."
Software Dependencies: No. "All models are trained using the DeepSpeed framework on 8 nodes... For comparison, we select three widely used fixed-length baselines (Fixed-2048, Fixed-4096, and Fixed-8192) and the DD method. ... We conduct extensive pre-training experiments using a 1B parameter model based on the Llama 3.1 architecture [Dubey et al., 2024]."
Experiment Setup: Yes. "We maintain consistent training configurations across all methods, including an initial learning rate of 2e-4. ... Table 3 presents the distribution of training data across different bucket sizes and their corresponding training throughput."

bucket    data ratio   batch size   speed (tokens/s)
1,024     6.52%        48           16,793 (+22.62%)
2,048     16.58%       24           16,263 (+18.74%)
4,096     32.34%       12           15,505 (+13.21%)
8,192     27.17%       6            13,696 (+0.00%)
16,384    17.39%       3            11,129 (-18.74%)
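The per-bucket figures above can be combined into an overall throughput estimate. Assuming the data ratios are token fractions (an interpretation, not stated in the excerpt), the effective speed is the time-weighted harmonic mean of the per-bucket speeds:

```python
# Table 3 figures: bucket size -> (data ratio, tokens/s).
# Assumes ratios are token fractions; overall throughput is then the
# ratio-weighted harmonic mean of per-bucket speeds.
rows = {
    1024:  (0.0652, 16793),
    2048:  (0.1658, 16263),
    4096:  (0.3234, 15505),
    8192:  (0.2717, 13696),
    16384: (0.1739, 11129),
}

time_per_token = sum(r / s for r, s in rows.values())  # seconds per token
effective = 1.0 / time_per_token                       # tokens/s overall
baseline = rows[8192][1]                               # Fixed-8192 reference
gain = 100.0 * (effective / baseline - 1.0)
print(f"effective ~ {effective:,.0f} tokens/s ({gain:+.1f}% vs Fixed-8192)")
```

Under this reading, the mixed-bucket schedule runs a few percent faster overall than the Fixed-8192 baseline, even though the 16,384 bucket alone is slower.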