Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond Fixed Length: Bucket Pre-training is All You Need
Authors: Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments and the results demonstrate that our proposed method significantly enhances both the efficiency and effectiveness of LLM pre-training. To comprehensively evaluate our proposed multi-bucket data composition method, we conduct experiments to address the following research questions: RQ1: How does our multi-bucket method improve data composition quality? ... RQ2: Does our proposed method lead to better model performance in LLM pre-training? ... Table 1 presents the comparison of different methods across three data composition quality metrics. ... Table 2 presents the evaluation results across seven standard NLP benchmarks. Our Bucket LLM consistently outperforms all baseline methods across all tasks. |
| Researcher Affiliation | Collaboration | 1Harbin Institute of Technology, Harbin, Heilongjiang, China 2Du Xiaoman Financial Technology, Beijing, China 3Tianjin University, Tianjin, China |
| Pseudocode | Yes | Algorithm 1 Bucket LLM Require: All Documents DS, preset bucket set B with capacities, padding threshold P, Pool Size S Ensure: All Training buckets TB ... Algorithm 2 Generate_a_Bucket Require: A Documents Pool D, preset bucket set B with capacities, padding threshold P Ensure: A Training bucket T, updated Documents Pool D |
| Open Source Code | Yes | Our proposed method has been adopted in the Du Xiaoman XuanYuan series of financial large language models at https://github.com/Duxiaoman-DI/XuanYuan. |
| Open Datasets | Yes | Datasets We utilize FineWeb-Edu [Penedo et al., 2024], a high-quality pre-training dataset derived from Common Crawl. |
| Dataset Splits | No | We utilize FineWeb-Edu [Penedo et al., 2024], a high-quality pre-training dataset derived from Common Crawl. We conduct our experiments on a representative subset consisting of 100B tokens (approximately 98M documents). ... We evaluate each model on a comprehensive set of standard benchmarks under the Lighteval framework1, including arc-challenge [Clark et al., 2018], arc-easy [Clark et al., 2018], commonsense-qa [Talmor et al., 2019], hellaswag [Zellers et al., 2019], mmlu average [Hendrycks et al., 2020], openbook-qa [Mihaylov et al., 2018], and piqa [Bisk et al., 2020]. |
| Hardware Specification | Yes | All models are trained using the DeepSpeed framework on 8 nodes, each equipped with 8 NVIDIA A800 GPUs (64 GPUs in total). |
| Software Dependencies | No | All models are trained using the DeepSpeed framework on 8 nodes... For comparison, we select three widely used fixed-length baselines (Fixed-2048, Fixed-4096, and Fixed-8192) and the DD method. All models are trained using the DeepSpeed framework... We conduct extensive pre-training experiments using a 1B parameter model based on the Llama3.1 architecture [Dubey et al., 2024]. |
| Experiment Setup | Yes | We maintain consistent training configurations across all methods, including an initial learning rate of 2e-4. ... Table 3 presents the distribution of training data across different bucket sizes and their corresponding training throughput: bucket 1,024: 6.52% of data, batch size 48, 16,793 tokens/s (+22.62%); bucket 2,048: 16.58%, batch size 24, 16,263 tokens/s (+18.74%); bucket 4,096: 32.34%, batch size 12, 15,505 tokens/s (+13.21%); bucket 8,192: 27.17%, batch size 6, 13,696 tokens/s (+0.00%); bucket 16,384: 17.39%, batch size 3, 11,129 tokens/s (-18.74%). |
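The pseudocode evidence above only names the inputs and outputs of the paper's Algorithms 1 and 2 (documents pool, preset bucket capacities, padding threshold). The sketch below is a hypothetical reconstruction of that multi-bucket data composition idea, not the authors' implementation: documents are greedily packed longest-first, and each packed group is assigned the smallest preset capacity whose padding fraction stays within the threshold. Function and variable names (`pack_buckets`, `pad_threshold`) are illustrative assumptions.

```python
def pack_buckets(doc_lengths, capacities, pad_threshold):
    """Greedily pack documents (token counts) into preset bucket sizes.

    Hypothetical sketch: fill up to the largest preset capacity longest-first,
    then shrink to the smallest capacity whose padding fraction is at most
    `pad_threshold`. Documents longer than the largest capacity are skipped
    (in practice they would be split or truncated).
    """
    largest = max(capacities)
    pool = sorted((d for d in doc_lengths if d <= largest), reverse=True)
    buckets = []
    while pool:
        packed, used = [], 0
        for doc in list(pool):          # longest-first greedy fill
            if used + doc <= largest:
                packed.append(doc)
                used += doc
                pool.remove(doc)
        # choose the smallest preset capacity with acceptable padding
        cap = largest
        for c in sorted(capacities):
            if used <= c and (c - used) / c <= pad_threshold:
                cap = c
                break
        buckets.append((cap, packed))
    return buckets


# Example: four documents totalling 2,000 tokens fit one 2,048-token bucket
# with ~2.3% padding, well under a 10% threshold.
buckets = pack_buckets([900, 700, 300, 100], capacities=[1024, 2048], pad_threshold=0.1)
```

Note the design trade-off the paper's Table 3 reflects: smaller buckets allow larger batch sizes and higher throughput (tokens/s), so packing short documents into small buckets rather than padding them to a single fixed length is what drives the reported speedups.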