D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Authors: Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Jiamang Wang, Lin Qu, Wenbo Su, Bo Zheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.
Researcher Affiliation | Collaboration | 1 Taobao & Tmall Group of Alibaba, 2 Alibaba Group, 3 University of Waterloo, 4 University of Manchester, 5 QMUL, 6 HKUST, 7 M-A-P
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The code belongs to the company's intellectual property, but the data can be downloaded from open-source repositories.
Open Datasets | Yes | For the general corpus, we use Dolma [48]. For example, the Dolma dataset can be downloaded from https://github.com/allenai/dolma. In the main text, we cited all the data sources.
Dataset Splits | Yes | Specifically, we test the validation loss every 1,000 steps and the total training steps are 200k. Then, we establish 9 mixture ratios between general corpus and domain corpus as follows: {0:10, 1:9, 2:8, 3.3:6.7, 5:5, 6.7:3.3, 8:2, 9:1, 10:0}. We use 3-fold cross-validation to evaluate the model size generalizability of D-CPT Law. (A data-mixing sketch based on these ratios follows the table.)
Hardware Specification | Yes | Our main experiment requires approximately 150k hours of runtime on a single A100.
Software Dependencies | No | The paper mentions software like 'MATLAB' and algorithms such as 'L-BFGS' but does not specify version numbers for any key software components or libraries. (A hedged curve-fitting sketch follows the table.)
Experiment Setup | Yes | Training Setup: We follow Chinchilla [27] to fix model sizes and vary the number of training tokens for data point collection. Specifically, we test the validation loss every 1,000 steps and the total training steps are 200k. Then, we establish 9 mixture ratios between general corpus and domain corpus as follows: {0:10, 1:9, 2:8, 3.3:6.7, 5:5, 6.7:3.3, 8:2, 9:1, 10:0}. Note that all experiments are conducted with the same learning rate schedule (hyperparameters can be found in Appendix F.2). Table 12 (hyperparameters): Warm-up Steps: 0; Gradient Accumulation Steps: 4; Train Batch Size Per Device: 4; Max Sequence Length: 2048; Learning Rate: 3e-5; Learning Rate Scheduler: cosine; Number of GPUs: 16. (A worked token-budget calculation from these values follows the table.)
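The mixture ratios quoted under Dataset Splits map naturally onto per-document sampling probabilities between the general corpus (Dolma) and the domain corpus. The following is a minimal sketch of that mapping, not the authors' data pipeline: the names `MIXTURE_RATIOS` and `sample_source` are ours, and reading each pair as general:domain parts that sum to 10 is an assumption based on the paper's phrasing.

```python
import random

# Mixture ratios from the paper, read as general:domain parts summing to 10.
# (Assumption: the first number is the general-corpus share, the second the
# domain-corpus share, matching the order "general-corpus and domain-corpus".)
MIXTURE_RATIOS = [(0, 10), (1, 9), (2, 8), (3.3, 6.7), (5, 5),
                  (6.7, 3.3), (8, 2), (9, 1), (10, 0)]

def sample_source(general_share, domain_share, rng=random):
    """Pick which corpus the next training document is drawn from."""
    p_general = general_share / (general_share + domain_share)
    return "general" if rng.random() < p_general else "domain"

# Example: the 2:8 setting draws roughly 20% of documents from the general
# corpus and 80% from the domain corpus.
general, domain = MIXTURE_RATIOS[2]
counts = {"general": 0, "domain": 0}
for _ in range(10_000):
    counts[sample_source(general, domain)] += 1
print(counts)  # approximately {'general': 2000, 'domain': 8000}
```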
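The Software Dependencies row notes that the scaling-law parameters are fitted with L-BFGS (the paper mentions MATLAB). Since the D-CPT Law's functional form is not reproduced in this table, the sketch below fits a deliberately generic placeholder form L(D) = E + B / D^beta to synthetic points with SciPy's L-BFGS-B optimizer, purely to illustrate that kind of fitting procedure; the functional form, the parameters E, B, beta, and the synthetic data are all hypothetical and are not the paper's parameterization.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic "training tokens vs. validation loss" points, standing in for the
# data collected every 1,000 steps in the paper's runs (placeholder values).
D = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
loss = 1.8 + 2.5e2 / D**0.3 + np.random.default_rng(0).normal(0, 0.01, D.size)

def objective(params):
    """Mean squared error of a generic L(D) = E + B / D**beta fit.
    This form is a placeholder, not the D-CPT Law itself."""
    E, logB, beta = params
    pred = E + np.exp(logB) / D**beta
    return np.mean((pred - loss) ** 2)

# L-BFGS-B is SciPy's limited-memory BFGS variant; B is parameterized in log
# space so the fitted coefficient stays positive.
res = minimize(objective, x0=[1.0, np.log(100.0), 0.5], method="L-BFGS-B")
E, logB, beta = res.x
print(f"E={E:.3f}, B={np.exp(logB):.1f}, beta={beta:.3f}")
```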
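The Table 12 hyperparameters, combined with the 200k training steps, also fix the effective batch size and an approximate token budget per run. The calculation below uses only the listed values and assumes every sequence is packed to the full 2048-token maximum length, which the paper does not state explicitly.

```python
# Effective batch size and approximate token budget, derived from Table 12.
# Assumption: sequences are packed to the full max length of 2048 tokens.
per_device_batch = 4
grad_accum = 4
num_gpus = 16
max_seq_len = 2048
total_steps = 200_000

sequences_per_step = per_device_batch * grad_accum * num_gpus  # 256
tokens_per_step = sequences_per_step * max_seq_len             # 524,288
total_tokens = tokens_per_step * total_steps                   # ~1.05e11

print(f"{sequences_per_step} sequences/step, "
      f"{tokens_per_step:,} tokens/step, "
      f"~{total_tokens / 1e9:.0f}B tokens over {total_steps:,} steps")
```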