D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
Authors: Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Jiamang Wang, Lin Qu, Wenbo Su, Bo Zheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law. |
| Researcher Affiliation | Collaboration | 1 Taobao & Tmall Group of Alibaba, 2 Alibaba Group, 3 University of Waterloo, 4 University of Manchester, 5 QMUL, 6 HKUST, 7 M-A-P |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The code belongs to the company's intellectual property, but the data can be downloaded from open-source repositories. |
| Open Datasets | Yes | For general corpus, we use Dolma [48]. For example, the Dolma dataset can be downloaded from https://github.com/allenai/dolma. In the main text, we cited all the data sources. |
| Dataset Splits | Yes | Specifically, we test the validation loss every 1,000 steps and the total training steps are 200k. Then, we establish 9 mixture ratios between general corpus and domain corpus as follows: {0:10, 1:9, 2:8, 3.3:6.7, 5:5, 6.7:3.3, 8:2, 9:1, 10:0}. We use 3-fold cross-validation to evaluate the model size generalizability of D-CPT Law. |
| Hardware Specification | Yes | Our main experiment requires approximately 150k hours of runtime on a single A100. |
| Software Dependencies | No | The paper mentions software like 'MATLAB' and algorithms such as 'L-BFGS' but does not specify version numbers for any key software components or libraries. (A hedged L-BFGS curve-fitting sketch follows the table.) |
| Experiment Setup | Yes | Training Setup: We follow Chinchilla [27] to fix model sizes and vary the number of training tokens for data point collection. Specifically, we test the validation loss every 1,000 steps and the total training steps are 200k. Then, we establish 9 mixture ratios between general corpus and domain corpus as follows: {0:10, 1:9, 2:8, 3.3:6.7, 5:5, 6.7:3.3, 8:2, 9:1, 10:0}. Note that all experiments are conducted with the same learning rate schedule (hyperparameters can be found in Appendix F.2). Table 12 (hyperparameters): Warm-up Steps: 0; Gradient Accumulation Steps: 4; Train Batch Size Per Device: 4; Max Sequence Length: 2048; Learning Rate: 3e-5; Learning Rate Scheduler: cosine; Number of GPUs: 16. (A data-mixing sketch illustrating the mixture ratios follows the table.) |
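The experiment setup row mixes a general corpus (e.g., Dolma) with a domain corpus at 9 fixed ratios. Below is a minimal sketch of how such a mixture could be sampled; the function name `mix_corpora`, the sampling scheme, and the toy document lists are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (assumed, not the authors' code): interleave a general corpus and a
# domain corpus so that a target fraction of documents comes from the general corpus.
import random

def mix_corpora(general_docs, domain_docs, general_ratio, seed=0):
    """Yield documents, drawing from the general corpus with probability
    `general_ratio` and from the domain corpus otherwise."""
    rng = random.Random(seed)
    gen_iter, dom_iter = iter(general_docs), iter(domain_docs)
    while True:
        try:
            if rng.random() < general_ratio:
                yield next(gen_iter)
            else:
                yield next(dom_iter)
        except StopIteration:
            # Stop once either corpus is exhausted.
            return

# The paper's 9 general:domain mixture ratios, expressed as general-corpus fractions.
MIXTURE_RATIOS = [0.0, 0.1, 0.2, 0.33, 0.5, 0.67, 0.8, 0.9, 1.0]

# Toy usage with placeholder documents.
general = [f"general-doc-{i}" for i in range(100)]
domain = [f"domain-doc-{i}" for i in range(100)]
mixed = list(mix_corpora(general, domain, general_ratio=0.33))
```

One stream per ratio, trained with the fixed hyperparameters from Table 12, would yield the loss curves used to fit the D-CPT Law.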
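The software-dependencies row notes that the authors fit their law with L-BFGS (via MATLAB). As a hedged illustration, the sketch below fits a generic Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β with SciPy's L-BFGS-B optimizer; this placeholder form, the mean-squared-error objective, and the toy data points are assumptions, since the paper's actual D-CPT Law additionally models the general/domain mixture ratio.

```python
# Sketch of fitting a parametric loss law with L-BFGS (assumed form, not the D-CPT Law).
import numpy as np
from scipy.optimize import minimize

def predicted_loss(params, N, D):
    # Chinchilla-style placeholder: E + A / N**alpha + B / D**beta,
    # with A and B parameterized in log space for numerical stability.
    E, logA, logB, alpha, beta = params
    return E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta

def objective(params, N, D, L):
    # Mean-squared error between predicted and observed validation loss.
    return np.mean((predicted_loss(params, N, D) - L) ** 2)

# Toy observations: model sizes N, training tokens D, validation losses L.
N = np.array([1e8, 1e8, 1e9, 1e9])
D = np.array([1e9, 1e10, 1e9, 1e10])
L = np.array([3.2, 2.9, 2.7, 2.4])

init = np.array([1.5, 5.0, 5.0, 0.3, 0.3])
result = minimize(objective, init, args=(N, D, L), method="L-BFGS-B")
print(result.x)  # fitted (E, log A, log B, alpha, beta)
```

Extending this fit to the paper's setting would mean adding mixture-ratio-dependent terms and fitting over the loss curves collected at all 9 ratios.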