Selecting Large Language Model to Fine-tune via Rectified Scaling Law
Authors: Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we fine-tune 30 LLMs on three datasets with a sufficiently wide range of dataset sizes, and illustrate the existence of the phase transition pattern during the scaling of fine-tuning. We demonstrate both theoretically and empirically why Equation (5) fails to fit the results. Based on our theoretical analysis, we introduce the concept of pre-learned data size and establish a well-fitted Scaling Law by incorporating the pre-learned data size into existing laws. (A hedged curve-fitting sketch of the rectified law follows the table.) |
| Researcher Affiliation | Academia | 1 Institute for Artificial Intelligence, Peking University; 2 Peking University; 3 Stanford University; 4 Tsinghua University. |
| Pseudocode | Yes | Algorithm 1 Accept then Stop (AtS). Input: Training subset S_sub, Model M, parameters k, δ. 1: Initialize loss-size pair set P = {}. 2: while True do 3: Fine-tune M on S_sub and get its loss L. 4: if \|P\| ≥ k then 5: Fit a linear regression model f on P. 6: break if Is > δ. 7: end if 8: Add pair {log \|S_sub\|, log L} to P. 9: Sample new S_sub with half size from S_sub. 10: end while Return: Score of M as negative predicted log-loss on S, f(log \|S\|). (A Python sketch of this loop follows the table.) |
| Open Source Code | Yes | The project page is available at rectified-scaling-law.github.io. |
| Open Datasets | Yes | We consider machine translation (WMT19 English-Chinese (En-Zh) (Kocmi et al., 2022)), paragraph summarization (Gigaword (Rush et al., 2015)), and multi-task instruction tuning (FLAN (Wei et al., 2021)) as the downstream fine-tuning tasks. ... In our experiments, we use the FLAN Collection provided by Huggingface, and we choose the no-option split which requires the model to generate a free-form answer. |
| Dataset Splits | Yes | For the sake of consistency of performance estimation in our study, we safely hold out a validation set and always use the average loss over this set as the estimation of L(M) for models fine-tuned on different S_sub. ... Table 5 (statistics of fine-tuning datasets, Train/Valid/Test): FLAN 2,320,656 / 10,000 / 10,000; WMT19 25,982,455 / 3,981 / 3,981; Gigaword 3,795,957 / 8,000 / 8,000. |
| Hardware Specification | Yes | We run most of the experiments on clusters using NVIDIA A100s. |
| Software Dependencies | No | The paper mentions using "PyTorch" and the "Hugging Face library," as well as the "scipy" package. However, it does not specify exact version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Hyper-parameters: learning rate searched over {1e-4, 3e-4, 5e-4, 1e-3} for small models (< 700M) and {3e-5, 5e-5, 1e-4, 3e-4} for large models (> 700M); batch size searched over {64, 128, 256}; training epochs 20 with early stopping (patience = 3); optimizer AdamW; weight decay 0.01; scheduler cosine; warmup ratio 0.03. (A Trainer configuration sketch using these values follows the table.) |
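
The Research Type row contrasts the vanilla power law of Equation (5), L(D) = B / D^β + E, with the paper's rectified law L(D) = B / (D_l + D)^β + E, where D_l is the pre-learned data size. The following is a minimal sketch, assuming those functional forms, of how both laws can be fitted with scipy (the package named in the Software Dependencies row); the synthetic data points are purely illustrative, not results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def vanilla_law(D, B, beta, E):
    # Equation (5)-style law without the pre-learned term: L(D) = B / D^beta + E
    return B / D**beta + E

def rectified_law(D, B, beta, E, D_l):
    # Rectified law with pre-learned data size D_l: L(D) = B / (D_l + D)^beta + E
    return B / (D_l + D)**beta + E

# Illustrative (size, loss) points; in practice these come from fine-tuning a
# model on subsets of increasing size and measuring validation loss.
sizes = np.array([200.0, 800.0, 3_200.0, 12_800.0, 51_200.0, 204_800.0, 819_200.0])
losses = rectified_law(sizes, B=80.0, beta=0.5, E=1.2, D_l=4_000.0)
losses = losses + np.random.default_rng(0).normal(0.0, 0.01, sizes.shape)

# Fit both laws with non-negative parameter bounds; the rectified form can track
# the pre-power phase at small D, where the vanilla form systematically misfits.
p_van, _ = curve_fit(vanilla_law, sizes, losses, p0=[10.0, 0.5, 1.0], bounds=(0.0, np.inf))
p_rec, _ = curve_fit(rectified_law, sizes, losses, p0=[10.0, 0.5, 1.0, 1_000.0], bounds=(0.0, np.inf))
print("vanilla   B, beta, E      :", p_van)
print("rectified B, beta, E, D_l :", p_rec)
```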
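The AtS pseudocode quoted in the Pseudocode row can be read as the Python sketch below. `fine_tune_and_eval` is a hypothetical callback (not from the paper's code) that fine-tunes the model on a subset and returns its validation loss, and the stopping test uses the regression's mean-squared fitting error as a stand-in for the "Is > δ" condition in the quoted pseudocode; treat both as assumptions rather than the authors' implementation.

```python
import math
import numpy as np

def ats_score(model, subset, fine_tune_and_eval, k=4, delta=1e-3):
    """Score `model` as the negative predicted log-loss at the full dataset
    size, extrapolated from a linear fit over (log size, log loss) pairs."""
    pairs = []                                   # P: collected (log |S_sub|, log L) pairs
    full_log_size = math.log(len(subset))
    s_sub = list(subset)
    coeffs = None
    while True:
        loss = fine_tune_and_eval(model, s_sub)  # fine-tune M on S_sub and get its loss L
        if len(pairs) >= k:
            xs = np.array([x for x, _ in pairs])
            ys = np.array([y for _, y in pairs])
            coeffs = np.polyfit(xs, ys, deg=1)   # fit linear regression f on P
            fit_err = float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
            if fit_err > delta:                  # placeholder for the paper's exact test
                break
        pairs.append((math.log(len(s_sub)), math.log(loss)))
        if len(s_sub) < 2:                       # safety stop for this sketch
            break
        s_sub = s_sub[: len(s_sub) // 2]         # halve S_sub (the paper samples a new subset)
    return -float(np.polyval(coeffs, full_log_size)) if coeffs is not None else float("nan")
```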
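For orientation, the hyper-parameters in the Experiment Setup row map onto a Hugging Face Trainer configuration roughly as sketched below. This is not the authors' training script: the `build_trainer` helper, output directory, and the specific learning-rate and batch-size picks from the reported search grids are placeholders, and the paper does not pin transformers versions.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def build_trainer(model, train_subset, valid_set, learning_rate=3e-4, batch_size=64):
    # learning_rate / batch_size are single picks from the reported search grids.
    args = TrainingArguments(
        output_dir="ft-run",
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        num_train_epochs=20,                     # 20 epochs with early stopping (patience = 3)
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        optim="adamw_torch",                     # AdamW
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,             # required for EarlyStoppingCallback
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    )
    return Trainer(
        model=model,                             # one of the candidate LLMs
        args=args,
        train_dataset=train_subset,              # a tokenized fine-tuning subset S_sub
        eval_dataset=valid_set,                  # the held-out validation split
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
```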