Selecting Large Language Model to Fine-tune via Rectified Scaling Law

Authors: Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we fine-tune 30 LLMs on three datasets spanning a sufficiently wide range of dataset sizes, and illustrate the existence of the phase transition pattern during the scaling of fine-tuning. We demonstrate both theoretically and empirically why Equation (5) fails to fit the results. Based on our theoretical analysis, we introduce the concept of pre-learned data size and establish a well-fitted scaling law by incorporating the pre-learned data size into existing laws. (A hedged sketch of fitting such a rectified law appears after this table.)
Researcher Affiliation | Academia | 1 Institute for Artificial Intelligence, Peking University; 2 Peking University; 3 Stanford University; 4 Tsinghua University.
Pseudocode | Yes | Algorithm 1 Accept then Stop (AtS).
  Input: training subset S_sub, model M, parameters k, δ.
  1: Initialize loss-size pair set P = {}.
  2: while True do
  3:   Fine-tune M on S_sub and get its loss L.
  4:   if |P| ≥ k then
  5:     Fit a linear regression model f on P.
  6:     break if ℓ_s > δ.
  7:   end if
  8:   Add the pair (log |S_sub|, log L) to P.
  9:   Sample a new S_sub of half the size from S_sub.
  10: end while
  Return: score of M as the negative predicted log-loss on S, −f(log |S|).
  (A minimal Python sketch of AtS appears after this table.)
Open Source Code | Yes | The project page is available at rectified-scaling-law.github.io.
Open Datasets | Yes | We consider machine translation (WMT19 English-Chinese (En-Zh) (Kocmi et al., 2022)), paragraph summarization (Gigaword (Rush et al., 2015)), and multi-task instruction tuning (FLAN (Wei et al., 2021)) as the downstream fine-tuning tasks. ... In our experiments, we use the FLAN Collection provided by Huggingface, and we choose the no-option split, which requires the model to generate a free-form answer.
Dataset Splits | Yes | For the sake of the consistency of performance estimation in our study, we safely hold out a validation set and always use the average loss over this set as the estimation of L(M) for models fine-tuned on different S_sub. ... Table 5 (statistics of fine-tuning datasets) reports dataset sizes (train/valid/test) of FLAN: 2,320,656 / 10,000 / 10,000; WMT19: 25,982,455 / 3,981 / 3,981; Gigaword: 3,795,957 / 8,000 / 8,000.
Hardware Specification | Yes | We run most of the experiments on clusters using NVIDIA A100s.
Software Dependencies | No | The paper mentions using "PyTorch" and the "Hugging Face library," as well as the "scipy" package. However, it does not specify exact version numbers for these software dependencies, which are required for reproducibility.
Experiment Setup | Yes | Hyper-parameter values: learning rate searched over {1e-4, 3e-4, 5e-4, 1e-3} for small models (< 700M parameters) and over {3e-5, 5e-5, 1e-4, 3e-4} for large models (> 700M); batch size searched over {64, 128, 256}; training epochs: 20 with early stopping (patience = 3); optimizer: AdamW; weight decay: 0.01; scheduler: cosine; warmup ratio: 0.03. (A hedged Hugging Face Trainer configuration sketch appears after this table.)
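
The sketches below elaborate on individual rows of the table above; none of them reproduces the authors' code. First, the "Research Type" row quotes the paper's rectification of the fine-tuning scaling law with a pre-learned data size. A minimal sketch of fitting a law of the form L(D) = B / (D_l + D)^β + E (the rectified form described in the paper, with D_l the pre-learned data size) using scipy, which the paper reports using, might look as follows; the log-space objective, initial guesses, and the (size, loss) measurements are illustrative assumptions.

```python
# Hedged sketch: fitting a rectified scaling law of the assumed form
#   L(D) = B / (D_l + D)^beta + E
# to (dataset size, validation loss) pairs. The parameterization of the fit,
# the log-space objective, and the data below are illustrative assumptions,
# not the authors' code.
import numpy as np
from scipy.optimize import curve_fit

def rectified_law(D, B, D_l, beta, E):
    """Predicted fine-tuning loss at dataset size D."""
    return B / (D_l + D) ** beta + E

# Hypothetical (size, loss) measurements from fine-tuning runs.
sizes = np.array([200, 400, 800, 1600, 3200, 6400, 12800], dtype=float)
losses = np.array([3.10, 3.02, 2.88, 2.65, 2.38, 2.12, 1.91])

# Fit in log-loss space so small losses are not dominated by large ones.
popt, _ = curve_fit(
    lambda D, B, D_l, beta, E: np.log(rectified_law(D, B, D_l, beta, E)),
    sizes,
    np.log(losses),
    p0=[10.0, 1000.0, 0.5, 1.0],   # rough initial guess
    bounds=(1e-6, np.inf),         # keep all parameters positive
)
B, D_l, beta, E = popt
print(f"B={B:.3f}, D_l={D_l:.1f}, beta={beta:.3f}, E={E:.3f}")
```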
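
Second, the "Pseudocode" row quotes Algorithm 1, Accept then Stop (AtS). A self-contained Python sketch of that loop is given below; the `fine_tune_and_eval` callable, the minimum-subset-size guard, the random halving, and the reading of the quoted stopping test ℓ_s > δ as a check on the slope of the fitted line are assumptions, not the authors' implementation.

```python
# Hedged sketch of the Accept-then-Stop (AtS) loop quoted above.
# `fine_tune_and_eval` is a hypothetical callable that fine-tunes model M on a
# subset and returns its validation loss.
import math
import random
import numpy as np

def ats_score(full_set, fine_tune_and_eval, k=3, delta=0.0, min_size=32):
    """Return the AtS score: the negative predicted log-loss at the full dataset size."""
    subset = list(full_set)
    pairs = []  # (log |S_sub|, log L) points, as in the quoted algorithm
    while len(subset) >= min_size:  # size guard is an assumption (the quote loops "while True")
        loss = fine_tune_and_eval(subset)          # fine-tune M on S_sub, get loss L
        if len(pairs) >= k:
            xs = np.array([x for x, _ in pairs])
            ys = np.array([y for _, y in pairs])
            slope, _ = np.polyfit(xs, ys, deg=1)   # linear fit of log-loss vs. log-size
            if slope > delta:                      # reading the quoted "l_s > δ" as a slope check (assumption)
                break
        pairs.append((math.log(len(subset)), math.log(loss)))
        subset = random.sample(subset, len(subset) // 2)  # halve the subset, as in step 9
    xs = np.array([x for x, _ in pairs])
    ys = np.array([y for _, y in pairs])
    slope, intercept = np.polyfit(xs, ys, deg=1)
    # Score: negative predicted log-loss at the full dataset size |S|.
    return -(slope * math.log(len(full_set)) + intercept)

# Toy usage with a synthetic loss curve standing in for real fine-tuning runs.
if __name__ == "__main__":
    toy_data = list(range(4096))
    fake_eval = lambda subset: 2.0 + 8.0 / (200 + len(subset)) ** 0.5
    print(ats_score(toy_data, fake_eval))
```

In the paper's setting, `fine_tune_and_eval` would fine-tune the candidate model on the subset and return its average loss over the held-out validation set; here it is left abstract.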
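
Third, the "Experiment Setup" row lists the reported hyper-parameters. A hedged sketch of how such a configuration could be expressed with the Hugging Face Trainer (the paper mentions PyTorch and the Hugging Face library, but not this exact wiring) is shown below; the checkpoint name, output path, and dataset placeholders are assumptions.

```python
# Hedged sketch: mapping the reported hyper-parameters onto Hugging Face
# TrainingArguments. The checkpoint, output_dir, and datasets are placeholders;
# the mapping itself is an assumption, not the authors' configuration.
from transformers import (
    AutoModelForCausalLM,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "gpt2"                       # placeholder checkpoint (< 700M parameters)
LR_GRID_SMALL = [1e-4, 3e-4, 5e-4, 1e-3]  # searched for models < 700M
LR_GRID_LARGE = [3e-5, 5e-5, 1e-4, 3e-4]  # searched for models > 700M
BATCH_SIZE_GRID = [64, 128, 256]          # searched

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

args = TrainingArguments(
    output_dir="./ft-run",                          # placeholder path
    learning_rate=LR_GRID_SMALL[0],                 # one point of the reported search grid
    per_device_train_batch_size=BATCH_SIZE_GRID[0],
    num_train_epochs=20,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",                            # AdamW optimizer
    evaluation_strategy="epoch",                    # `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,                    # needed for early stopping on eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Tokenized train/validation datasets would be supplied here; they are omitted
# placeholders in this sketch.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=None,
    eval_dataset=None,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
# trainer.train()  # run once real tokenized datasets are attached
```

In an actual run, each (learning rate, batch size) combination from the reported grids would be tried and the configuration with the best validation loss kept.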