Understanding Emergent Abilities of Language Models from the Loss Perspective
Authors: Zhengxiao Du, Aohan Zeng, Yuxiao Dong, Jie Tang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose to study emergent abilities from the perspective of pre-training loss instead of model size or training compute. To examine the relationship between the pre-training loss of LMs and their performance, we pre-train more than 30 LMs of varied model and data sizes from scratch, using a fixed data corpus, tokenization, and model architecture. Their downstream performance is evaluated on 12 diverse datasets covering different tasks, languages, prompting types, and answer forms. We demonstrate that the pre-training loss of an LM is predictive of its performance on downstream tasks, regardless of its model size or data size. (A toy sketch of this loss-versus-performance analysis appears after the table.) |
| Researcher Affiliation | Collaboration | Zhengxiao Du (1,2), Aohan Zeng (1,2), Yuxiao Dong (2), Jie Tang (2); (1) Zhipu AI, (2) Tsinghua University; {zx-du20,zah22}@mails.tsinghua.edu.cn |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The training code is Megatron-LM. The pre-training dataset cannot be released since it contains proprietary content. |
| Open Datasets | Yes | Our pre-training corpus is a mixture of English and Chinese documents. The ratio of English tokens to Chinese tokens in the pre-training corpus is 4:1. Both the English and Chinese corpora consist of webpages, Wikipedia, books, and papers. The distribution of different sources in the English corpus is shown in Table 3. The distribution and processing pipeline are similar to RedPajama [13]. |
| Dataset Splits | Yes | For Chinese datasets, we use the validation split when the ground-truth labels are available. For CLUEWSC, the size of the validation set is too small (100), so we combine the train and validation splits. |
| Hardware Specification | Yes | All the models are trained on DGX-A100 (8×80 GB) GPU servers. The 1.5B, 6B, and 32B models in Section 2.3 take 8 days on 256 A100 GPUs, 8 days on 1024 A100 GPUs, and 20 days on 2048 A100 GPUs, respectively. |
| Software Dependencies | No | We tokenize the data with the byte pair encoding (BPE) algorithm [47] in the SentencePiece package [30]. The optimizer is AdamW [35] with β1 = 0.9 and β2 = 0.95. |
| Experiment Setup | Yes | The hyperparameters for training the 1.5B, 6B, and 32B models are shown in Table 4 (Appendix), and those for the smaller models in Table 5 (Appendix). The sequence length is 2048 and the optimizer is AdamW [35] with β1 = 0.9 and β2 = 0.95. (A minimal configuration sketch follows the table.) |
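
To make the Research Type row concrete, here is a toy sketch, not the authors' code, of the paper's core analysis: checking whether pre-training loss alone predicts downstream accuracy across models of different sizes. The loss and accuracy values below are invented for illustration; the paper fits this relationship over its 30+ pre-trained models and 12 evaluation datasets.

```python
# Toy illustration (invented numbers) of the paper's central claim:
# pre-training loss, not parameter count, predicts downstream accuracy.
import numpy as np

# Hypothetical (pre-training loss, downstream accuracy) pairs for
# checkpoints of models with different parameter counts.
pretrain_loss = np.array([2.8, 2.6, 2.4, 2.2, 2.0])
accuracy = np.array([0.26, 0.28, 0.35, 0.48, 0.61])

# Pearson correlation; a strongly negative value means lower loss
# tracks higher downstream performance regardless of model size.
r = np.corrcoef(pretrain_loss, accuracy)[0, 1]
print(f"correlation(loss, accuracy) = {r:.2f}")
```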
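
The optimizer and tokenizer details quoted in the Software Dependencies and Experiment Setup rows translate into a short configuration sketch. This is an assumption-laden illustration using PyTorch and SentencePiece, not the released Megatron-LM setup: the learning rate, vocabulary size, corpus path, and stand-in model are placeholders, while the β values and sequence length come from the paper.

```python
# Sketch of the quoted training settings; everything marked "placeholder"
# is an assumption, not a value from the paper.
import sentencepiece as spm
import torch

# BPE tokenizer trained with the SentencePiece package, as stated above;
# the corpus path and vocabulary size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", model_type="bpe", vocab_size=64000
)
tokenizer = spm.SentencePieceProcessor(model_file="bpe.model")

SEQ_LEN = 2048  # sequence length reported in the paper

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # placeholder; per-model rates are listed in Tables 4-5
    betas=(0.9, 0.95),  # β1 = 0.9, β2 = 0.95 as reported in the paper
)
```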