Predicting Emergent Abilities with Infinite Resolution Evaluation
Authors: Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. Firstly, a strict task scaling law, which is not conventionally known to exist, is identified, enhancing the predictability of task performances. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science and Technology, Tsinghua University; (2) Beijing Language and Culture University; (3) Shanghai Artificial Intelligence Laboratory; (4) Renmin University of China; (5) Zhihu Inc.; (6) Modelbest Inc. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | Yes | We will open-source all evaluation scripts for reference. |
| Open Datasets | Yes | We select HumanEval (Chen et al., 2021), Emoji Movie, and Date Understanding (Srivastava et al., 2022) as the evaluation tasks. |
| Dataset Splits | No | The paper describes the selection of test instances and few-shot contexts for the evaluation tasks, and mentions pre-training corpora such as StarCoder and the Pile, but does not explicitly provide training/validation/test dataset splits with percentages or counts for the models being trained. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper refers to general software like 'Transformer-based language models' and tools like 'NLTK' and 'GPT-4', but does not provide specific version numbers for software dependencies needed for reproducibility. |
| Experiment Setup | Yes | The maximum learning rate is consistently fixed at 0.01 across varying model scales, with no significant loss explosion at this rate. This stability is potentially attributed to our normalization strategies (Yang et al., 2022) and increased batch size across scales. Echoing findings from Hoffmann et al. (2022), we ascertain that for training LLMs up to a specific end step, the optimal cycle length of the cosine learning rate scheduler is equivalent to the end step. |
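The Experiment Setup row above pins down two concrete choices: a maximum learning rate fixed at 0.01 and a cosine learning rate scheduler whose optimal cycle length equals the training end step. The snippet below is a minimal sketch of such a single-cycle schedule; the function name, the optional linear warmup, and the minimum-LR floor are illustrative assumptions, not details reported in the paper.

```python
import math

def cosine_lr(step: int, end_step: int, max_lr: float = 0.01,
              min_lr: float = 0.0, warmup_steps: int = 0) -> float:
    """Single-cycle cosine schedule: the cycle length equals end_step,
    matching the reported finding that the optimal cosine cycle length
    is the intended end step of training."""
    if warmup_steps and step < warmup_steps:
        # Linear warmup (an assumption; the table row does not describe warmup).
        return max_lr * step / warmup_steps
    progress = min((step - warmup_steps) / max(1, end_step - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: a 10,000-step run decays from 0.01 at step 0 to ~0 at step 10,000.
# cosine_lr(0, 10_000) == 0.01; cosine_lr(5_000, 10_000) == 0.005; cosine_lr(10_000, 10_000) == 0.0
```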