Predicting Emergent Abilities with Infinite Resolution Evaluation

Authors: Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. Firstly, a strict task scaling law that is not conventionally known to exist is identified, enhancing the predictability of task performance. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science and Technology, Tsinghua University; 2 Beijing Language and Culture University; 3 Shanghai Artificial Intelligence Laboratory; 4 Renmin University of China; 5 Zhihu Inc.; 6 Modelbest Inc. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | Yes | We will open-source all evaluation scripts for reference. |
| Open Datasets | Yes | We select HumanEval (Chen et al., 2021), Emoji Movie, and Date Understanding (Srivastava et al., 2022) as the evaluation tasks. |
| Dataset Splits | No | The paper describes the selection of test instances and few-shot contexts for the evaluation tasks, and mentions pre-training corpora such as StarCoder and the Pile, but does not explicitly provide training/validation/test splits with percentages or counts for the models being trained. |
| Hardware Specification | No | The paper does not explicitly state the hardware used to run its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper refers to general software such as Transformer-based language models and tools such as NLTK and GPT-4, but does not provide specific version numbers for the software dependencies needed for reproducibility. |
| Experiment Setup | Yes | The maximum learning rate is consistently fixed at 0.01 across varying model scales, with no significant loss explosion at this rate. This stability is potentially attributed to our normalization strategies (Yang et al., 2022) and increased batch size across scales. Echoing findings from Hoffmann et al. (2022), we ascertain that for training LLMs up to a specific end step, the optimal cycle length of the cosine learning rate scheduler is equivalent to the end step. |
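The Experiment Setup row describes a cosine learning-rate schedule whose cycle length equals the training end step, with a maximum learning rate of 0.01. A minimal sketch of such a schedule is below; the 0.01 peak comes from the paper, while `total_steps`, `min_lr`, and the optional linear warmup are illustrative assumptions, not values the paper reports.

```python
import math

def cosine_lr(step, total_steps, max_lr=0.01, min_lr=0.001, warmup=0):
    """Cosine learning-rate schedule whose single cycle spans the full
    training run (cycle length == end step, per the paper's finding).

    max_lr=0.01 matches the paper; min_lr and warmup are assumed here.
    """
    if step < warmup:  # optional linear warmup (assumption)
        return max_lr * (step + 1) / warmup
    # Fraction of the post-warmup run completed, in [0, 1].
    progress = (step - warmup) / max(total_steps - warmup, 1)
    # Cosine decay from max_lr at progress=0 down to min_lr at progress=1.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Because the cycle length equals the end step, the schedule reaches its minimum exactly at the final training step rather than cycling back up, which is the configuration the paper (echoing Hoffmann et al., 2022) found optimal.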