Predicting Emergent Abilities with Infinite Resolution Evaluation
Authors: Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. Firstly, a strict task scaling law, which is not conventionally known to exist, is identified, enhancing the predictability of task performances. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science and Technology, Tsinghua University; (2) Beijing Language and Culture University; (3) Shanghai Artificial Intelligence Laboratory; (4) Renmin University of China; (5) Zhihu Inc.; (6) Modelbest Inc. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | Yes | We will open-source all evaluation scripts for reference. |
| Open Datasets | Yes | We select HumanEval (Chen et al., 2021), Emoji Movie, and Date Understanding (Srivastava et al., 2022) as the evaluation tasks. |
| Dataset Splits | No | The paper describes the selection of test instances and few-shot contexts for the evaluation tasks, and mentions pre-training corpora such as StarCoder and the Pile, but does not explicitly provide training/validation/test dataset splits with percentages or counts for the models being trained. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper refers to general software like 'Transformer-based language models' and tools like 'NLTK' and 'GPT-4', but does not provide specific version numbers for software dependencies needed for reproducibility. |
| Experiment Setup | Yes | The maximum learning rate is consistently fixed at 0.01 across varying model scales, with no significant loss explosion at this rate. This stability is potentially attributed to our normalization strategies (Yang et al., 2022) and increased batch size across scales. Echoing findings from Hoffmann et al. (2022), we ascertain that for training LLMs up to a specific end step, the optimal cycle length of the cosine learning rate scheduler is equivalent to the end step. |
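The Experiment Setup row above pins down two concrete choices: a maximum learning rate fixed at 0.01 and a cosine learning rate scheduler whose optimal cycle length equals the training end step. The snippet below is a minimal sketch of such a single-cycle schedule; the function name, the optional linear warmup, and the minimum-LR floor are illustrative assumptions, not details reported in the paper.

```python
import math

def cosine_lr(step: int, end_step: int, max_lr: float = 0.01,
              min_lr: float = 0.0, warmup_steps: int = 0) -> float:
    """Single-cycle cosine schedule: the cycle length equals end_step,
    matching the reported finding that the optimal cosine cycle length
    is the intended end step of training."""
    if warmup_steps and step < warmup_steps:
        # Linear warmup (an assumption; the table row does not describe warmup).
        return max_lr * step / warmup_steps
    progress = min((step - warmup_steps) / max(1, end_step - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: a 10,000-step run decays from 0.01 at step 0 to ~0 at step 10,000.
# cosine_lr(0, 10_000) == 0.01; cosine_lr(5_000, 10_000) == 0.005; cosine_lr(10_000, 10_000) == 0.0
```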