Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models
Authors: Tung-Yu Wu, Melody Lo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate the phenomenon by grouping questions based on difficulty level and provide a possible explanation for emergent abilities. Specifically, we observe U-shaped scaling for hard questions and inverted-U scaling followed by steady improvement for easy questions. ... Experimental results on three iconic datasets show its effectiveness. |
| Researcher Affiliation | Academia | Tung-Yu Wu, Melody Lo National Taiwan University EMAIL |
| Pseudocode | No | The paper describes a "Slice-and-Sandwich pipeline" but does not present it in a formal pseudocode or algorithm block. The steps are described in narrative text and illustrated with figures showing the process flow, but not in a structured code-like format. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/tony10101105/Exp Emergence. |
| Open Datasets | Yes | Fig. 1 shows the evaluation result of 56 LLMs with diverse training compute on the MMLU benchmark, whose 14042 questions are clustered into 10 groups based on their difficulty levels... Fig. 2: The accuracy, TC Brier Score, U-Shaped and inverted-U scaling on the Persian-QA dataset in BIG-bench (Srivastava et al., 2023). ... Fig. 3: The accuracy, TC Brier Score, U-shaped and inverted-U scaling on the arithmetic dataset in BIG-bench (Srivastava et al., 2023). |
| Dataset Splits | No | The paper discusses splitting models into a 'training set' (models smaller than the emergence threshold T) and a 'testing set' (larger models) for fitting the scaling trends of their proposed Slice-and-Sandwich pipeline. However, it does not provide explicit train/test/validation splits for the MMLU, Persian-QA, or arithmetic datasets themselves, which are used for evaluation. |
| Hardware Specification | Yes | The evaluation time of each task varies from several hours to several days on 2 NVIDIA RTX A6000 |
| Software Dependencies | No | The paper mentions using "LM Evaluation Harness (Gao et al., 2024)" and "FP16 precision". However, specific version numbers for the LM Evaluation Harness or any other software libraries are not provided. |
| Experiment Setup | Yes | We use T = 1.5, 1.8, and 2.3 as the emergence threshold for the MMLU, arithmetic, and Persian-QA dataset, respectively. ... We adopt 5-shot inference on the MMLU benchmark, 1-shot inference on the ARC and HellaSwag dataset, and 2-shot inference on Persian-QA, arithmetic, Hindu knowledge, conceptual combinations, analogical similarity, and abstract narrative understanding datasets. ... we adopt the polynomial order=5 and 2 for the easy and hard question groups, respectively. |
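The setup quoted above (an emergence threshold T splitting models into pre- and post-emergence sets, with order-5 and order-2 polynomial fits for the easy and hard question groups) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the paper's Slice-and-Sandwich implementation: the accuracy curves, the log-compute axis, and the variable names are all made up here; only T = 1.5 and the polynomial orders come from the quoted text.

```python
import numpy as np

# Synthetic per-group accuracies for 30 hypothetical models,
# indexed by a made-up log-training-compute axis.
rng = np.random.default_rng(0)
log_compute = np.linspace(0.5, 3.0, 30)
easy_acc = 0.3 + 0.2 * np.tanh(log_compute - 1.0) + 0.02 * rng.standard_normal(30)
hard_acc = (0.1 + 0.05 * (log_compute - 1.5) ** 2 * (log_compute > 1.5)
            + 0.02 * rng.standard_normal(30))

# Split models at the emergence threshold: pre-threshold models form the
# "training set" for fitting scaling trends; larger models are held out.
T = 1.5
train = log_compute < T
test = ~train

# Fit the per-group trends with the quoted polynomial orders
# (5 for the easy group, 2 for the hard group).
easy_fit = np.polynomial.Polynomial.fit(log_compute[train], easy_acc[train], deg=5)
hard_fit = np.polynomial.Polynomial.fit(log_compute[train], hard_acc[train], deg=2)

# Forecast the held-out, post-threshold models from pre-threshold trends.
easy_pred = easy_fit(log_compute[test])
hard_pred = hard_fit(log_compute[test])
print(f"forecast MAE (easy): {np.mean(np.abs(easy_pred - easy_acc[test])):.3f}")
print(f"forecast MAE (hard): {np.mean(np.abs(hard_pred - hard_acc[test])):.3f}")
```

The design point being illustrated: only models below T contribute to the fit, so any forecast of post-emergence performance is an extrapolation of pre-emergence trends, which is what makes the split in the "Dataset Splits" row a model-level split rather than a question-level one.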