Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models
Authors: Tung-Yu Wu, Melody Lo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate the phenomenon by grouping questions based on difficulty level and provide a possible explanation for emergent abilities. Specifically, we observe U-shaped scaling for hard questions and inverted-U scaling followed by steady improvement for easy questions. ... Experimental results on three iconic datasets show its effectiveness. |
| Researcher Affiliation | Academia | Tung-Yu Wu, Melody Lo National Taiwan University EMAIL |
| Pseudocode | No | The paper describes a "Slice-and-Sandwich pipeline" but does not present it in a formal pseudocode or algorithm block. The steps are described in narrative text and illustrated with figures showing the process flow, but not in a structured code-like format. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/tony10101105/Exp Emergence. |
| Open Datasets | Yes | Fig. 1 shows the evaluation result of 56 LLMs with diverse training compute on the MMLU benchmark, whose 14042 questions are clustered into 10 groups based on their difficulty levels... Fig. 2: The accuracy, TC Brier Score, U-Shaped and inverted-U scaling on the Persian-QA dataset in BIG-bench (Srivastava et al., 2023). ... Fig. 3: The accuracy, TC Brier Score, U-shaped and inverted-U scaling on the arithmetic dataset in BIG-bench (Srivastava et al., 2023). |
| Dataset Splits | No | The paper discusses splitting models into a 'training set' (models smaller than the emergence threshold T) and a 'testing set' (larger models) for fitting the scaling trends of their proposed Slice-and-Sandwich pipeline. However, it does not provide explicit train/test/validation splits for the MMLU, Persian-QA, or arithmetic datasets themselves, which are used for evaluation. |
| Hardware Specification | Yes | The evaluation time of each task varies from several hours to several days on 2 NVIDIA RTX A6000 |
| Software Dependencies | No | The paper mentions using "LM Evaluation Harness (Gao et al., 2024)" and "FP16 precision". However, specific version numbers for the LM Evaluation Harness or any other software libraries are not provided. |
| Experiment Setup | Yes | We use T = 1.5, 1.8, and 2.3 as the emergence threshold for the MMLU, arithmetic, and Persian-QA dataset, respectively. ... We adopt 5-shot inference on the MMLU benchmark, 1-shot inference on the ARC and HellaSwag dataset, and 2-shot inference on Persian-QA, arithmetic, Hindu knowledge, conceptual combinations, analogical similarity, and abstract narrative understanding datasets. ... we adopt the polynomial order=5 and 2 for the easy and hard question groups, respectively. |
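The setup quoted above (an emergence threshold T splitting models into pre- and post-emergence sets, with order-5 and order-2 polynomial fits for the easy and hard question groups) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the paper's Slice-and-Sandwich implementation: the accuracy curves, the log-compute axis, and the variable names are all made up here; only T = 1.5 and the polynomial orders come from the quoted text.

```python
import numpy as np

# Synthetic per-group accuracies for 30 hypothetical models,
# indexed by a made-up log-training-compute axis.
rng = np.random.default_rng(0)
log_compute = np.linspace(0.5, 3.0, 30)
easy_acc = 0.3 + 0.2 * np.tanh(log_compute - 1.0) + 0.02 * rng.standard_normal(30)
hard_acc = (0.1 + 0.05 * (log_compute - 1.5) ** 2 * (log_compute > 1.5)
            + 0.02 * rng.standard_normal(30))

# Split models at the emergence threshold: pre-threshold models form the
# "training set" for fitting scaling trends; larger models are held out.
T = 1.5
train = log_compute < T
test = ~train

# Fit the per-group trends with the quoted polynomial orders
# (5 for the easy group, 2 for the hard group).
easy_fit = np.polynomial.Polynomial.fit(log_compute[train], easy_acc[train], deg=5)
hard_fit = np.polynomial.Polynomial.fit(log_compute[train], hard_acc[train], deg=2)

# Forecast the held-out, post-threshold models from pre-threshold trends.
easy_pred = easy_fit(log_compute[test])
hard_pred = hard_fit(log_compute[test])
print(f"forecast MAE (easy): {np.mean(np.abs(easy_pred - easy_acc[test])):.3f}")
print(f"forecast MAE (hard): {np.mean(np.abs(hard_pred - hard_acc[test])):.3f}")
```

The design point being illustrated: only models below T contribute to the fit, so any forecast of post-emergence performance is an extrapolation of pre-emergence trends, which is what makes the split in the "Dataset Splits" row a model-level split rather than a question-level one.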