Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Are Emergent Abilities of Large Language Models a Mirage?
Authors: Rylan Schaeffer, Brando Miranda, Sanmi Koyejo
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on the Beyond the Imitation Game Benchmark (BIG-Bench); and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep network architectures. |
| Researcher Affiliation | Academia | Rylan Schaeffer, Computer Science, Stanford University; Brando Miranda, Computer Science, Stanford University; Sanmi Koyejo, Computer Science, Stanford University |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement or a direct link to the source code for the methodology it describes. |
| Open Datasets | Yes | We first induce an emergent ability to reconstruct images in shallow (i.e., single hidden layer) nonlinear autoencoders trained on CIFAR100 natural images [21]. ... Omniglot handwritten characters [22]. ... LeNet convolutional neural network family [24], trained on the MNIST handwritten digits dataset [23]. |
| Dataset Splits | No | The paper mentions using well-known datasets and performing experiments, but it does not provide specific details about train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for its own experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or tools used in the experiments. |
| Experiment Setup | Yes | To test these predictions, we collected outputs from the InstructGPT/GPT-3 family on two tasks: 2-shot multiplication between two 2-digit integers and 2-shot addition between two 4-digit integers. ... Target string length: 1, 2, 3, 4, 5; temperature: 0.0, 1.0 (from the captions of Figures 3 and 4) |
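
The Research Type entry refers to a simple mathematical model of how metric choice affects apparent emergence. The following is a minimal sketch of that idea, not the authors' code. It assumes a hypothetical smooth scaling curve for per-token accuracy (the function `per_token_accuracy` and its constants `c` and `alpha` are illustrative choices, not values from the paper) and compares a nonlinear metric (exact match over the whole target string) with a linear one (expected number of wrong tokens). The target string length of 5 matches the largest value listed in the Experiment Setup row.

```python
import math


def per_token_accuracy(num_params, c=1e9, alpha=0.5):
    """Hypothetical smooth scaling curve (an assumption for illustration):
    p = exp(-(N / c) ** -alpha), which rises smoothly with model size N."""
    return math.exp(-((num_params / c) ** (-alpha)))


def exact_match_accuracy(p, target_len):
    """Nonlinear metric: every one of target_len tokens must be correct,
    assuming tokens are independently correct with probability p."""
    return p ** target_len


def expected_token_errors(p, target_len):
    """Linear metric: expected number of incorrect tokens in the target."""
    return target_len * (1.0 - p)


def sweep(target_len=5, sizes=(1e8, 1e9, 1e10, 1e11)):
    """Evaluate both metrics over a range of (hypothetical) model sizes."""
    rows = []
    for n in sizes:
        p = per_token_accuracy(n)
        rows.append(
            (n, p, exact_match_accuracy(p, target_len), expected_token_errors(p, target_len))
        )
    return rows


if __name__ == "__main__":
    for n, p, em, err in sweep():
        print(f"N={n:.0e}  per_token={p:.3f}  exact_match={em:.3f}  token_errors={err:.3f}")
```

Under these assumptions, per-token accuracy and expected token errors change gradually with model size, while exact-match accuracy stays near zero for the smaller sizes and then rises steeply. The underlying per-token behavior is the same in both cases; only the metric differs.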