Are Emergent Abilities of Large Language Models a Mirage?

Authors: Rylan Schaeffer, Brando Miranda, Sanmi Koyejo

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on the Beyond the Imitation Game Benchmark (BIG-Bench); and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep network architectures.
Researcher Affiliation | Academia | Rylan Schaeffer, Computer Science, Stanford University (rschaef@cs.stanford.edu); Brando Miranda, Computer Science, Stanford University (brando9@cs.stanford.edu); Sanmi Koyejo, Computer Science, Stanford University (sanmi@cs.stanford.edu)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement of, or a direct link to, the source code for the methodology it describes.
Open Datasets | Yes | We first induce an emergent ability to reconstruct images in shallow (i.e., single hidden layer) nonlinear autoencoders trained on CIFAR100 natural images [21]. ... Omniglot handwritten characters [22]. ... LeNet convolutional neural network family [24], trained on the MNIST handwritten digits dataset [23].
Dataset Splits | No | The paper mentions using well-known datasets and performing experiments, but it does not provide specific details about train/validation/test splits (e.g., percentages, sample counts, or explicit splitting methodology) for its own experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or tools used in the experiments.
Experiment Setup | Yes | To test these predictions, we collected outputs from the InstructGPT/GPT-3 family on two tasks: 2-shot multiplication between two 2-digit integers and 2-shot addition between two 4-digit integers. ... The sweeps cover target string lengths 1 through 5 and sampling temperatures 0.0 and 1.0 (from the Figure 3 and 4 captions).
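The paper's central claim, summarized in the rows above, is that apparent emergence follows from discontinuous metrics applied to smoothly improving models. A minimal sketch of that idea is below; the power-law constants, function names, and the example arithmetic outputs are illustrative assumptions, not values or responses taken from the paper.

```python
import math

def per_token_accuracy(n_params: float, c: float = 1e7, alpha: float = 0.5) -> float:
    """Per-token accuracy rising smoothly with scale, assuming cross-entropy
    loss follows a power law in parameter count (constants are illustrative)."""
    return math.exp(-((c / n_params) ** alpha))

def exact_match_accuracy(n_params: float, target_len: int) -> float:
    """Probability that all target_len tokens are correct at once; this
    all-or-nothing metric makes smooth per-token gains look like a sudden jump."""
    return per_token_accuracy(n_params) ** target_len

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
            prev = cur
    return dp[-1]

# Hypothetical model outputs on 2-digit multiplication (not actual GPT-3 runs):
targets = ["408", "2408", "1806"]   # 24*17, 56*43, 43*42
preds   = ["408", "2418", "1906"]   # one exact hit, two near-misses

exact = sum(p == t for p, t in zip(preds, targets)) / len(targets)
soft = sum(1 - edit_distance(p, t) / max(len(p), len(t))
           for p, t in zip(preds, targets)) / len(targets)

# The continuous metric credits near-misses that exact match scores as zero.
print(f"exact match: {exact:.2f}, edit-distance credit: {soft:.2f}")
```

Under these assumptions, `exact_match_accuracy` falls off geometrically with target length even though `per_token_accuracy` changes smoothly, which mirrors the predicted effect of sweeping target string length in the arithmetic experiments.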