Are Emergent Abilities of Large Language Models a Mirage?
Authors: Rylan Schaeffer, Brando Miranda, Sanmi Koyejo
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on the Beyond the Imitation Game Benchmark (BIG-Bench); and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep network architectures. |
| Researcher Affiliation | Academia | Rylan Schaeffer Computer Science Stanford University rschaef@cs.stanford.edu Brando Miranda Computer Science Stanford University brando9@cs.stanford.edu Sanmi Koyejo Computer Science Stanford University sanmi@cs.stanford.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement or a direct link to the source code for the methodology it describes. |
| Open Datasets | Yes | We first induce an emergent ability to reconstruct images in shallow (i.e., single hidden layer) nonlinear autoencoders trained on CIFAR100 natural images [21]. ... Omniglot handwritten characters [22]. ... LeNet convolutional neural network family [24], trained on the MNIST handwritten digits dataset [23]. |
| Dataset Splits | No | The paper mentions using well-known datasets and performing experiments, but it does not provide specific details about train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for its own experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or tools used in the experiments. |
| Experiment Setup | Yes | To test these predictions, we collected outputs from the InstructGPT/GPT-3 family on two tasks: 2-shot multiplication between two 2-digit integers and 2-shot addition between two 4-digit integers. ... Target string lengths 1–5; temperatures 0.0 and 1.0 (from Figure 3 and 4 captions). |
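The paper's central claim, that apparent emergence can be an artifact of a nonlinear metric applied to smoothly improving per-token performance, can be sketched numerically. The snippet below is illustrative only: the per-token accuracy values are made up, not taken from the paper, but the arithmetic mirrors its argument that exact string match on an L-token target scores roughly p^L when per-token accuracy is p, which looks like a sharp jump even though p grows smoothly.

```python
# Illustrative sketch of the paper's metric-choice argument.
# Hypothetical per-token accuracies for models of increasing scale
# (values are invented for illustration, not measured).
per_token = [0.30, 0.45, 0.60, 0.75, 0.90, 0.98]

L = 10  # target length in tokens (illustrative)

# A "nonlinear" metric (exact string match) requires all L tokens correct,
# so its score is approximately p**L under token independence.
exact_match = [p ** L for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.6f}")

# Per-token accuracy improves smoothly across scales, while exact match
# stays near zero and then rises abruptly at the largest scales --
# the "emergent ability" pattern, produced purely by the metric.
```

Running this shows exact match below 0.001 for the first three (hypothetical) scales and above 0.3 only at the last two, even though the underlying per-token accuracy improves in even steps.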