The Transient Nature of Emergent In-Context Learning in Transformers

Authors: Aaditya Singh, Stephanie Chan, Ted Moskovitz, Erin Grant, Andrew Saxe, Felix Hill

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to overtrain transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.
Researcher Affiliation | Collaboration | Aaditya K. Singh (Gatsby Unit, UCL); Stephanie C.Y. Chan (Google DeepMind); Ted Moskovitz (Gatsby Unit, UCL); Erin Grant (Gatsby Unit & SWC, UCL); Andrew M. Saxe (Gatsby Unit & SWC, UCL); Felix Hill (Google DeepMind)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm', nor any code-like formatted procedures.
Open Source Code | Yes | Source code can be found at github.com/google-deepmind/emergent_in_context_learning and github.com/aadityasingh/icl-transience.
Open Datasets | Yes | Our data generators are primarily built from the Omniglot dataset, a standard benchmark for few-shot learning, which consists of images of handwritten characters from different alphabets [23, MIT License].
Dataset Splits | No | The paper distinguishes between 'Training sequences' and 'Evaluation sequences' (ICL evaluation, IWL evaluation), but does not explicitly describe a validation split with percentages, counts, or a specific split methodology for hyperparameter tuning separate from the test/evaluation sets.
Hardware Specification | Yes | Experiments were run for up to 5e7 training steps with batch size 32, on 16 TPU v2 or v3 cores.
Software Dependencies | No | The paper mentions using the FAISS library for clustering and the Adam optimizer, but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | The default settings consist of 12 layers, with an embedding dimension of 64 and an additive, sinusoidal positional encoding scheme [5]... Experiments were run for up to 5e7 training steps with batch size 32... They were trained using Adam [25] (with default parameters of β1 = 0.9, β2 = 0.999) and a learning rate schedule with a linear warmup up to a maximum learning rate of 3e-4 at 4,000 steps, followed by an inverse square root decay. L2 regularization was implemented by adding the squared weights of the model (excluding batch norm parameters) to the loss term. All experiments were run with 2 seeds each.
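The Research Type, Open Datasets, and Dataset Splits rows above describe Omniglot-based synthetic sequences on which either an in-context or an in-weights strategy can succeed, evaluated separately for ICL and IWL. The sketch below is a minimal, hypothetical Python/NumPy reconstruction of that kind of data generator, not the authors' exact code: the class count, context length, burst size, and the sample_exemplar helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 1600    # assumption: Omniglot-scale number of character classes
CONTEXT_PAIRS = 8     # assumption: (image, label) pairs in context before the query

def sample_exemplar(cls):
    """Placeholder for drawing an Omniglot image of class `cls`."""
    return f"img_of_class_{cls}"  # stand-in for an image tensor

def bursty_training_sequence():
    """Training sequence: the query class recurs in context, so either copying
    the in-context label (ICL) or memorizing the class-to-label map (IWL)
    yields the correct prediction."""
    query_cls = rng.integers(NUM_CLASSES)
    context_classes = [query_cls] * 3  # burst: query class appears several times
    context_classes += list(rng.integers(NUM_CLASSES, size=CONTEXT_PAIRS - 3))
    rng.shuffle(context_classes)
    context = [(sample_exemplar(c), c) for c in context_classes]
    return context, sample_exemplar(query_cls), query_cls

def icl_eval_sequence(holdout_classes):
    """ICL evaluation: held-out classes with fresh random labels (2-way),
    so only the context, not the weights, can supply the answer."""
    a, b = rng.choice(holdout_classes, size=2, replace=False)
    labels = {a: 0, b: 1}
    context_classes = [a, b] * (CONTEXT_PAIRS // 2)
    rng.shuffle(context_classes)
    context = [(sample_exemplar(c), labels[c]) for c in context_classes]
    query_cls = rng.choice([a, b])
    return context, sample_exemplar(query_cls), labels[query_cls]

def iwl_eval_sequence():
    """IWL evaluation: the query class never appears in context, so the model
    must rely on class information stored in its weights."""
    query_cls = rng.integers(NUM_CLASSES)
    others = [c for c in rng.integers(NUM_CLASSES, size=CONTEXT_PAIRS + 4)
              if c != query_cls][:CONTEXT_PAIRS]
    context = [(sample_exemplar(c), c) for c in others]
    return context, sample_exemplar(query_cls), query_cls
```

The point of this construction is that the training distribution rewards both strategies at once, which is what lets the paper track ICL emerging first and later being supplanted by IWL.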
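The Experiment Setup row specifies the optimizer, the warmup-plus-inverse-square-root learning-rate schedule, and an L2 penalty added directly to the loss. Below is a minimal sketch of those two pieces using JAX and optax: the peak learning rate (3e-4), warmup length (4,000 steps), and Adam betas come from the quoted text, while L2_COEFF is an illustrative placeholder and the exclusion of normalization parameters from the penalty is omitted for brevity.

```python
import jax
import jax.numpy as jnp
import optax

PEAK_LR = 3e-4        # from the quoted setup
WARMUP_STEPS = 4_000  # from the quoted setup
L2_COEFF = 1e-4       # placeholder: not a value reported on this page

def lr_schedule(step):
    """Linear warmup to PEAK_LR at WARMUP_STEPS, then inverse-sqrt decay."""
    step = jnp.maximum(step, 1)
    warmup = PEAK_LR * step / WARMUP_STEPS
    decay = PEAK_LR * jnp.sqrt(WARMUP_STEPS / step)
    return jnp.where(step < WARMUP_STEPS, warmup, decay)

# Adam with the stated betas; optax accepts a callable schedule as the learning rate.
optimizer = optax.adam(learning_rate=lr_schedule, b1=0.9, b2=0.999)

def l2_penalty(params):
    """Sum of squared weights. The paper excludes normalization parameters;
    that filtering step is left out of this sketch."""
    return sum(jnp.sum(jnp.square(w)) for w in jax.tree_util.tree_leaves(params))

def total_loss(params, batch, task_loss_fn):
    """Task loss plus the L2 term added to the loss, as described above."""
    return task_loss_fn(params, batch) + L2_COEFF * l2_penalty(params)
```

Adding the penalty to the loss (rather than using decoupled weight decay) matches the description in the setup row; the schedule is continuous at the warmup boundary, where both branches equal PEAK_LR.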