Data Distributional Properties Drive Emergent In-Context Learning in Transformers

Authors: Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, Felix Hill

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we experimentally manipulated the distributional properties of the training data and measured the effects on in-context few-shot learning. We performed our experiments over data sequences sampled from a standard image-based few-shot dataset (the Omniglot dataset; Lake et al., 2019). (See the sequence-construction sketch after the table.)
Researcher Affiliation | Collaboration | Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya K. Singh (University College London), Pierre H. Richemond, James L. McClelland (DeepMind, Stanford University), Felix Hill (DeepMind)
Pseudocode | No | The paper describes its methods and experimental setup in narrative text and diagrams (e.g., Figure 1) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/deepmind/emergent_in_context_learning
Open Datasets | Yes | To investigate the factors that lead to in-context few-shot learning, we created training and evaluation sequences using the Omniglot dataset (Lake et al., 2019; MIT License), a standard image-label dataset for few-shot learning.
Dataset Splits | Yes | We evaluated trained models on two types of sequences, to measure (1) in-context learning and (2) in-weights learning. (See the evaluation-sequence sketch after the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments; it only describes the transformer and recurrent models themselves.
Software Dependencies | No | The paper mentions using a ResNet for image embedding and a causal transformer model, citing relevant papers, but does not provide specific software versions for libraries or frameworks (e.g., Python, PyTorch, TensorFlow) that would be needed for replication.
Experiment Setup | Yes | Unless stated otherwise, we used a transformer with 12 layers and embedding size 64. The model was trained on a softmax cross-entropy loss on the prediction for the final (query) image. (See the query-loss sketch after the table.)
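
As a rough illustration of the training data referenced in the Research Type and Open Datasets rows, the sketch below builds a few-shot sequence of image-label context pairs followed by a query image from Omniglot-style data. The function name, context length, and the `query_repeats` parameter (a stand-in for the paper's "burstiness" manipulation) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def make_fewshot_sequence(images_by_class, rng, context_len=8, query_repeats=3):
    """Sample a context of `context_len` (image, label) pairs plus one query.

    `images_by_class` maps a class id to an array of images for that class.
    `query_repeats` controls how often the query's class appears in the
    context (hypothetical knob standing in for the burstiness manipulation).
    """
    classes = list(images_by_class)
    query_class = rng.choice(classes)

    # Fill the context: `query_repeats` exemplars of the query class,
    # the rest drawn from other classes, then shuffle the order.
    other = [c for c in classes if c != query_class]
    context_classes = ([query_class] * query_repeats +
                       list(rng.choice(other, context_len - query_repeats)))
    rng.shuffle(context_classes)

    context_images = [images_by_class[c][rng.integers(len(images_by_class[c]))]
                      for c in context_classes]
    query_image = images_by_class[query_class][
        rng.integers(len(images_by_class[query_class]))]

    return np.stack(context_images), np.array(context_classes), query_image, query_class

# Example usage with dummy Omniglot-sized (105x105) images:
rng = np.random.default_rng(0)
data = {c: np.zeros((20, 105, 105)) for c in range(16)}
ctx_imgs, ctx_labels, q_img, q_label = make_fewshot_sequence(data, rng)
```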
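The Dataset Splits row distinguishes two evaluation regimes. Continuing the hypothetical sketch above, the snippet below only illustrates that distinction; the paper's exact evaluation-sequence construction may differ.

```python
def make_eval_sequence(images_by_class, rng, mode, **kwargs):
    """Illustrative split between the two evaluation regimes:

    * "in_context": sequences built from holdout classes never seen in
      training, so the query label is resolvable only from the context pairs;
    * "in_weights": sequences whose context excludes the query's class, so
      the query is answerable only from information stored in the weights.
    """
    if mode == "in_context":
        # `images_by_class` is assumed to hold held-out classes here.
        return make_fewshot_sequence(images_by_class, rng, **kwargs)
    elif mode == "in_weights":
        # Trained classes, but the context carries no query-class exemplars.
        return make_fewshot_sequence(images_by_class, rng, query_repeats=0, **kwargs)
    raise ValueError(f"unknown mode: {mode}")
```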
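The Experiment Setup row quotes a loss computed only on the prediction for the final (query) image. A minimal NumPy sketch of such a query-only softmax cross-entropy, with illustrative shapes and names:

```python
import numpy as np

def query_cross_entropy(logits, labels):
    """logits: [batch, seq_len, num_classes]; labels: [batch, seq_len].

    Only the last position (the query) contributes to the loss.
    """
    query_logits = logits[:, -1, :]   # [batch, num_classes]
    query_labels = labels[:, -1]      # [batch]

    # Numerically stable log-softmax over the class dimension.
    shifted = query_logits - query_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of the true query label, averaged over the batch.
    nll = -log_probs[np.arange(len(query_labels)), query_labels]
    return nll.mean()
```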