Data Distributional Properties Drive Emergent In-Context Learning in Transformers
Authors: Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, Felix Hill
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we experimentally manipulated the distributional properties of the training data and measured the effects on in-context few-shot learning. We performed our experiments over data sequences sampled from a standard image-based few-shot dataset (the Omniglot dataset; Lake et al., 2019). |
| Researcher Affiliation | Collaboration | Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang (DeepMind); Aaditya K. Singh (University College London); Pierre H. Richemond (DeepMind); James L. McClelland (DeepMind; Stanford University); Felix Hill (DeepMind) |
| Pseudocode | No | The paper describes its methods and experimental setup in narrative text and diagrams (e.g., Figure 1) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/deepmind/emergent_in_context_learning |
| Open Datasets | Yes | To investigate the factors that lead to in-context few-shot learning, we created training and evaluation sequences using the Omniglot dataset (Lake et al., 2019, MIT License), a standard image-label dataset for few-shot learning. |
| Dataset Splits | Yes | We evaluated trained models on two types of sequences, to measure (1) in-context learning and (2) in-weights learning. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments (e.g., GPU models, CPU types, memory). It only describes the model architectures (transformer and recurrent models). |
| Software Dependencies | No | The paper mentions using a ResNet for image embedding and a causal transformer model, citing relevant papers, but does not provide specific software version numbers for libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication. |
| Experiment Setup | Yes | Unless stated otherwise, we used a transformer with 12 layers and embedding size 64. The model was trained on a softmax cross-entropy loss on the prediction for the final (query) image. |
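The experiment-setup row notes that the loss is softmax cross-entropy computed only on the prediction for the final (query) image, rather than on every position in the sequence. As a minimal sketch of that detail, the following NumPy function (the name `query_cross_entropy` and the tensor shapes are illustrative assumptions, not taken from the paper's released code) restricts the loss to the last sequence position:

```python
import numpy as np

def query_cross_entropy(logits, labels):
    """Softmax cross-entropy on the final (query) position only — an
    illustrative sketch, not the paper's actual implementation.

    logits: (batch, seq_len, num_classes) transformer outputs
    labels: (batch,) integer class labels for the query image
    """
    # Keep only the final position; context positions carry no loss.
    query_logits = logits[:, -1, :]
    # Numerically stable log-softmax.
    shifted = query_logits - query_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the correct label.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform logits over `num_classes` classes, the function returns `log(num_classes)`, which is a quick sanity check that the softmax normalization is correct.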