Meta-in-context learning in large language models

Authors: Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, Eric Schulz

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the present paper, we demonstrate that the in-context learning abilities of large language models can be recursively improved via in-context learning itself. We coin this phenomenon meta-in-context learning. Looking at two idealized domains, a one-dimensional regression task and a two-armed bandit task, we show that meta-in-context learning adaptively reshapes a large language model's priors over expected tasks. Furthermore, we find that meta-in-context learning modifies the in-context learning strategies of such models. Finally, we broaden the scope of our investigation to encompass two diverse benchmarks: one focusing on real-world regression problems and the other encompassing multiple NLP tasks. In both cases, we observe competitive performance comparable to that of traditional learning algorithms.
Researcher Affiliation | Collaboration | Julian Coda-Forno1,2, Marcel Binz1, Zeynep Akata2, Matthew Botvinick3, Jane X. Wang3, Eric Schulz1; 1Max Planck Institute for Biological Cybernetics, Tübingen, Germany; 2University of Tübingen, Tübingen, Germany; 3Google DeepMind, London, United Kingdom
Pseudocode | No | The paper describes experimental procedures and prompts in text and with diagrams (e.g., Figure 1 provides a high-level overview), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper mentions using the 'public Open AI Python API [22]' (platform.openai.com) and refers to open-source models such as Falcon-40b [23], Llama-2 [24], and mpt-30b [25], but it provides no link to, or explicit statement about the availability of, the authors' own source code for the described methodology and experiments.
Open Datasets | Yes | We considered a multi-dimensional regression benchmark which consists of 60 different real-world datasets introduced in [28]. Finally, we examined whether meta-in-context learning also improves upon in-context-learning on standard natural language processing tasks. To test this, we conducted an experiment on the Massive Multitask Language Understanding (MMLU) benchmark [29].
Dataset Splits | No | The paper does not provide specific percentages or sample counts for conventional train/validation/test splits. It describes how data points were sampled and inserted into prompts for in-context learning, but not a fixed partitioning of a dataset into training, validation, and test sets for model training.
Hardware Specification | No | The paper states, 'We used the public Open AI Python API [22] to run all our simulations. This API provides access to several LLMs from the Generative Pre-trained Transformer (GPT) family. We ran all our simulations on the TEXT-DAVINCI-002 model, which is also known as GPT-3.' This indicates that the experiments were executed remotely through an API; no local hardware components (such as specific GPU/CPU models or memory) are specified. A minimal sketch of such an API call follows the table.
Software Dependencies | No | The paper mentions the 'public Open AI Python API [22]' and the 'TEXT-DAVINCI-002 model', which are hosted services and models rather than versioned software dependencies. Although statsmodels [32] is cited, no version number is given for it, and no other libraries or programming-language versions are listed, so the authors' exact code environment cannot be reconstructed.
Experiment Setup | Yes | We set the temperature parameter to zero (leading to deterministic responses) unless otherwise noted and retained the default values for all other parameters. Inputs x_t were sampled from U(0, 100) and the trial-specific additive noise ε_t was sampled from N(0, 1). For each task, we sampled independent mean rewards for each machine from N(0, 64). The actually obtained reward is generated using the mean reward that corresponds to the chosen machine plus some additive Gaussian noise sampled from N(0, 32). We used five features for all tasks. ... To maintain a consistent regression loss across all tasks, we normalized both the feature and target spaces to the interval of [-1, 1]. For the in-context learning simulations, we provided the model with k ∈ {0, 1, 2} examples from the same category before prompting it on the test question. For the meta-in-context learning simulations, we additionally prepended three examples of two tasks from different categories. Minimal illustrative sketches of this setup follow the table.
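
To make the Hardware Specification row concrete: the experiments were executed remotely against TEXT-DAVINCI-002 through the OpenAI API, so reproduction requires an API key rather than local hardware. The snippet below is a minimal sketch of such a query using the legacy (pre-1.0) openai Python package; the prompt text and max_tokens value are illustrative assumptions, and TEXT-DAVINCI-002 has since been retired from the API, so a currently available completion model would need to be substituted.

    import os
    import openai  # legacy (pre-1.0) OpenAI Python client

    openai.api_key = os.environ["OPENAI_API_KEY"]

    def query_llm(prompt: str, max_tokens: int = 8) -> str:
        """Send a single completion request at temperature 0 (deterministic
        responses), mirroring the setup quoted in the Experiment Setup row."""
        response = openai.Completion.create(
            model="text-davinci-002",  # retired; substitute an available model
            prompt=prompt,
            temperature=0,             # paper: temperature set to zero
            max_tokens=max_tokens,     # assumption: value not stated in the quote
        )
        return response["choices"][0]["text"].strip()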
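
The distributions quoted in the Experiment Setup row translate into a small task generator. The sketch below is a hedged reading of that setup: the linear form of the regression function and its coefficient distribution are assumptions (the quoted excerpt only fixes the input range U(0, 100) and the noise N(0, 1)), and the second argument of N(·, ·) is treated here as a variance, although the paper's notation could also denote a standard deviation.

    import numpy as np

    rng = np.random.default_rng(0)

    def regression_task(n_trials: int = 5):
        """One-dimensional regression: x_t ~ U(0, 100), y_t = f(x_t) + eps_t,
        eps_t ~ N(0, 1). The linear f is an illustrative assumption."""
        slope, intercept = rng.normal(0, 1, size=2)   # assumed generative parameters
        x = rng.uniform(0, 100, size=n_trials)        # inputs x_t ~ U(0, 100)
        y = slope * x + intercept + rng.normal(0, 1, size=n_trials)
        return x, y

    def bandit_task(n_arms: int = 2):
        """Two-armed bandit: mean rewards drawn from N(0, 64); each pull adds
        N(0, 32) observation noise (both treated as variances here)."""
        means = rng.normal(0, np.sqrt(64), size=n_arms)

        def pull(arm: int) -> float:
            return float(means[arm] + rng.normal(0, np.sqrt(32)))

        return means, pull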
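
The meta-in-context learning procedure itself, as described in the Experiment Setup row (prepending full transcripts of earlier tasks before the current task's in-context examples), reduces to prompt assembly. The wording below is an illustrative assumption rather than the paper's verbatim prompt; only the structure, a few fully observed previous tasks followed by the partially observed current task, follows the quoted description.

    def task_transcript(xs, ys, query_x=None) -> str:
        """Render one task as observed (input, output) pairs, optionally ending
        with an unanswered query. The phrasing is an assumption."""
        lines = [f"Input: {x:.1f}, Output: {y:.2f}" for x, y in zip(xs, ys)]
        if query_x is not None:
            lines.append(f"Input: {query_x:.1f}, Output:")
        return "\n".join(lines)

    def meta_in_context_prompt(previous_tasks, current_xs, current_ys, query_x) -> str:
        """Prepend transcripts of earlier tasks before the current task's
        examples, so in-context learning on the current task is conditioned on
        previously completed tasks (meta-in-context learning)."""
        parts = [f"Task {i}:\n{task_transcript(xs, ys)}"
                 for i, (xs, ys) in enumerate(previous_tasks, start=1)]
        parts.append("Current task:\n" + task_transcript(current_xs, current_ys, query_x))
        return "\n\n".join(parts)

    # Illustrative numbers only: two previous tasks of three observations each,
    # then a current task with k = 2 examples before the test query.
    prev = [([12.0, 57.3, 88.1], [0.9, 2.1, 3.0]),
            ([5.4, 43.2, 70.6], [-1.2, 0.4, 1.5])]
    prompt = meta_in_context_prompt(prev, [10.0, 64.5], [0.3, 1.8], query_x=91.2)
    print(prompt)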