Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective

Authors: Fabian Falck, Ziyu Wang, Christopher C. Holmes

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In three experiments, we provide evidence for violations of the martingale property, and deviations from a Bayesian scaling behaviour of uncertainty, falsifying the hypothesis that ICL is Bayesian.
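The martingale property the paper tests says that, for any Bayesian (exchangeable) model, the one-step predictive probability must be invariant in expectation when the context is extended by one sample drawn from that same predictive. The toy sketch below (my illustration, not the paper's code) verifies this exactly for a Beta-Bernoulli model; the paper's experiments apply the analogous check to an LLM's predictive probabilities, where violations would falsify Bayesian ICL.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_bernoulli_pred(z, a=1.0, b=1.0):
    """One-step-ahead predictive P(z_next = 1 | z) under a Beta(a, b) prior."""
    z = np.asarray(z)
    return (a + z.sum()) / (a + b + len(z))

# Martingale property: averaging p(1 | z_{1:n+1}) over z_{n+1} ~ p(. | z_{1:n})
# must recover p(1 | z_{1:n}). This holds exactly for any Bayesian model.
z = rng.integers(0, 2, size=20)               # an observed context
p_now = beta_bernoulli_pred(z)                # current predictive
p_if_1 = beta_bernoulli_pred(np.append(z, 1)) # predictive after seeing a 1
p_if_0 = beta_bernoulli_pred(np.append(z, 0)) # predictive after seeing a 0
p_avg = p_now * p_if_1 + (1 - p_now) * p_if_0

assert np.isclose(p_avg, p_now)
```

Swapping `beta_bernoulli_pred` for an LLM's in-context predictive probabilities turns this identity into the falsifiable test the paper runs: a statistically significant gap between `p_avg` and `p_now` is a martingale violation.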
Researcher Affiliation | Academia | Department of Statistics, University of Oxford, Oxford, UK. Correspondence to: Fabian Falck <fabian.falck@stats.ox.ac.uk>, Ziyu Wang <ziyu.wang@stats.ox.ac.uk>, Chris Holmes <cholmes@stats.ox.ac.uk>.
Pseudocode | No | The paper presents theoretical propositions and corollaries, and describes experimental procedures in prose, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We provide our code base on https://github.com/meta-inf/bayes_icl.
Open Datasets | No | The paper uses three types of synthetic datasets (Bernoulli, Gaussian, and a synthetic natural language experiment). It describes how these datasets are generated but does not provide a public link, DOI, or specific citation for accessing them.
Dataset Splits | No | The paper specifies parameters for synthetic data generation (e.g., n samples, m length of paths) and compares LLM behavior to a reference Bayesian model using bootstrap confidence intervals. However, it does not describe traditional training, validation, and test dataset splits for the LLMs themselves, as the experiments probe the behavior of pre-trained LLMs via in-context learning.
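The bootstrap confidence intervals mentioned above can be sketched as a standard percentile bootstrap. This is a generic illustration under my own assumptions (function name and inputs are hypothetical); the paper does not specify its exact bootstrap procedure.

```python
import numpy as np

def bootstrap_ci(samples, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a statistic of i.i.d. samples.

    `samples` could be, e.g., per-path statistics from J LLM sampling
    paths (illustrative; not the paper's exact pipeline).
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    # Resample with replacement, n_boot times, and recompute the statistic.
    idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
    boot_stats = stat(samples[idx], axis=1)
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: CI for the mean of J = 200 simulated per-path statistics.
paths = np.random.default_rng(1).normal(loc=0.5, scale=0.1, size=200)
lo, hi = bootstrap_ci(paths)
```

Checking whether a reference Bayesian model's value falls inside `(lo, hi)` is one way such intervals support a falsification argument.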
Hardware Specification | Yes | For all Huggingface models, we generated the sampling paths by performing inference on a single NVIDIA A100 GPU for each run.
Software Dependencies | No | The paper lists several software libraries used, including PyTorch, numpy, Huggingface transformers, matplotlib, scikit-learn, and pandas. While it mentions 'Python Library Reference, release 3.8.2' for Python, it generally lacks specific version numbers for the other key software components, which is required for a reproducible description of ancillary software.
Experiment Setup | Yes | We consider three types of synthetic datasets z_{1:n}: Bernoulli: Z_i ∼ Bern(θ), where θ ∈ {0.3, 0.5, 0.7}; Gaussian: Z_i ∼ N(θ, 1), where θ ∈ {−1, 0, 1}; a synthetic natural language experiment... For the first two experiments we vary n ∈ {20, 50, 100}, m ∈ {n/2, 2n} and sample J = 200 paths from the LLMs. For the natural language experiments we fix n = 100, m = 50, J = 80. We use the following LLMs: llama-2-7B, mistral-7B, gpt-3, gpt-3.5, and gpt-4.
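The synthetic-data grid quoted above is simple to reproduce; the sketch below generates one configuration under the stated distributions (the function name and prompt handling are my assumptions, and the actual LLM call is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(kind, theta, n):
    """Synthetic sequences matching the described setup.

    Bernoulli: Z_i ~ Bern(theta), theta in {0.3, 0.5, 0.7}
    Gaussian:  Z_i ~ N(theta, 1),  theta in {-1, 0, 1}
    """
    if kind == "bernoulli":
        return rng.binomial(1, theta, size=n)
    if kind == "gaussian":
        return rng.normal(theta, 1.0, size=n)
    raise ValueError(f"unknown dataset kind: {kind}")

# One configuration from the grid: n in {20, 50, 100}, m in {n/2, 2n}.
n, m, J = 50, 100, 200
context = make_dataset("bernoulli", theta=0.7, n=n)
# Each of the J sampling paths would then be m samples drawn from the LLM
# conditioned on `context` (prompt formatting and model calls omitted).
```

Fixing the generator seed, as done here, is what makes such synthetic experiments reproducible even without a released dataset.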