Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective

Authors: Fabian Falck, Ziyu Wang, Christopher C. Holmes

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In three experiments, we provide evidence for violations of the martingale property, and deviations from a Bayesian scaling behaviour of uncertainty, falsifying the hypothesis that ICL is Bayesian.
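The martingale property the paper tests says that, for any Bayesian (exchangeable) model, the one-step predictive probability must be invariant in expectation when the context is extended by one sample drawn from that same predictive. The toy sketch below (my illustration, not the paper's code) verifies this exactly for a Beta-Bernoulli model; the paper's experiments apply the analogous check to an LLM's predictive probabilities, where violations would falsify Bayesian ICL.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_bernoulli_pred(z, a=1.0, b=1.0):
    """One-step-ahead predictive P(z_next = 1 | z) under a Beta(a, b) prior."""
    z = np.asarray(z)
    return (a + z.sum()) / (a + b + len(z))

# Martingale property: averaging p(1 | z_{1:n+1}) over z_{n+1} ~ p(. | z_{1:n})
# must recover p(1 | z_{1:n}). This holds exactly for any Bayesian model.
z = rng.integers(0, 2, size=20)               # an observed context
p_now = beta_bernoulli_pred(z)                # current predictive
p_if_1 = beta_bernoulli_pred(np.append(z, 1)) # predictive after seeing a 1
p_if_0 = beta_bernoulli_pred(np.append(z, 0)) # predictive after seeing a 0
p_avg = p_now * p_if_1 + (1 - p_now) * p_if_0

assert np.isclose(p_avg, p_now)
```

Swapping `beta_bernoulli_pred` for an LLM's in-context predictive probabilities turns this identity into the falsifiable test the paper runs: a statistically significant gap between `p_avg` and `p_now` is a martingale violation.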
Researcher Affiliation | Academia | Department of Statistics, University of Oxford, Oxford, UK. Correspondence to: Fabian Falck <fabian.falck@stats.ox.ac.uk>, Ziyu Wang <ziyu.wang@stats.ox.ac.uk>, Chris Holmes <cholmes@stats.ox.ac.uk>.
Pseudocode | No | The paper presents theoretical propositions and corollaries, and describes experimental procedures in prose, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We provide our code base on https://github.com/meta-inf/bayes_icl.
Open Datasets | No | The paper uses three types of synthetic datasets (Bernoulli, Gaussian, and a synthetic natural language experiment). It describes how these datasets are generated but does not provide a public link, DOI, or specific citation for accessing them.
Dataset Splits | No | The paper specifies parameters for synthetic data generation (e.g., n samples, m length of paths) and compares LLM behavior to a reference Bayesian model using bootstrap confidence intervals. However, it does not describe traditional training, validation, and test dataset splits for the LLMs themselves, as the experiments probe the behavior of pre-trained LLMs via in-context learning.
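The bootstrap confidence intervals mentioned above can be sketched as a standard percentile bootstrap. This is a generic illustration under my own assumptions (function name and inputs are hypothetical); the paper does not specify its exact bootstrap procedure.

```python
import numpy as np

def bootstrap_ci(samples, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a statistic of i.i.d. samples.

    `samples` could be, e.g., per-path statistics from J LLM sampling
    paths (illustrative; not the paper's exact pipeline).
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    # Resample with replacement, n_boot times, and recompute the statistic.
    idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
    boot_stats = stat(samples[idx], axis=1)
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: CI for the mean of J = 200 simulated per-path statistics.
paths = np.random.default_rng(1).normal(loc=0.5, scale=0.1, size=200)
lo, hi = bootstrap_ci(paths)
```

Checking whether a reference Bayesian model's value falls inside `(lo, hi)` is one way such intervals support a falsification argument.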
Hardware Specification | Yes | For all Huggingface models, we generated the sampling paths by performing inference on a single NVIDIA A100 GPU for each run.
Software Dependencies | No | The paper lists several software libraries used, including PyTorch, numpy, Huggingface transformers, matplotlib, scikit-learn, and pandas. While it mentions 'Python Library Reference, release 3.8.2' for Python, it generally lacks specific version numbers for the other key software components, which is required for a reproducible description of ancillary software.
Experiment Setup | Yes | We consider three types of synthetic datasets z_{1:n}: Bernoulli: Z_i ∼ Bern(θ), where θ ∈ {0.3, 0.5, 0.7}; Gaussian: Z_i ∼ N(θ, 1), where θ ∈ {−1, 0, 1}; a synthetic natural language experiment... For the first two experiments we vary n ∈ {20, 50, 100}, m ∈ {n/2, 2n} and sample J = 200 paths from the LLMs. For the natural language experiments we fix n = 100, m = 50, J = 80. We use the following LLMs: llama-2-7B, mistral-7B, gpt-3, gpt-3.5, and gpt-4.
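The synthetic-data grid quoted above is simple to reproduce; the sketch below generates one configuration under the stated distributions (the function name and prompt handling are my assumptions, and the actual LLM call is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(kind, theta, n):
    """Synthetic sequences matching the described setup.

    Bernoulli: Z_i ~ Bern(theta), theta in {0.3, 0.5, 0.7}
    Gaussian:  Z_i ~ N(theta, 1),  theta in {-1, 0, 1}
    """
    if kind == "bernoulli":
        return rng.binomial(1, theta, size=n)
    if kind == "gaussian":
        return rng.normal(theta, 1.0, size=n)
    raise ValueError(f"unknown dataset kind: {kind}")

# One configuration from the grid: n in {20, 50, 100}, m in {n/2, 2n}.
n, m, J = 50, 100, 200
context = make_dataset("bernoulli", theta=0.7, n=n)
# Each of the J sampling paths would then be m samples drawn from the LLM
# conditioned on `context` (prompt formatting and model calls omitted).
```

Fixing the generator seed, as done here, is what makes such synthetic experiments reproducible even without a released dataset.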