An Explanation of In-context Learning as Implicit Bayesian Inference

Authors: Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning. From Section 4 (Simulations): We generate the GINC dataset and show that Transformers (Vaswani et al., 2017) and LSTMs (Hochreiter & Schmidhuber, 1997) trained on GINC exhibit in-context learning.
Researcher Affiliation | Academia | Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma, Stanford University, {xie,aditir,pliang,tengyuma}@cs.stanford.edu
Pseudocode | No | The paper does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | The code, data, and experiments are located on GitHub and CodaLab.
Open Datasets | Yes | The code, data, and experiments are located on GitHub and CodaLab. GINC dataset: We construct the GINC dataset according to our theory (see Appendix F.1).
Dataset Splits | Yes | The dataset contains 1000 training documents and 100 validation documents, where training documents have 10240 tokens and validation documents have 1024 tokens. (See the split sketch after this table.)
Hardware Specification | Yes | Models were trained and evaluated mainly on Titan Xp GPUs using 16-bit precision.
Software Dependencies | No | The paper mentions using the Hugging Face library and the AdamW optimizer but does not provide specific version numbers for these or for other libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | Our Transformer models are based on the GPT-2 architectures with 4, 12, and 16 layers respectively, with 12 attention heads, 768-dimensional embeddings, residual/embedding/attention dropout set to 0.1, and a context window of 1024. We train for 5 epochs using the AdamW optimizer... with a batch size of 8 and a linear learning rate schedule (with 1000-step warmup) up to a learning rate of 8e-4... (See the configuration sketch after this table.)
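
To make the "Dataset Splits" row concrete, here is a minimal sketch of the stated split sizes. The `GincSplit` and `chunk_document` names are hypothetical, and the non-overlapping chunking of each document into 1024-token context windows is an assumption about preprocessing, not the authors' released pipeline.

```python
# Hypothetical sketch of the GINC split sizes quoted in the "Dataset Splits" row.
# Names and the chunking strategy are assumptions, not the released code.
from dataclasses import dataclass
from typing import List


@dataclass
class GincSplit:
    n_documents: int
    tokens_per_document: int


TRAIN = GincSplit(n_documents=1000, tokens_per_document=10240)
VAL = GincSplit(n_documents=100, tokens_per_document=1024)


def chunk_document(token_ids: List[int], context_window: int = 1024) -> List[List[int]]:
    """Split one document into non-overlapping context-window-sized segments.

    Under this assumed preprocessing, each 10240-token training document
    yields 10 segments of 1024 tokens; each validation document is exactly
    one segment.
    """
    return [token_ids[i:i + context_window]
            for i in range(0, len(token_ids), context_window)]
```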
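
The "Experiment Setup" row maps directly onto a GPT-2 configuration in the Hugging Face library the paper mentions. Below is a minimal sketch of that configuration, assuming the standard Hugging Face Transformers and PyTorch APIs; the vocabulary size and total step count are placeholders (training runs for 5 epochs with batch size 8 on GINC), and this is not the authors' released training code.

```python
# Minimal sketch of the training configuration described in the "Experiment
# Setup" row. Layer count, heads, embedding size, dropout, context window,
# optimizer, warmup, and learning rate follow the quoted values; VOCAB_SIZE
# and TOTAL_STEPS are placeholders.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_linear_schedule_with_warmup

VOCAB_SIZE = 64        # assumption: GINC uses a small synthetic vocabulary
N_LAYERS = 12          # the paper reports 4-, 12-, and 16-layer variants
TOTAL_STEPS = 10_000   # placeholder; determined by dataset size, 5 epochs, batch size 8

config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=1024,   # context window of 1024
    n_embd=768,         # 768-dimensional embeddings
    n_layer=N_LAYERS,
    n_head=12,          # 12 attention heads
    resid_pdrop=0.1,    # residual dropout
    embd_pdrop=0.1,     # embedding dropout
    attn_pdrop=0.1,     # attention dropout
)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,          # linear warmup over 1000 steps
    num_training_steps=TOTAL_STEPS,
)

# Training would then loop over the GINC training documents for 5 epochs with
# batch size 8, stepping the scheduler after each optimizer step. The paper's
# 16-bit precision could be reproduced with torch.cuda.amp or the Trainer's
# fp16 option.
```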