An Explanation of In-context Learning as Implicit Bayesian Inference
Authors: Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning. Section 4 (Simulations): We generate the GINC dataset and show that Transformers (Vaswani et al., 2017) and LSTMs (Hochreiter & Schmidhuber, 1997) trained on GINC exhibit in-context learning. |
| Researcher Affiliation | Academia | Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma Stanford University {xie,aditir,pliang,tengyuma}@cs.stanford.edu |
| Pseudocode | No | The paper does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | The code, data, and experiments are located on GitHub and CodaLab. |
| Open Datasets | Yes | The code, data, and experiments are located on GitHub and CodaLab. GINC dataset. We construct the GINC dataset according to our theory (see Appendix F.1). |
| Dataset Splits | Yes | The dataset contains 1000 training documents and 100 validation documents, where training documents have 10240 tokens and validation documents have 1024 tokens. |
| Hardware Specification | Yes | The hardware was mainly Titan Xp GPUs; models were trained and evaluated using 16-bit precision. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face library' and the 'AdamW optimizer' but does not provide specific version numbers for these software components or any other libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | Our Transformer models are based on the GPT-2 architectures with 4, 12, and 16 layers respectively, with 12 attention heads, 768 dimensional embeddings, residual/embedding/attention dropout set to 0.1, and a context window of 1024. We train for 5 epochs using the AdamW optimizer... with a batch size of 8 and a linear learning rate schedule (with 1000 step warmup) up to a learning rate of 8e-4... |
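
As a reading aid for the setup row above, the following is a minimal sketch of how the reported 12-layer configuration might be assembled with the Hugging Face library and AdamW optimizer that the paper mentions. `GPT2Config` and `TrainingArguments` are standard Hugging Face APIs, but the output directory and the GINC vocabulary size used here are placeholder assumptions, not values reported in this section; this is an illustration, not the authors' released code.

```python
# Illustrative sketch only: reconstructs the reported hyperparameters with the
# Hugging Face library; paths and the vocabulary size are placeholder assumptions.
from transformers import GPT2Config, GPT2LMHeadModel, TrainingArguments

GINC_VOCAB_SIZE = 50  # placeholder; the GINC vocabulary size is not given in this section

# GPT-2-style architecture from the paper: 12 layers (the 4- and 16-layer variants
# differ only in n_layer), 12 heads, 768-dim embeddings, dropout 0.1, context 1024.
config = GPT2Config(
    vocab_size=GINC_VOCAB_SIZE,
    n_layer=12,
    n_head=12,
    n_embd=768,
    n_positions=1024,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)
model = GPT2LMHeadModel(config)

# Optimization settings from the paper: 5 epochs, AdamW (the Trainer default),
# batch size 8, linear schedule with 1000 warmup steps up to lr 8e-4,
# and 16-bit precision as noted in the hardware row.
training_args = TrainingArguments(
    output_dir="ginc-gpt2-12layer",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=8e-4,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,
)
```

Training on the GINC splits described above (1000 training documents of 10240 tokens, 100 validation documents of 1024 tokens) would then proceed through the standard `Trainer` API; the paper's own pipeline may differ in details not reported in this table.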